The focus in this chapter has been on training speed. Generalization is a different issue, and the fastest training method will not always give the best generalization. At best, speed of learning and quality of generalization are orthogonal, completely independent, issues: a fast training method would achieve the same generalization as another method, just sooner, arriving earlier at the point where cross-validation says training should stop. If no specific steps are taken to ensure good generalization, however, a fast method may generalize worse than a slower one, because in the same amount of training time it has more opportunity to overfit.
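As a concrete illustration of stopping at the point cross-validation indicates, the following is a minimal sketch of early stopping against a held-out set. The linear model, synthetic data, learning rate, and patience threshold are all illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic regression task: noisy linear target.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(10)
lr = 0.01                                  # illustrative setting
best_loss, best_w = np.inf, w.copy()
patience, bad_epochs = 10, 0               # illustrative setting

for epoch in range(1000):
    # One batch gradient step on the training set (mean squared error).
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad

    # Early stopping: monitor held-out loss, remember the best weights.
    val_loss = np.mean((X_va @ w - y_va) ** 2)
    if val_loss < best_loss:
        best_loss, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # no improvement for a while
            break

w = best_w  # restore the weights with the lowest validation loss
```

A faster optimizer changes only the update step inside the loop; the stopping criterion, and hence the generalization behavior it enforces, stays the same.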
There have been suggestions that some of the faster methods generalize worse than slower methods [8, 9], but this has not been studied extensively. There is some reason to expect techniques that take large steps (e.g., Newton's method) to generalize less well, because a single step may carry the network well past the point of overfitting before cross-validation on a test set can detect it. This is not inevitable, however, and it can be addressed by methods such as weight decay, pruning, and regularization penalty terms.
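Weight decay is the simplest of these to show in code. Below is a minimal sketch, assuming a mean-squared-error loss on a single weight vector; the function name and the penalty strength lam are illustrative choices, not from the text.

```python
import numpy as np

def sgd_step_with_weight_decay(w, X, y, lr=0.01, lam=1e-3):
    """One gradient step on mean squared error with an L2 penalty.

    The penalty lam * ||w||^2 adds 2 * lam * w to the gradient,
    pulling the weights back toward zero on every step.
    (lam is a hypothetical setting; it must be tuned per problem.)
    """
    grad = 2 * X.T @ (X @ w - y) / len(y)   # MSE gradient
    grad += 2 * lam * w                      # weight-decay term
    return w - lr * grad
```

Because the decay term always points back toward the origin, it partially undoes large weight growth, which is how it counteracts an optimizer that steps deep into the overfitting regime.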
Most of the methods described here are intended for batch-mode training. Chen and Mars [76] describe an adaptive stepsize algorithm said to be suitable for on-line training, although modifications may be required and tuning may be difficult [9].
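Chen and Mars's algorithm is not reproduced here; the sketch below only illustrates the general idea of a per-weight adaptive stepsize applied on-line, using a simplified sign-agreement rule in the spirit of delta-bar-delta. All names and constants are illustrative assumptions.

```python
import numpy as np

def online_adaptive_sgd(X, y, epochs=20, eta0=0.01, up=1.05, down=0.5):
    """On-line gradient descent with per-weight adaptive step sizes.

    A generic sign-agreement heuristic, NOT the Chen and Mars [76]
    algorithm: a weight's step size grows when successive gradients
    agree in sign and shrinks when they disagree.
    """
    w = np.zeros(X.shape[1])
    eta = np.full_like(w, eta0)          # one step size per weight
    prev_grad = np.zeros_like(w)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):       # one pattern at a time (on-line)
            grad = 2 * (x_i @ w - y_i) * x_i
            eta = np.where(grad * prev_grad > 0, eta * up, eta)
            eta = np.where(grad * prev_grad < 0, eta * down, eta)
            w -= eta * grad
            prev_grad = grad
    return w
```

On noisy per-pattern gradients, sign flips occur frequently even near a good solution, so the step sizes can collapse; this is one reason such rules tend to need modification and careful tuning in on-line mode, as noted above.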