The focus in this chapter has been on training speed.
Generalization is a different issue, and the fastest training method will not
always give the best generalization. At best, speed of learning and quality of
generalization are orthogonal issues, completely independent of one another,
and a fast training method would achieve the same generalization as another
method, except that it would get there faster. In
the best case, a fast training method will simply arrive sooner at the point
where cross-validation says training should stop. Of course, if no specific
steps are taken to ensure good generalization, then a fast method might
generalize worse than a slower method as it may have more chance to overfit in
the same amount of time.
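For concreteness, the following is a minimal sketch of the kind of early-stopping loop implied above, in which training halts once the error on held-out data stops improving. The model interface (train_one_epoch, error, get_weights, set_weights) and the patience parameter are illustrative assumptions, not something prescribed in the text; the point is only that a faster optimizer reaches the stopping point in fewer epochs, not that it stops at a different point.

```python
import numpy as np

def train_with_early_stopping(model, train_data, val_data,
                              max_epochs=1000, patience=10):
    """Illustrative early stopping: halt when the held-out error has not
    improved for `patience` consecutive epochs, then restore the best
    weights seen so far. The model API here is assumed, not standard."""
    best_error = np.inf
    best_weights = model.get_weights()
    epochs_since_best = 0

    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)     # one pass of the (fast or slow) optimizer
        val_error = model.error(val_data)     # error on held-out data

        if val_error < best_error:            # still improving on unseen data
            best_error = val_error
            best_weights = model.get_weights()
            epochs_since_best = 0
        else:                                 # no improvement this epoch
            epochs_since_best += 1
            if epochs_since_best >= patience: # assume overfitting has begun
                break

    model.set_weights(best_weights)           # keep the best generalizing weights
    return model
```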
There have been suggestions that some of the faster methods
generalize worse than slower methods [8, 9], but this has not been studied
much. There is some reason to expect techniques that take long steps (e.g.,
Newton's method) to generalize less well: a single large step may carry the
weights well past the point where overfitting begins, before cross-validation
on a test set has a chance to detect it. This does not have to occur, however,
and it can be addressed by methods
such as weight decay, pruning, and regularization penalty terms.
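As a reminder of how a regularization penalty term operates, a common choice (not tied to any particular training method discussed here) is simple weight decay, which adds a quadratic penalty on the weights to the error being minimized. The sketch below shows one gradient-descent step on the penalized error; the learning rate and decay coefficient are arbitrary illustrative values.

```python
import numpy as np

def weight_decay_step(weights, error_gradient, lr=0.01, decay=1e-4):
    """One gradient step on the penalized error
    E'(w) = E(w) + (decay / 2) * sum(w ** 2).
    The penalty's gradient adds decay * w to the ordinary error gradient,
    shrinking every weight toward zero on each step."""
    return weights - lr * (error_gradient + decay * weights)

# Example: even where the error gradient is zero, the penalty still
# pulls large weights slightly toward zero.
w = np.array([2.0, -3.0, 0.5])
g = np.zeros_like(w)
print(weight_decay_step(w, g))   # each weight shrinks by a factor (1 - lr * decay)
```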