Chapter 16 - Heuristics for Improving Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

16.1 Early Stopping

Figure 14.9 shows that generalization performance can vary with time during training. When the network is underconstrained, the generalization error may reach a minimum but then increase as the network fits peculiarities of the training set that are not characteristic of the target function. One approach to avoiding overfitting is to monitor the generalization error and stop training when the minimum is observed. The generalization error is commonly estimated by simple cross-validation with a holdout set, although more sophisticated estimates may be used. In [243], [371], [337], the generalization ability of the network is estimated from its pre- and post-training performance on previously unseen training data. Early stopping is compared to a number of other nonconvergent training techniques by Finnoff, Hergert, and Zimmermann [123], [122]. A practical advantage of early stopping is that it is often faster than training to complete convergence followed by pruning.
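As a concrete illustration, the following Python sketch monitors the error on a holdout set after each training epoch and stops as soon as that error rises, returning the best epoch observed. The names train_one_epoch, holdout_error, and max_epochs are our own illustrative choices, not part of the text; the two callables are assumed to be supplied by the caller.

def train_with_early_stopping(train_one_epoch, holdout_error, max_epochs):
    """Stop training when the holdout (validation) error first increases.

    train_one_epoch: callable performing one pass over the training set.
    holdout_error:   callable returning the current error on the holdout set.
    Returns the epoch index and value of the smallest holdout error seen.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = holdout_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
        else:
            # Holdout error has started to rise: assume the minimum has passed.
            break
    return best_epoch, best_error

Stopping at the first increase is the simplest possible rule; as discussed next, the validation curve is often noisy, so in practice one usually trains further and keeps the best weights seen so far.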

Although early stopping can be effective, some care is needed in deciding when to stop. As noted in section 14.5.3, the validation error surface may have local minima that could fool simple algorithms into stopping too soon [16]. The generalization vs. time curve may also have long flat regions preceding a steep drop-off [16]. It should also be noted that figure 14.9 represents an idealized situation; the training curves are often noisy and may need to be filtered. A simple way to avoid many of these problems is to train until the network is clearly overfitting, retaining the best set of weights observed along the trajectory.
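A more robust variant, sketched below under the same assumptions (the helper names are again illustrative), keeps training until the network is clearly overfitting: it snapshots the weights whenever a new minimum of the raw validation error is seen, smooths the noisy validation curve with a short moving average, and stops only after the smoothed error has failed to improve for a fixed number of epochs (a "patience" parameter).

import copy

def train_keep_best(train_one_epoch, holdout_error, get_weights,
                    max_epochs=1000, patience=50, window=5):
    """Train past the apparent minimum and return the best weights seen.

    get_weights: callable returning the network's current weights.
    patience:    epochs to continue after the last improvement of the
                 smoothed validation error before giving up.
    window:      length of the moving average used to smooth the curve.
    """
    history = []                       # raw validation errors
    best_raw = float("inf")
    best_smoothed = float("inf")
    best_weights = copy.deepcopy(get_weights())
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()
        err = holdout_error()
        history.append(err)

        # Always snapshot the weights with the lowest raw validation error.
        if err < best_raw:
            best_raw = err
            best_weights = copy.deepcopy(get_weights())

        # Smooth the curve before deciding whether progress has stalled.
        recent = history[-window:]
        smoothed = sum(recent) / len(recent)
        if smoothed < best_smoothed:
            best_smoothed = smoothed
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                  # clearly overfitting, or no longer improving

    return best_weights, best_raw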

Although early stopping helps prevent overfitting, the results apply only to the chosen network. To achieve the best possible generalization, it is still necessary to test other network configurations, and additional criteria will probably be needed to choose among them. The fact that overtraining is not observed in one training trial does not mean that it will not occur in another, nor is it proof that a suitable network size has been selected.
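One simple way to compare configurations is sketched below: each candidate hidden-layer size is trained with early stopping and the network with the lowest validation error is retained. The functions build_network and train_and_validate are hypothetical placeholders for the routines above, and choosing purely by validation error is only one possibility; the validation error of the winner is an optimistic estimate, so a separate test set is still needed for an unbiased assessment.

def select_architecture(candidate_sizes, build_network, train_and_validate):
    """Train one network per candidate size; keep the lowest validation error.

    build_network(n_hidden):     returns a fresh, randomly initialized network.
    train_and_validate(network): trains with early stopping and returns
                                 (trained_network, validation_error).
    """
    best = None
    best_error = float("inf")
    for n_hidden in candidate_sizes:
        net = build_network(n_hidden)
        trained, val_error = train_and_validate(net)
        if val_error < best_error:
            best, best_error = trained, val_error
    return best, best_error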

It can be argued that part of the reason for the relative success of back-propagation with early stopping is that it has a built-in bias for simple solutions: when initialized with small weights, the network follows a path of increasing complexity, from nearly constant functions to linear functions to increasingly nonlinear functions. Training is normally stopped as soon as some nonzero error criterion is met, so the algorithm is more likely to find a simple solution than a more complex solution that gives the same result. Cross-validation is a method for comparing solutions, but stopping when the validation error is at its minimum takes advantage of these special dynamics. It may be less effective for systems initialized with large weights or for second-order algorithms that make large weight changes at each iteration.
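The claim that small initial weights put the network in a nearly linear regime can be checked numerically: for a tanh hidden unit, tanh(wx) is close to wx when |wx| is small, so a network initialized with small weights initially computes an almost linear function of its inputs. The short NumPy check below is only an illustration of that observation, with arbitrarily chosen weight values.

import numpy as np

x = np.linspace(-1.0, 1.0, 201)          # inputs in a typical normalized range

for w in (0.05, 0.5, 5.0):               # small, moderate, and large weight
    linear = w * x
    unit_output = np.tanh(w * x)
    max_gap = np.max(np.abs(unit_output - linear))
    print(f"w = {w:4.2f}: max |tanh(wx) - wx| = {max_gap:.4f}")

# For w = 0.05 the gap is negligible (the unit is effectively linear);
# for w = 5.0 the unit saturates and is strongly nonlinear.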