16.1 Early Stopping
Figure 14.9 shows that generalization performance can vary
with time during training. When the network is underconstrained, the
generalization error may reach a minimum but then increase as the network fits
peculiarities of the training set that are not characteristic of the target
function. One approach to avoid overfitting is to monitor the generalization
error and stop training when the minimum is observed. The generalization error
is commonly estimated by simple cross-validation with a holdout set, although
more sophisticated estimates may be used. In [243], [371], [337], the generalization ability of the network is estimated
based on its pre- and post-training performance on previously unseen training
data. Early stopping is compared to a number of other nonconvergent training
techniques by Finnoff, Hergert, and Zimmermann [123], [122]. A practical advantage of early stopping is that it is
often faster than training to complete convergence followed by pruning.
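To make the procedure concrete, the following is a minimal sketch of early stopping against a holdout set, using a small one-hidden-layer tanh network trained by batch gradient descent on synthetic data; the network size, learning rate, and stopping threshold are illustrative assumptions rather than prescriptions from the text.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic target: noisy sine, split into a training set and a holdout set.
x = rng.uniform(-1.0, 1.0, size=(80, 1))
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal((80, 1))
x_tr, y_tr, x_va, y_va = x[:60], y[:60], x[60:], y[60:]

H = 20                                    # hidden units (deliberately generous)
W1 = 0.1 * rng.standard_normal((1, H))    # small initial weights
b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, 1))
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def mse(x, y):
    return float(np.mean((forward(x)[1] - y) ** 2))

lr, best_va, best_weights = 0.05, np.inf, None
for epoch in range(5000):
    # One batch gradient-descent step on the training-set mean-squared error.
    h, out = forward(x_tr)
    err = out - y_tr
    gW2 = h.T @ err / len(x_tr)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = x_tr.T @ dh / len(x_tr)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

    # Monitor the holdout error and remember the best point on the curve.
    va = mse(x_va, y_va)
    if va < best_va:
        best_va, best_weights = va, (W1.copy(), b1.copy(), W2.copy(), b2.copy())
    elif va > 1.05 * best_va:             # stop once the holdout error has clearly risen
        break

W1, b1, W2, b2 = best_weights             # keep the early-stopped weights
print(f"stopped at epoch {epoch}, holdout MSE {best_va:.4f}")

The 5 percent tolerance is only a crude guard against small fluctuations in the holdout error; as discussed below, a more patient criterion is usually preferable.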
Although early stopping can be effective, some care is needed in
deciding when to stop. As noted in section 14.5.3, the validation error surface may have local
minima that could fool simple algorithms into stopping too soon [16]. The generalization vs. time curve may also have long flat regions
preceding a steep drop-off [16]. Figure 14.9 also depicts an idealized situation; in practice, the training
curves are often noisy and may need to be filtered. A simple way to avoid many
of these problems is to train until the network is clearly overfitting,
retaining the best set of weights observed along the trajectory.
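One way to implement this, sketched below under the assumption of a generic training loop, is a small tracker that retains the best weights seen so far and signals a stop only after the validation error has failed to improve for a fixed number of epochs, so that a local minimum, a flat region, or noise does not trigger a premature stop; the class name and patience value are illustrative.

import copy

class BestWeightTracker:
    """Retain the best weights observed along the trajectory and stop only
    after the validation error has failed to improve for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_error = float("inf")
        self.best_weights = None
        self.epochs_since_best = 0

    def update(self, val_error, weights):
        if val_error < self.best_error:
            self.best_error = val_error
            self.best_weights = copy.deepcopy(weights)
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience   # True means stop

# Usage with a short, noisy validation curve (illustrative numbers only):
tracker = BestWeightTracker(patience=3)
for epoch, err in enumerate([0.9, 0.5, 0.45, 0.47, 0.44, 0.46, 0.48, 0.50]):
    if tracker.update(err, weights={"epoch": epoch}):
        break
print(tracker.best_error, tracker.best_weights)   # 0.44, weights from epoch 4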
Although early stopping helps prevent overfitting, the results
apply only to the chosen network. To achieve the best possible generalization,
it is still necessary to test other network configurations, and additional
criteria will probably be needed to choose among them. The fact that overtraining
is not observed in one training trial does not mean that it will not occur in
another, nor is it proof that a suitable network size has been selected.
It can be argued that part of the reason for the relative
success of back-propagation with early stopping is that it has a built-in bias
for simple solutions because, when initialized with small weights, the network
follows a path of increasing complexity from nearly constant functions to linear
functions to increasingly nonlinear functions. Training is normally stopped as
soon as some nonzero error criterion is met, so the algorithm is more likely to
find a simple solution than a more complex solution that gives the same result.
Cross-validation is a method for comparing solutions, but stopping when the validation error is at its minimum takes
advantage of these special dynamics. It may be less effective for systems
initialized with large weights or for second-order algorithms that make large weight
changes at each iteration.
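This bias can be illustrated numerically: scaling the weights of a random tanh network and measuring how far its output departs from the best straight-line fit shows an essentially linear function at small weight scales and an increasingly nonlinear one as the weights grow. The scales and network size in the sketch below are arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
W1 = rng.standard_normal((1, 50))     # fixed random hidden-layer directions
W2 = rng.standard_normal((50, 1))     # fixed random output weights

for scale in (0.01, 0.1, 1.0, 10.0):
    out = (np.tanh(x[:, None] @ (scale * W1)) @ (scale * W2)).ravel()
    # Residual of the best linear fit measures how nonlinear the function is.
    coeffs = np.polyfit(x, out, deg=1)
    residual = np.sqrt(np.mean((out - np.polyval(coeffs, x)) ** 2))
    print(f"weight scale {scale:5.2f}: rms deviation from a linear fit = {residual:.4f}")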