
Chapter 14 - Factors Influencing Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

14.5 The Learning Algorithm

14.5.1 Limitations of the Learning Algorithm

The size and structure of the network put an upper bound on its representational ability. Limitations of the learning algorithm, however, may prevent that potential from being realized. Some techniques predict generalization performance based on static network properties such as size and assume the learning algorithm will be powerful enough to find a solution if one exists. Of course, most learning algorithms are not perfect and may fail to find some solutions in any reasonable amount of time. Problems such as local minima and speed of convergence must be considered in practice.

14.5.2 Bias of the Learning Algorithm

All learning algorithms (except perhaps pure exhaustive search over an unlimited solution space) have biases. In many cases, the bias is introduced in the form of heuristics that are not rigorously justifiable but that seem to help in practice. When the bias agrees with the problem, it aids generalization by favoring likely solutions over unlikely ones. Of course, every heuristic can fail; if the bias is wrong, performance may suffer. Tailoring the bias to "agree with reality" is one of the most common ways of introducing the external constraints necessary for good generalization. Many of the techniques discussed later are simply different ways of doing this.

It could be argued that part of the reason for the success of back-propagation on many problems is that it has a built-in bias for simple solutions. When initialized with small weights, the function computed by the network follows a path of increasing complexity from nearly constant functions to linear functions to more and more nonlinear functions as training continues. Because training is normally stopped as soon as some error criterion is satisfied, the algorithm is more likely to find a simple solution than to find a complex solution that gives the same result. As a result, a large network is not immediately saddled with a high complexity. Of course, back-propagation will not always stop with a simple solution if training continues.
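The effect is easy to see numerically. The following sketch (not from the book; plain numpy with made-up layer sizes) probes a one-hidden-layer tanh network at a few inputs: with small random weights the output is nearly constant and close to linear in the input, and scaling the same weights up makes the response increasingly nonlinear.

```python
import numpy as np

def mlp_output(x, W1, b1, W2, b2):
    """Output of a one-hidden-layer tanh network for inputs x (shape [n, d])."""
    h = np.tanh(x @ W1 + b1)      # hidden activations
    return np.tanh(h @ W2 + b2)   # network output

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5).reshape(-1, 1)   # a few 1-D probe inputs

# Small random weights: tanh(z) is approximately z near zero, so the
# network computes a nearly linear (here nearly constant) function.
W1, b1 = 0.01 * rng.standard_normal((1, 4)), np.zeros(4)
W2, b2 = 0.01 * rng.standard_normal((4, 1)), np.zeros(1)
print("small weights:", mlp_output(x, W1, b1, W2, b2).ravel())

# Scaling the same weights up drives the units toward saturation and the
# computed function becomes increasingly nonlinear.
for scale in (10.0, 100.0):
    print(f"scale {scale:5.0f} :", mlp_output(x, scale * W1, b1, scale * W2, b2).ravel())
```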

14.5.3 Learning Dynamics: Overtraining/Overfitting

Generalization performance varies over time as the network adapts during training (see figure 14.9). A randomly selected initial configuration is likely to be completely inconsistent with the examples, so both the training-set and generalization errors are likely to be high before learning begins. During training, the network adapts to decrease the error on the training patterns. In the early stages of learning, the generalization error tends to decrease in step with the training error as the network captures the major features of the underlying function. If the training data are noisy or incomplete, however, they may contain misleading regularities: in addition to reflecting the general properties of the target function, the data set is likely to contain peculiarities unique to that particular sample and uncharacteristic of the target function. As these idiosyncrasies are exploited in later stages of learning, the improvement in generalization that comes from being right on the training examples is offset by errors (invisible to the learning algorithm) introduced elsewhere, and the generalization error begins to increase again even though the training error continues to decrease. Chauvin [73] describes an example of this type of overtraining.

Figure 14.9: As training progresses, the generalization error may decrease to a minimum and then increase again as the network adapts to idiosyncrasies of the training data. Training and test errors are shown for an underconstrained network trained by back-propagation on 30 examples of the iris data and tested on the 120 remaining examples. (An underconstrained net and small learning rate were used to demonstrate overtraining. Smaller networks can learn the data in a much shorter time.)

Thus, for a given underconstrained network, set of training data, and learning algorithm, there may be an optimal amount of training that gives the best generalization. Although further training might decrease the training-set error, it would increase the expected generalization error. The relationship between generalization and overtraining has been examined in many studies, for example [73], [74], [16].

Another way to view the situation is illustrated in figure 14.10. The training and test errors can be thought of as surfaces in the weight space. Neural networks are trained by adapting the weights to minimize the error on the training set. The training error surface is likely to be similar to the true error surface, but distorted somewhat depending on the training data. In particular, the minimum of the training error surface is likely to be displaced from the true minimum. Depending on how the network is initialized, the weight trajectory may pass by a true minimum on its way to the apparent (training error) minimum. The observed generalization error would then show an overall decreasing trend as the network trajectory approaches both minima, followed by an increase as the network passes the true minimum and continues on to the false minimum.

Figure 14.10: One explanation of overtraining is that the training error surface is similar to the true error surface but somewhat distorted, so the apparent minimum is offset from the true minimum. The figure shows hypothetical error surfaces. Depending on how the network is initialized, the weight trajectory may pass over a true minimum on its way to the apparent (training error) minimum.

Figure 14.10 also illustrates that whether or not overtraining is observed may depend on initial conditions [74] because a trajectory starting on the left side of the figure passes over the true minimum while one approaching from the right side does not. In other words, the fact that overtraining is not observed in one trial does not prove that the training data are adequate or that it will not occur in another trial with a different initialization. Also, since the training error surface may be distorted and offset in different ways depending on the training data, observation of overtraining may depend on peculiarities of different training sets even when the network is always initialized in the same state. The figure also shows that the generalization error may not decrease monotonically to a single minimum [16] and that local minima could confuse simple early-stopping schemes that halt when the validation error first bottoms out.
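One common safeguard, consistent with the caution above, is an early-stopping rule that retains the best weights seen so far and waits through a "patience" window before halting, rather than stopping at the first upturn in the validation error. The sketch below is illustrative only; net, train_step, and val_error are hypothetical placeholders for a network object, one epoch of weight updates, and a validation-error measurement.

```python
import copy

def train_with_early_stopping(net, train_step, val_error,
                              max_epochs=10000, patience=50):
    """Illustrative early-stopping loop (not the book's procedure): keep the
    weights with the lowest validation error seen so far, and stop only after
    the validation error has failed to improve for `patience` consecutive
    epochs.  The patience window guards against halting at a shallow local
    minimum of a non-monotonic validation curve."""
    best_err = float("inf")
    best_net = copy.deepcopy(net)
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_step(net)                 # one epoch of updates on the training set
        err = val_error(net)            # error on held-out validation data
        if err < best_err:
            best_err = err
            best_net = copy.deepcopy(net)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                   # validation error has stopped improving

    return best_net, best_err
```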

Example: Effect of Training Algorithms Different training algorithms may give different generalization results depending on their built-in biases and their ability to "exploit loopholes" in the training criteria. Figure 14.11 shows input-output responses of a 2/2/1 network trained by various algorithms on the 2-bit XOR problem. Inputs were coded as -1 and +1; targets were -0.9 and 0.9. Tanh nonlinearities were used. The corners of the squares correspond to the four training points. All the networks were trained to 100% correct response, with an output considered correct if its error was less than or equal to 0.1. Batch and on-line learning, although not the fastest methods, give reasonably smooth and symmetric responses here; similar inputs would give similar outputs. The networks trained by the faster methods, RProp [315] and conjugate gradient descent, show a tendency toward asymmetric responses with sharp transitions close to the training points. This might be expected to lead to poor generalization on slightly different inputs; nets (h), (j), and (k) in particular would be sensitive to changes in the input data.

Figure 14.11: Different training algorithms may yield different generalization results. Shown are responses for a 2/2/1 net trained by various algorithms on the 2-bit XOR problem. Batch and "on-line" training appear to give reasonably smooth symmetric responses here (values in parentheses are the learning rate and momentum). RProp and conjugate gradient descent (CGD) appear to create less symmetric surfaces and allow sharp transitions near the training points at the corners. Networks (h), (j), and (k) appear sensitive to the training data and are unlikely to generalize well if the input values are changed slightly.
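For readers who want to reproduce a response surface of this kind, the following minimal numpy sketch trains a 2/2/1 tanh network on XOR by batch gradient descent, using the same input/target coding and correctness criterion as above, and then samples the trained response on a grid. The learning rate, initialization scale, and random seed are arbitrary choices (not taken from the book), and a given run may need a different seed or more epochs to reach the criterion.

```python
import numpy as np

# Inputs coded as -1/+1, targets as -0.9/+0.9, as in the experiment above.
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
T = np.array([[-0.9], [0.9], [0.9], [-0.9]])

rng = np.random.default_rng(1)
W1 = 0.5 * rng.standard_normal((2, 2)); b1 = np.zeros(2)   # 2 hidden units
W2 = 0.5 * rng.standard_normal((2, 1)); b2 = np.zeros(1)   # 1 output unit

eta = 0.1                                 # learning rate (assumed value)
for epoch in range(20000):
    # Forward pass with tanh units.
    h = np.tanh(X @ W1 + b1)
    y = np.tanh(h @ W2 + b2)
    e = y - T
    if np.max(np.abs(e)) <= 0.1:          # "correct" criterion used in the text
        break
    # Backward pass: batch gradients of the sum-of-squares error.
    dy = e * (1 - y**2)
    dh = (dy @ W2.T) * (1 - h**2)
    W2 -= eta * h.T @ dy;  b2 -= eta * dy.sum(axis=0)
    W1 -= eta * X.T @ dh;  b1 -= eta * dh.sum(axis=0)

# Probe the response on a grid to see how the trained net interpolates
# between the four training corners (cf. figure 14.11).
grid = np.array([[a, b] for a in np.linspace(-1, 1, 5)
                        for b in np.linspace(-1, 1, 5)])
response = np.tanh(np.tanh(grid @ W1 + b1) @ W2 + b2)
print(response.reshape(5, 5).round(2))
```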

Although responses (a-d) might be preferred because they are symmetric and smooth, these were not among the training criteria. All the nets are equally good on the basis of training error. If smooth, symmetric responses are desired, this information has to be provided to the training procedure in some way.
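One standard way of supplying such a preference is to add a complexity penalty, for example weight decay, to the training criterion, which biases the search toward small weights and hence toward smoother responses. A minimal sketch of the modified update (the function name and decay coefficient are illustrative, not taken from the book):

```python
def decayed_step(W, grad, eta=0.1, lam=1e-3):
    """One gradient step with a weight-decay penalty lam * ||W||^2 added to
    the training error; the extra term shrinks the weights toward zero."""
    return W - eta * (grad + 2 * lam * W)
```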

The figure also illustrates that selection of a minimal network is not sufficient to guarantee good generalization. The 2/2/1 networks are nearly minimal (some of the bias connections may not be necessary) for the 2-bit XOR problem unless short-cut connections are allowed, but the instances will probably generalize differently.