Ideally, the number and distribution of training samples should be such that minimizing the error on the samples gives results identical to minimizing the error on the true function. Poor generalization could arise due to problems in the way the training data is sampled. In general, the training set must be representative of the target function. A poor set of training data may contain misleading regularities not found in the underlying function although, when samples are randomly selected, this becomes less likely as the sample size grows.
For neural networks used as function approximators, generalization usually means interpolation. This does not imply an ability to extrapolate (figure 14.4).
Systems are often developed under conditions slightly different from normal operating conditions and the resulting differences in training and test sample distributions could lead to poor generalization. Approximations using a mean squared error function are sensitive to the sampling density in the sense that the minimizer tends to focus on getting a good fit in densely sampled regions and may ignore large but rare errors in sparsely sampled regions. That is, as the number of samples grows without bound, the mean squared error approaches the expected squared error

E = ∫ [f(x) − f̂(x)]² p(x) dx,

where f is the target function, f̂ is the approximation, and p(x) is the sampling density.
Errors are weighted by the sampling distribution p(x) so the minimizer will be happy to let errors increase in regions where p(x) is low if it can decrease errors in regions where p(x) is high. Good generalization cannot be expected, therefore, if a network is trained on samples from one region but tested on samples from a completely different region. Most training methods assume training and test samples are drawn from the same distribution and modifications will be required if different distributions are used. Other error measures can be less sensitive to this factor than MSE.
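A toy sketch of this effect (the target function, regions, and sample counts below are invented for the demo): fitting a constant by least squares to samples of f(x) = x² drawn mostly from one region shows how the MSE minimizer favors the densely sampled region and tolerates large errors where samples are rare.

```python
import numpy as np

# Target f(x) = x^2, sampled densely on [0, 1] and sparsely near x = 3.
x_dense = np.linspace(0.0, 1.0, 20)   # 20 samples: high p(x)
x_sparse = np.array([2.5, 3.0])       # 2 samples: low p(x)
x = np.concatenate([x_dense, x_sparse])
y = x ** 2

# The constant minimizing MSE over the pooled samples is their mean,
# so it is dominated by the densely sampled region.
c = y.mean()

mse_dense = np.mean((y[:20] - c) ** 2)   # small: fit is good here
mse_sparse = np.mean((y[20:] - c) ** 2)  # large: minimizer ignores this region
```

The error in the sparsely sampled region ends up well over an order of magnitude larger than in the dense region, even though the fit is MSE-optimal overall.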
When the target function has important features in low density regions, large sample sizes may be needed to obtain enough samples to adequately represent the function there. Large sample sizes will also be needed to obtain accurate estimates of the sampling density. Even if the training and test data share the same sampling distribution, a very small training set could give a misleading picture of the true density.
Differences in sampling distributions need not always cause bad generalization, however. A system trained on harder than average cases may generalize well even if it performs only moderately well on the training data. Concentrating on cases near classification boundaries has been useful in some problems.
Differences in distribution may also be less important when other information is provided or the network is constrained appropriately. An image recognition system, for example, with a rotation invariant representation can be trained on images in one orientation and tested on images in another because the cases are effectively equivalent under the representation. Sampling differences may also be less important when the target function is simple enough that a fitting function that works well in the densely sampled regions also works in the sparsely sampled regions. That is, if there is so much data that the network is fully constrained everywhere, then small changes in the distribution of the data may have little effect on the solution chosen.
Finally, in some problems, there is the option of choosing the training samples, possibly while learning is in progress. One idea is to choose the training samples that provide the most new information. This can be a useful strategy for any learning system, including people.
Even when the training and test sampling densities match, generalization may suffer if the training set does not capture all significant features of the underlying target function. If the function is a bandlimited sum of sinusoids, for example, and samples are taken at less than the Nyquist rate, then essential features will be missed and the reconstruction will not generalize well.
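This undersampling failure can be checked numerically (the frequencies here are chosen arbitrarily for the demo): a 5 Hz sinusoid sampled at 6 Hz, below its Nyquist rate of 10 Hz, is indistinguishable at the sample points from a 1 Hz alias, so no learner could recover the true function from these samples alone.

```python
import numpy as np

fs = 6.0                               # sampling rate (Hz), below Nyquist for 5 Hz
t = np.arange(12) / fs                 # two seconds of samples
high = np.sin(2 * np.pi * 5.0 * t)     # true 5 Hz signal
alias = -np.sin(2 * np.pi * 1.0 * t)   # 1 Hz alias (since 5 = 6 - 1)

# The two very different functions agree exactly at every sample point.
indistinguishable = np.allclose(high, alias)
```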
Constraining knowledge about the generating function can reduce the number of samples needed to choose a hypothesis function. The knowledge that the function is a bandlimited sum of sinusoids, for example, allows a Nyquist rate to be set. (And further knowledge about the probabilities of likely functions may allow sampling at less than the Nyquist rate.)
In general, the number of examples required depends on the constraining knowledge. If the target function is known to be a polynomial of degree N or less, say, then N + 1 points are sufficient to determine which particular polynomial generated the data. If the target function is, in fact, an Nth order polynomial, then the chosen function should generalize perfectly. Of course, more complex function classes will generally require more samples and, in many cases, the number of samples required may be unreasonably large or unbounded, depending on the constraints. It may also be difficult to calculate the function even if it is known to exist. Finally, the constraining knowledge must be available in a form useful to the training procedure; it is not obvious how knowledge that the target is an Nth order polynomial could be made useful to a standard neural network training procedure, for example.
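The polynomial case is easy to verify (a minimal sketch with an arbitrary cubic): exactly N + 1 = 4 noise-free samples suffice for a least-squares fit to recover a degree-3 polynomial's coefficients.

```python
import numpy as np

coeffs = np.array([2.0, -1.0, 0.5, 3.0])  # arbitrary cubic, highest power first
x = np.array([0.0, 1.0, 2.0, 3.0])        # N + 1 = 4 distinct sample points
y = np.polyval(coeffs, x)                 # noise-free samples of the target

fit = np.polyfit(x, y, deg=3)             # 4 points determine a cubic exactly
recovered = np.allclose(fit, coeffs)
```

With fewer than N + 1 points, infinitely many cubics would pass through the samples; with exactly N + 1 distinct points, the Vandermonde system is square and invertible, so the fit is unique.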
In general, larger and more complex systems can compute more functions but need more examples to determine which function to choose (one aspect of the so-called curse of dimensionality). If the number of examples is small compared to the number of degrees of freedom of the network, then the network may be able to implement many functions consistent with the examples. Some of these may generalize better than others, but a learning algorithm guided only by errors on the samples has no way to tell and is unlikely to choose the one that generalizes best. As a rule, increasing the number of samples decreases the number of hypothesis functions consistent with the samples, but it will never reduce it to just one function unless the set of hypothesis functions is finite or there are other constraints.
If a network is underconstrained, that is, if it has more degrees of freedom (roughly, the number of weights) than the number of training samples, then it may use the extra degrees of freedom to fit noise or spurious correlations in the data. Even though it may produce exactly the right output at each of the training points, it may be very inaccurate at other points. An example is a high-order polynomial fitted through a small number of points. Figure 14.5 shows an example of overfitting by an underconstrained network.
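The polynomial version of this failure is easy to reproduce (degrees, noise level, and sample counts below are invented for the demo): a degree-9 polynomial forced through 10 noisy samples of a quadratic drives the training error to nearly zero but generalizes far worse than a degree-2 fit matched to the target.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: x ** 2
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_f(x_train) + rng.normal(0.0, 0.3, size=10)  # noisy samples

x_test = np.linspace(-1.0, 1.0, 200)  # dense, noise-free evaluation grid

def fit_errors(deg):
    w = np.polyfit(x_train, y_train, deg)
    train = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(w, x_test) - true_f(x_test)) ** 2)
    return train, test

train9, test9 = fit_errors(9)  # underconstrained: interpolates the noise
train2, test2 = fit_errors(2)  # matched to the target's true complexity
```

The degree-9 fit wins on the training set and loses on the test set, which is exactly the overfitting pattern of figure 14.5.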
As with polynomial approximations, a rule of thumb is to use the smallest system that fits the data. If the system has only a limited number of degrees of freedom, it will usually use them to adapt to the largest regularities in the data and ignore the smaller, possibly spurious, ones. (It should be noted that "degrees of freedom" is not a well-defined concept for nonlinear systems, so it may be difficult to say how many samples are necessary to constrain a given network.)
Another general rule of thumb is that the number of samples should be several times larger than the number W of weights in the network. A simple perceptron (i.e., a linear threshold unit) with d inputs (W = d + 1) can always separate N < d + 1 arbitrarily labeled points and, if the points are randomly distributed, can separate almost all labelings of 2d or fewer points when d is large. (This is discussed in section 3.3.) Thus, O(3W) points may be needed to constrain even a simple perceptron. Of course, this simple rule of thumb may not hold for more complex functions and networks.
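The first claim can be checked directly (a small sketch with hypothetical points): for d = 3 inputs, every labeling of N = d + 1 = 4 points in general position is linearly separable, because the bias-augmented pattern matrix is square and invertible, so weights achieving any target labeling can be solved for exactly.

```python
import numpy as np
from itertools import product

# Four points in general position in 3 dimensions (d + 1 = 4 points).
X = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
A = np.hstack([X, np.ones((4, 1))])  # augment with a bias column: 4x4, invertible

n_separable = 0
for labels in product([-1.0, 1.0], repeat=4):   # all 2^4 = 16 labelings
    t = np.array(labels)
    w = np.linalg.solve(A, t)                   # exact solution: A @ w = t
    if np.all(np.sign(A @ w) == np.sign(t)):    # perceptron output matches labels
        n_separable += 1
```

All 16 labelings come out separable, consistent with a perceptron's capacity of d + 1 arbitrary points.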
Figure 14.6 shows generalization error versus sample size N for the Gaussian basis function approximation system of figure 14.3. Higher complexity approximations (with more basis functions M in this case) yield smaller asymptotic errors, but need more samples before the error approaches the asymptotic value. With small sample sizes, lower complexity systems may generalize better. At N = 15, for example, the error of the M = 20 system is higher than that of the M = 10 system.
Figure 14.7 shows training and test set errors versus sample size N for selected M values. For small sample sizes, the training error is small but the test set error is high. As the sample size increases, the training error increases and the test set error decreases until both approach the same asymptotic value at large N. Note that the large M solutions have lower asymptotes, but the small M solutions approach their asymptotes faster. The figure gives some support to the heuristic that the sample size should be several times as large as the number of free parameters since the knee of the M = 10 curve occurs near N = 10.
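The qualitative shape of these curves can be reproduced with a toy fit (the model degree, noise level, and sample sizes are invented for the demo): holding a degree-5 polynomial model fixed, the training error rises and the test error falls as the sample size N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = np.sin
x_test = np.linspace(-3.0, 3.0, 400)  # dense, noise-free evaluation grid

def errors(n):
    x = np.linspace(-3.0, 3.0, n)
    y = true_f(x) + rng.normal(0.0, 0.5, size=n)  # noisy training samples
    w = np.polyfit(x, y, deg=5)                   # fixed-complexity model
    train = np.mean((np.polyval(w, x) - y) ** 2)
    test = np.mean((np.polyval(w, x_test) - true_f(x_test)) ** 2)
    return train, test

train_small, test_small = errors(6)    # N equals the parameter count: interpolation
train_large, test_large = errors(200)  # many samples per parameter
```

With N = 6 the model interpolates the noise (tiny training error, large test error); with N = 200 both errors settle toward a common asymptote, as in figure 14.7.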
If the training data are noisy, the unknown target function can only be estimated. No finite number of examples will uniquely determine a fitting function, even if the number of hypothesis functions is finite; the best that can be done is to choose among candidates based on a criterion such as the mean squared error or the probability of misclassification. If one knows or assumes properties of the noise, then an optimal estimate can be made in some cases.
In general, if the noise is independent and zero-mean, then errors can be reduced by obtaining more samples since independent errors tend to cancel. (Systematic measurement errors will not be reduced by simply averaging over more data, however.) More samples will be needed to reduce the variance of the estimated error to an equivalent level when noise levels are high.
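Both points admit a quick sanity check (the sample sizes and bias value are invented for the demo): the spread of the sample mean shrinks roughly as 1/√N for independent zero-mean noise, while a systematic offset persists no matter how many samples are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.0
sigma = 1.0
trials = 2000

# Estimate true_value by averaging N noisy measurements, repeated over many trials.
est_small = rng.normal(true_value, sigma, size=(trials, 25)).mean(axis=1)
est_large = rng.normal(true_value, sigma, size=(trials, 400)).mean(axis=1)
spread_small = est_small.std()   # about sigma / sqrt(25)  = 0.2
spread_large = est_large.std()   # about sigma / sqrt(400) = 0.05

# A systematic (biased) measurement error is NOT reduced by averaging more data.
bias = 0.5
est_biased = rng.normal(true_value + bias, sigma, size=400).mean()
```

Quadrupling the sample size from 25 to 400 roughly quarters the spread of the estimate, but the biased estimate stays near 0.5 regardless of N.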
With a static data set, there is no obvious indication that the data are noisy. If noise is not considered and the system is allowed to fit the data exactly, noise and all, then generalization may be poor.
Increased amounts of noise may lead to overfitting at lower complexity levels. Figure 14.8 shows test set error versus complexity curves for various levels σ of additive noise in the target values. The fitting function is the evenly spaced Gaussian basis function system of figure 14.3 with training set size N = 20. With increasing amounts of additive noise, the minima of the error curves are higher and occur at lower M values.
Ideally, the class of functions computable by the network would exactly equal the class of the generating function and the number of examples needed to constrain the network would be the same as the number of examples needed to determine the function by other means. This would be a parametric model. Usually however, the target function lies outside the class of functions computed by the network; no set of weights will yield a function that exactly matches the target everywhere so, at best, the network can only approximate the target. In this case, the number of examples needed to determine the best network may differ from the number of samples required to fit a parametric model. Although we need only N + 1 samples to fit a function we know to be an Nth order polynomial, this information is not available to the network and many more samples might be needed to train it. Because the network and target function classes differ, the network output will be an approximation and generalization will be imperfect even if the network fits an arbitrarily large number of data samples exactly.