14.3 Network Complexity versus Target Complexity
To generalize well, a system must be powerful enough to approximate the target
function. If it is too simple to fit even the training data, then generalization
to new data is also likely to be poor. (The true error may not be much worse
than the training error, however, depending on how well the training data
represent the target function.) If the network is powerful enough, then good
generalization is at least possible, provided it is not limited by other
factors. Contrary to the rule of thumb that simpler is better, a larger network
may generalize better because it is more powerful and better able to approximate
the true target function. An overly complex system, however, may be able to
approximate the data in many different ways that give similar errors, and it is
unlikely to choose the one that will generalize best unless other constraints
are imposed.
Figure 14.2 illustrates possible under- and overfitting. The fitting function is
a linear combination of M evenly spaced Gaussian basis functions with width
inversely proportional to M. At M = 3, the approximation is too simple and the
error is large. At M = 5, the errors are smaller. At M = 30, the approximation
may be overfitting the data.
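As a rough illustration of this kind of fit, the following Python sketch fits M evenly spaced Gaussian basis functions by ordinary least squares. The target function, noise level, sample size, unit interval, the exact width 1/M, and the helper names gaussian_design and fit are all assumptions made for this example; none of them are taken from figure 14.2.

```python
import numpy as np

def gaussian_design(x, M):
    """N-by-M matrix: column j is a Gaussian bump centered at the j-th of
    M evenly spaced points on [0, 1]; the width is 1/M (assumed constant)."""
    centers = np.linspace(0.0, 1.0, M)
    width = 1.0 / M
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def fit(x, y, M):
    """Least-squares weights of the linear combination (minimum-norm if M > N)."""
    w, *_ = np.linalg.lstsq(gaussian_design(x, M), y, rcond=None)
    return w

def target(x):
    # Hypothetical smooth target, for illustration only.
    return np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 20)
y_train = target(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(0.0, 1.0, 400)

errors = {}  # M -> (training MSE, test MSE against the noise-free target)
for M in (3, 5, 30):
    w = fit(x_train, y_train, M)
    train_mse = np.mean((gaussian_design(x_train, M) @ w - y_train) ** 2)
    test_mse = np.mean((gaussian_design(x_test, M) @ w - target(x_test)) ** 2)
    errors[M] = (float(train_mse), float(test_mse))
    print(f"M={M:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

With M = 30 and only 20 training points, the fit can reproduce the training data almost exactly, so a small training error at that setting says little about the error on new inputs.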
Whether a given network overfits or underfits the data depends in part on the
size of the training set. Figure 14.3 shows generalization error versus
complexity curves for a slightly more complex function fitted by the same system
of Gaussian basis functions. In general, the curve for a particular sample size
N has a minimum at some intermediate complexity value M. Below a certain
threshold value of M, the approximation is too simple and all systems have large
errors. At high values of M, the system begins to overfit and the error
increases.
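Curves of this kind can be estimated numerically. The sketch below is a minimal illustration under stated assumptions, not a reproduction of figure 14.3: the target function, the noise level, the sample sizes, the list of M values, and the helper names design and generalization_error are all hypothetical. For each sample size N, it averages the test error over several random training sets.

```python
import numpy as np

def design(x, M):
    """N-by-M matrix of evenly spaced Gaussian bumps on [0, 1], width 1/M."""
    centers = np.linspace(0.0, 1.0, M)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) * M) ** 2)

def target(x):
    # Hypothetical target, chosen only for illustration.
    return np.sin(2.0 * np.pi * x) + 0.5 * np.sin(6.0 * np.pi * x)

def generalization_error(N, M, rng, noise=0.2, trials=20):
    """Test MSE against the noise-free target, averaged over random training sets."""
    x_test = np.linspace(0.0, 1.0, 400)
    errs = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, N)
        y = target(x) + noise * rng.normal(size=N)
        w, *_ = np.linalg.lstsq(design(x, M), y, rcond=None)
        errs.append(np.mean((design(x_test, M) @ w - target(x_test)) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
M_values = [2, 4, 8, 12, 16, 24, 32]
# One error-versus-complexity curve per sample size N.
curves = {N: [generalization_error(N, M, rng) for M in M_values]
          for N in (20, 200)}

for N, errs in curves.items():
    best = M_values[int(np.argmin(errs))]
    print(f"N={N:3d}  best M={best:2d}  "
          + "  ".join(f"{e:.3g}" for e in errs))
```

Printing the two curves side by side makes it easy to compare where the minimum falls for each N. The exact numbers depend on the assumed target, noise level, and random seed.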
Unfortunately, if the target function is completely unknown, there is no way to
determine a priori whether the network is complex enough. The fit in figure
14.2c may be overfitting if the data are noisy and the target function has a
form similar to figure 14.2b, but it could be that the data are clean and the
actual function is a complex deterministic function, in which case the fit in
figure 14.2b may be underfitting. Additional information is needed.