8.4 Remarks
The efficiency of any optimization method depends on having a good
fit between the basic assumptions of the algorithm and the actual
characteristics of the function being minimized. Many advanced optimization
methods assume the error surface is locally quadratic, for example, and may not
do well on the "cliffs-and-plateaus" surface common in neural network
classifiers. In this case, the quadratic assumption is not reasonable on a large
scale, so these optimizers may be no more efficient than simpler methods at
finding a good approximate solution. Near a minimum, however, the assumption is usually reasonable, in which case
these methods may be very efficient for final tuning of a near-solution found by
other methods.
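To state the quadratic assumption concretely, such methods fit a local second-order model to the error and jump to its minimizer; the gradient g and Hessian H below are standard notation, introduced here only for illustration:

\[
E(w) \;\approx\; E(w_0) + g^{\top}(w - w_0) + \tfrac{1}{2}(w - w_0)^{\top} H (w - w_0),
\qquad
g = \left.\frac{\partial E}{\partial w}\right|_{w_0},\;\;
H = \left.\frac{\partial^2 E}{\partial w^2}\right|_{w_0},
\]
\[
w_{\min} \;\approx\; w_0 - H^{-1} g .
\]

The step is meaningful only where H is positive definite, which is one reason the model tends to hold up near a minimum but not across plateaus and cliffs.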
For back-propagation, a large learning rate is needed to make
progress across the large flat regions, but near the "cliffs," where ||∂E/∂w|| is large, a small learning rate is
necessary to prevent huge weight changes in essentially random directions. If a
fixed learning rate is used, its value must be a compromise between the two. One
advantage of the common technique of initializing with small random weights is
that the system starts near the origin, where the error surface is
smoother, and so has a better chance of finding the right trough.
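The compromise can be seen in a toy one-dimensional sketch; the error function, its "cliff" at w = 2, and all of the constants below are invented purely for illustration:

def grad(w):
    # Gradient of an illustrative 1-D error: E(w) = 0.01*w**2 plus a steep
    # quadratic "cliff" term 50*(w - 2)**2 that switches on for w > 2.
    g = 0.02 * w
    if w > 2.0:
        g += 100.0 * (w - 2.0)
    return g

def descend(w, rate, steps):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        w -= rate * grad(w)
    return w

print(descend(-10.0, rate=0.05, steps=100))  # small rate: barely crosses the flat region
print(descend(-10.0, rate=5.0,  steps=100))  # large rate: reaches the minimum near w = 0
print(descend(  2.5, rate=5.0,  steps=1))    # same large rate at the cliff: one huge jump to about -250

A rate small enough to behave sensibly at the cliff makes almost no progress on the plateau, while a rate large enough to cross the plateau produces a wild step the moment the gradient becomes large.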
Before ending this discussion, it should be noted that the error
surfaces illustrated in the figures are for classification problems with small
training sets. The error surfaces may be very different for regression problems
with large sample sizes; based on figure 8.1, it is reasonable to expect them to be smoother.
For regression problems where the target is a continuous function
of its inputs, smooth input-output functions are usually preferred. If there is
sufficient data that the system cannot fit every point exactly, then it must
approximate multiple points by fitting a surface "close" to them. For many cost
functions, this surface can be thought of as the local average of nearby points
and will tend to be smooth because of the smoothing effects of averaging.
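To see the averaging claim in the simplest case, consider the sum-of-squared-error cost (used here as an example; the argument does not depend on this particular choice). If the network is forced to produce a single output ŷ for a group of nearby targets y_1, ..., y_n, the best it can do is

\[
\frac{\partial}{\partial \hat{y}} \sum_{i=1}^{n} (y_i - \hat{y})^2
= -2 \sum_{i=1}^{n} (y_i - \hat{y}) = 0
\quad\Longrightarrow\quad
\hat{y} = \frac{1}{n}\sum_{i=1}^{n} y_i ,
\]

that is, the local average of the targets.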
Because smoother functions generally correspond to smaller weights, the good
minima will usually be in the interior of the weight space rather than at
infinity. Similarly, in classification problems with many samples in overlapping
clusters, it may be better to form gradual transitions between classes in
regions where they overlap. This again corresponds to smaller weights and moves
the minima in from infinity.
Of course, if there is so little data and the network is so
powerful that it can fit every point exactly, then there is no reason to expect
it to form a smooth function, and minima at infinity may survive. Even when a
smooth function would be preferable, local minima at infinity may survive
corresponding to fitting a few of the points exactly while ignoring the rest;
these are likely to be shallow and narrow, however, and will probably be
shadowed by better minima closer to the origin. Although plateaus and cliffs
will be apparent at large distances from the origin, this may be irrelevant
because those regions will never be investigated by the learning algorithm.
The stair-step shape may survive in very underconstrained
networks that can essentially classify each of the training points internally
(e.g., by assigning a hidden "grandmother node" to each training sample). In
this case, the global minima of the training-set error would be at infinity, and
regions of low generalization error corresponding to smooth functions are unlikely to
be minima.