
Chapter 8 - The Error Surface

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

8.1 Characteristic Features

Stair-Steps For classification problems the error surface often has a "stair-step" quality with flat regions separated by steep cliffs (figure 8.1a). The stair-step shape can arise because samples in finite training sets are sparse and because the classifier output changes sharply at a decision boundary in the input space. The decision boundary moves in the input space as the weights change but the error remains constant until the boundary crosses over a training sample and alters its classification. Either the reclassified sample is now classified correctly and the error drops a step, or the sample is now classified incorrectly and the error jumps a step. The E(w) surface thus has flat areas where E doesn't change, separated by vertical steps where E changes discontinuously as the boundary crosses over a sample in the input space.
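As a concrete illustration of this stair-step structure (a minimal sketch, not the authors' code: the one-dimensional training set and weight grid below are invented for the example), the error of a single linear threshold unit can be tabulated over a grid of its two weights. The resulting surface is piecewise constant, stepping only where the decision boundary crosses a sample, which gives a plateau-and-cliff picture of the kind sketched in figure 8.1a.

```python
import numpy as np

# Invented 1-D training set: inputs x with binary class labels t.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([0, 0, 0, 1, 1, 1])

# Single linear threshold unit: output = step(w1*x + w0).
w0_grid = np.linspace(-4.0, 4.0, 201)   # threshold (bias) weight
w1_grid = np.linspace(-4.0, 4.0, 201)   # input weight

err = np.zeros((len(w0_grid), len(w1_grid)))
for i, w0 in enumerate(w0_grid):
    for j, w1 in enumerate(w1_grid):
        y = (w1 * x + w0 > 0).astype(int)   # step-function outputs
        err[i, j] = np.sum((t - y) ** 2)    # number of misclassified samples

# err is piecewise constant in (w0, w1): flat plateaus separated by cliffs
# that occur exactly where the boundary -w0/w1 crosses one of the samples.
```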

As the number of samples increases, the steps become more numerous and closer together (figure 8.1b); from a distance, the surface appears smoother. With continuous training data (samples available everywhere), many of the flat areas may disappear. The error can change continuously even if the node nonlinearity is a step function because the volume of positive and negative samples changes continuously as the boundary moves. Discontinuities may still occur, though, at points in the E(w) space where the system decision boundary is parallel to and crosses a true boundary in the data.

The E(w) surface also becomes smoother as the node nonlinearities of the classifier become smoother. With linear threshold units (step function nonlinearities) the plateaus of the error surface are truly flat and the steps between plateaus are truly discontinuous. When the step functions are replaced by smoother functions such as sigmoids, the steps are rounded and the error surface is smoother. Figure 8.2 shows the smoothing effect of using a lower gain sigmoid. Indeed, one of the main reasons for using sigmoids rather than step functions is that the error surface becomes continuous, so gradient based optimization methods can be used. Since gain scaling is equivalent to scaling all the node input weights by a constant factor (smaller weights correspond to smaller gains), this provides support for the heuristic of initializing with small weights.
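As a brief sketch of this equivalence (the notation here is illustrative rather than the book's), for a logistic sigmoid with gain $\beta$ applied to the weighted input $\mathbf{w}^T\mathbf{x}$, with the threshold treated as a weight on a constant input,

$$\sigma_\beta(\mathbf{w}^T\mathbf{x}) \;=\; \frac{1}{1 + e^{-\beta\,\mathbf{w}^T\mathbf{x}}} \;=\; \sigma_1\!\big((\beta\mathbf{w})^T\mathbf{x}\big),$$

so lowering the gain to $\beta < 1$ yields exactly the same node output as shrinking all of its weights by the factor $\beta$.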

Figure 8.1: (a) The error surface of a classifier often has many flat plateaus separated by steep cliffs. (b) Increasing the number of samples creates more steps and moves them closer together (adapted from [181], [183]).

The orientation and placement of a node hyperplane depends only on the ratio of its weights (section 3.1). As all weights are scaled equivalently, the location of the hyperplane stays fixed but the steepness of the sigmoid transition varies (increasing with the magnitude of the weight vector). For the system as a whole, the input-output transfer function is defined by cells bounded by the node hyperplanes; as all weights are scaled equivalently, the cell boundaries remain fixed but the steepness of the boundary transitions changes. For small scale factors (small weights), the sigmoids have small slopes and the boundary transition regions may extend across entire cells, effectively smoothing over the stair steps. As the scale factor becomes large (large weights), the boundary transition regions shrink, the cell interiors flatten, and the steps become sharper.
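A small numerical check of this point (the weights below are invented, and the sigmoid is a generic logistic function rather than anything specific to the book): scaling a node's weight vector leaves its decision boundary where it is but narrows the transition region around it.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([1.0, -2.0])     # invented input weights
b = 0.5                       # invented threshold weight

# Sample the node output along a line through the hyperplane w.x + b = 0,
# moving in the direction of the hyperplane normal.
s = np.linspace(-3.0, 3.0, 7)
normal = w / np.linalg.norm(w)
x_on_plane = -b * w / np.dot(w, w)            # a point on the hyperplane
points = x_on_plane + np.outer(s, normal)

for c in (0.25, 1.0, 4.0):                    # scale all weights by c
    y = sigmoid(points @ (c * w) + c * b)
    print(c, np.round(y, 3))
# The output crosses 0.5 at s = 0 for every c (the hyperplane is fixed),
# but the 0.1-to-0.9 transition region shrinks as c grows.
```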

Radial Features For classification problems, the preceding means the E(w) surface often has a radial or "star" topology because scaling all weights equivalently corresponds to moving along a radial line in weight space from the origin to infinity.

The surface is not truly "star-shaped" because the error can change nonmonotonically along a radial line in the region near the origin. Past a certain radius, however, the classifications cease changing as the weights increase further. Once the scale factor is large enough, the classifications remain essentially constant and the error changes very little as the weight state moves along a line to infinity.
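A hedged numerical sketch of this radial behavior (the single sigmoid unit, data set, and reference weights below are invented for illustration): scanning the sum-of-squares error along the ray obtained by scaling a fixed weight vector shows the error varying near the origin and then levelling off once the scale factor is large enough that no classification changes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Invented linearly separable 1-D problem and a reference weight vector.
x = np.array([-2.0, -1.0, 1.0, 2.0])
t = np.array([0.0, 0.0, 1.0, 1.0])
w1, w0 = 1.0, 0.2            # input weight and threshold of one sigmoid unit

for c in (0.1, 0.5, 1.0, 5.0, 20.0, 100.0):      # radial scale factor
    y = sigmoid(c * (w1 * x + w0))               # scale every weight by c
    E = np.sum((t - y) ** 2)
    print(f"c = {c:6.1f}   E = {E:.6f}")

# E changes appreciably for small c, then flattens: past some radius the
# classifications stop changing and the error barely moves along the ray.
```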

Figure 8.2: With a smaller tanh gain (0.1 in this case), the step transitions are smoother (cf. figure 8.1a). Gain scaling is equivalent to scaling all weights by a constant factor, though, so this does not change the basic shape of the error surface. In the figure, it corresponds to zooming in for a closer view of the origin.

For {0, 1} training targets (or {-1, +1} targets for tanh node functions), the minimum error on the line often occurs at infinity because the target values are reachable only by making the weights approach infinity. (This is generally true for single layer networks and linearly separable data; there may be exceptions for multilayer networks, data sets which are not linearly separable, or data sets for which the optimal outputs are not 0 and 1 even though the target values are.) The error surface therefore often has rays or troughs extending radially from the origin with minima (or maxima) at infinity. Because the sigmoid slopes are extremely small in the saturation region, the slope along the bottom of the trough is also very small. Although it is not visible in figure 8.1a, there is a trough along the center of the lowest plateau.
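Why the minimum sits at infinity can be sketched for a single sigmoid unit with {0, 1} targets and linearly separable data (a hedged outline, with illustrative notation): if $\mathbf{w}$ classifies every sample correctly, that is, $\mathbf{w}^T\mathbf{x}_i > 0$ whenever $t_i = 1$ and $\mathbf{w}^T\mathbf{x}_i < 0$ whenever $t_i = 0$, then along the ray $c\mathbf{w}$, $c > 0$,

$$E(c\mathbf{w}) \;=\; \sum_i \big(t_i - \sigma(c\,\mathbf{w}^T\mathbf{x}_i)\big)^2 \;\longrightarrow\; 0 \quad \text{as } c \to \infty,$$

with every term decreasing monotonically in $c$, so no finite point on the ray can be a minimum.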

Replacing the {0, 1} targets with {0.1, 0.9} values may move the minima in from infinity, but this may also introduce new minima in the form of small dips at the bottom of each cliff (figure 8.3); these are usually shallow and narrow, however. Consider how the error varies as a sigmoid is shifted sideways by varying the threshold. As the 0.9 part of the sigmoid passes over a 0.9 target, the error for that sample goes through zero. This creates a local minimum (if other samples are sufficiently far away) because the error increases as the sigmoid is shifted to either side around this point. In the two-dimensional plots, this appears as a small gutter or trough along the bottom of each step. Similar effects can also occur at step tops in networks with hidden layers. One way to suppress the gutter is to change the error function so that outputs greater than 0.9 (for target values t = 1) and less than 0.1 (for t = 0 and sigmoid nodes) do not contribute to the error [181], [183]. (In section 8.5 this is called the LMS-threshold error function.) This could introduce truly flat plateau regions, however, causing problems for gradient-based training methods.
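A sketch of such a thresholded error function (the name lms_threshold_error and its arguments are illustrative; only the idea of ignoring outputs beyond 0.9 and 0.1 comes from the text):

```python
import numpy as np

def lms_threshold_error(y, t, low=0.1, high=0.9):
    """Sum-of-squares error that ignores outputs which already overshoot
    their target: outputs above `high` for t = 1 and below `low` for t = 0
    contribute nothing, removing the gutter at the bottom of each cliff."""
    satisfied = ((t == 1) & (y >= high)) | ((t == 0) & (y <= low))
    e = np.where(satisfied, 0.0, t - y)
    return np.sum(e ** 2)
```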

Figure 8.3: Replacing the {-1, 1} targets with {-0.9, 0.9} values produces a small dip at the bottom of each cliff. For illustration purposes, the tanh gain was reduced to 1/2 to make the dip wider and smaller targets were used to make it deeper.

Troughs and Ridges More significant troughs and ridges occur when the classifier cannot completely separate the training samples. Figure 8.4 shows the error surface for the two-weight classifier given a training set that is not linearly separable. The input data is one-dimensional (points on a line). As the threshold weight varies, the sigmoid shifts along the input axis and the error increases and then decreases again as the decision boundary crosses individual samples. This occurs for all values of the gain weight so the result is a trough in the error surface. A gradient based optimization method could easily get stuck in one of these troughs and so converge to a poor solution.
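A surface of the kind shown in figure 8.4 can be approximated with a short sketch (the interleaved, non-separable 1-D training set below is invented; the two weights are the input gain weight and the threshold, as in the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Invented 1-D training set that is NOT linearly separable:
# the two classes are interleaved along the input axis.
x = np.array([-2.0, -1.0, -0.3, 0.3, 1.0, 2.0])
t = np.array([ 0.0,  1.0,  0.0, 1.0, 0.0, 1.0])

gain_grid = np.linspace(-5.0, 5.0, 201)       # input (gain) weight
thresh_grid = np.linspace(-5.0, 5.0, 201)     # threshold weight

E = np.zeros((len(gain_grid), len(thresh_grid)))
for i, w1 in enumerate(gain_grid):
    for j, w0 in enumerate(thresh_grid):
        y = sigmoid(w1 * x + w0)
        E[i, j] = np.sum((t - y) ** 2)

# Plotting E over (w1, w0) shows radial troughs and ridges: as the threshold
# shifts the sigmoid along the input axis, the boundary crosses interleaved
# samples and the error rises and falls, so a gradient method can settle
# into a trough corresponding to a poor solution.
```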

It is interesting to note that, in figure 8.4 at least, the troughs come together at the origin. This supports the idea of initializing with small weights, that is, near the origin, where all troughs (including the main basin) are reachable in just a few steps. Although true gradient following methods would not be able to escape from a poor trough, approximations such as on-line or batch back-propagation with a noninfinitesimal step size would have an appreciable chance if better alternatives are sufficiently close. Of course, we should not jump to conclusions based on this one example; in many problems the origin is a local minimum and for these it may be better to initialize at some intermediate distance.

Figure 8.4: When the samples are not linearly separable, the error surface has radial troughs and ridges. A gradient based optimization method could easily get stuck in one of the troughs corresponding to a poor solution (adapted from [181], [183]).