8.2 The Gradient is the Sum of Single-Pattern Gradients
With an SSE or MSE cost function, the E(w) surface is the sum (or average) of
the individual surfaces for each pattern, and the total gradient is the sum
(or average) of the single-pattern gradients. In other words, the error is
shaped by the interaction of the weights with each of the individual training
patterns.
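This additivity is easy to check numerically. The following is a minimal
sketch in plain NumPy; the patterns, targets, and weight vector are made-up
values chosen only for illustration. It computes the SSE gradient of a
two-weight linear unit both pattern by pattern and in a single batch pass;
the sum of the single-pattern gradients matches the batch gradient to
floating-point precision.

    import numpy as np

    # Tiny two-weight linear unit y = w . x with SSE cost
    # E(w) = sum_p (t_p - y_p)^2.  Hypothetical toy data.
    X = np.array([[0.5, 1.0],
                  [1.5, -0.5],
                  [-1.0, 2.0]])    # three training patterns
    t = np.array([1.0, 0.0, 2.0])  # targets
    w = np.array([0.3, -0.2])      # an arbitrary weight vector

    def pattern_gradient(x, target, w):
        """Gradient of the single-pattern error (t - w.x)^2 w.r.t. w."""
        err = target - x @ w
        return -2.0 * err * x

    # Sum of single-pattern gradients ...
    grad_sum = sum(pattern_gradient(x, tp, w) for x, tp in zip(X, t))

    # ... equals the total SSE gradient computed in one batch pass.
    errs = t - X @ w
    grad_batch = -2.0 * errs @ X

    assert np.allclose(grad_sum, grad_batch)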
Figure 8.5 shows single-pattern gradients for a simple two-weight problem. These
are the vectors that would be used for weight updates in on-line learning. On a
"hillside" (a), most of the vectors point in a dominant direction. On a "ridge"
(b) or at the bottom of a "valley" (c), there are often two bundles of vectors
pointing in opposite directions across the valley. In on-line learning, the
weights are updated from one pattern at a time and thus tend to oscillate
across the valley, as the sketch at the end of this section illustrates. At a
local minimum (d) the vectors sum to zero; they may be large and distributed
in all directions, or they may all go to zero. If they simply cancel without
individually going to zero, the minimum will be unstable with on-line
learning: the weight vector will move off the minimum if placed there.
Point (e) shows a relatively "flat spot". These examples are not universal,
since similar E(w) features can arise in many ways, but they are common.
Other cost functions may yield different behavior.
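To make the valley behavior concrete, here is a small simulation sketch,
again in plain NumPy with invented data; the two patterns, the learning rate,
and the starting point are all assumptions chosen to produce a narrow valley,
not values taken from the text. On the valley floor the two single-pattern
gradients point in opposite directions across the valley and cancel in the
sum, while on-line updates take each cross-valley component at full strength
and zig-zag from side to side as the weight vector creeps along the floor.

    import numpy as np

    # Hypothetical two-pattern least-squares problem whose E(w) surface is
    # a narrow valley: E(w) = 2(1 - w1)^2 + 200*w2^2, with the valley
    # floor along the line w2 = 0.  (Made-up data, for illustration only.)
    X = np.array([[1.0,  10.0],
                  [1.0, -10.0]])
    t = np.array([1.0, 1.0])

    def grad(x, target, w):
        """Single-pattern gradient of the error (t - w.x)^2."""
        return -2.0 * (target - x @ w) * x

    # On the valley floor the two single-pattern gradients oppose each
    # other across the valley; their steep w2 components cancel exactly,
    # and only the gentle downhill w1 component survives in the sum.
    w0 = np.zeros(2)
    g1, g2 = grad(X[0], t[0], w0), grad(X[1], t[1], w0)
    print(g1, g2, g1 + g2)        # [-2,-20], [-2,+20] -> sum [-4, 0]

    # On-line learning updates from one pattern at a time, so each step
    # takes the full cross-valley component: w2 overshoots the floor and
    # flips sign on every update while w1 creeps toward its optimum at 1.
    lr, w = 0.008, np.array([0.0, 0.05])
    for step in range(10):
        x, tp = X[step % 2], t[step % 2]
        w = w - lr * grad(x, tp, w)
        print(f"step {step}: w = ({w[0]:.3f}, {w[1]:+.4f})")

Note that at the total-error minimum w* = (1, 0) both single-pattern
gradients in this particular example go to zero, so it illustrates the
stable version of case (d); the unstable version would require
single-pattern gradients that cancel in the sum without individually
vanishing.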