8.2 The Gradient Is the Sum of Single-Pattern Gradients
With an SSE or MSE cost function, the E(w) surface is the sum (or average) of the individual surfaces for each pattern, and the total gradient is the sum (or average) of the single-pattern gradients. In other words,
the error is shaped by the interaction of the weights with each of the individual
training patterns. Figure 8.5 shows single-pattern gradients for a simple two-weight
problem. These are the vectors that would be used for weight updates in on-line
learning. On a "hillside" (a), most of the vectors point in a dominant direction.
On a "ridge" (b) or at the bottom of a "valley" (c), there are often two bundles
of vectors pointing in opposite directions across the valley. In on-line learning, the weights are updated from one pattern at a time and thus tend to oscillate across
the valley. At a local minimum (d) the vectors sum to zero; they may be large and distributed in all directions, or they may all go to zero. If they merely cancel without all going to zero, the minimum will be unstable under on-line learning: the weight vector will move off the minimum if placed there. Point (e) shows a relatively "flat spot". These examples are not universal, since similar E(w) features could arise in many ways, but they are common. Other cost functions may yield different behavior.
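To make the decomposition concrete, the following is a minimal sketch in Python/NumPy; the linear two-weight model, the SSE definition E(w) = (1/2) Σ_p (t_p − y_p)², the random training set, and all variable names are illustrative assumptions, not taken from the text. It builds the single-pattern gradients for a two-weight problem and verifies that their sum equals the batch SSE gradient; the per-pattern vectors are exactly the updates on-line learning would apply.

```python
import numpy as np

# Illustrative two-weight linear model y_p = w . x_p with SSE cost
# E(w) = 0.5 * sum_p (t_p - y_p)^2.
# Per-pattern gradient: dE_p/dw = -(t_p - y_p) * x_p.

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # 20 training patterns, 2 weights
w_true = np.array([1.0, -2.0])
t = X @ w_true                 # targets generated by the "true" weights

w = np.array([0.5, 0.5])       # current weights, away from the minimum

# Single-pattern gradients, one vector per pattern (the vectors plotted
# in Figure 8.5); each is the update on-line learning would use.
g_p = np.array([-(t[p] - X[p] @ w) * X[p] for p in range(len(X))])

# The total SSE gradient is their sum (for MSE, their average).
g_batch = -(X.T @ (t - X @ w))

print(np.allclose(g_p.sum(axis=0), g_batch))             # True
print(np.allclose(g_p.mean(axis=0), g_batch / len(X)))   # MSE version
```

In this sketch the targets are exactly realizable, so at w = w_true every single-pattern gradient vanishes individually; with noisy targets the least-squares minimum would instead satisfy the first-order condition by cancellation of nonzero per-pattern gradients, which is the unstable on-line situation described above.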