Having obtained the derivatives, the next step is to update the weights so as to decrease the error. As noted earlier, the term back-propagation refers to (1) an efficient method to calculate the derivatives ∂E/∂w and (2) an optimization algorithm that uses those derivatives to adjust the weights to reduce the error. At this point we have the choice of continuing with back-propagation, the optimization algorithm, or of using one of many alternative optimization methods that may be better adapted to the given problem.
Back-propagation (the optimization method) is basically equivalent to gradient descent. By definition, the gradient of E points in the direction that increases E the fastest. To minimize E, the weights are adjusted in the opposite direction. The weight update formula is
Δw = −η ∂E/∂w,
where the learning rate η > 0 is a small positive constant, sometimes also called the step size parameter. If the derivative is positive (so that increases in w cause increases in E), then the weight change is negative, and vice versa. The update approaches pure gradient descent as η becomes infinitesimal. Very small values of η mean very long learning times, however, so larger rates are usually used. Typical values are in the range 0.05 < η < 0.75. (This is just a rule of thumb, however; see section 6.1 for more discussion.)
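The update rule can be sketched in a few lines. This is a minimal illustration, not an implementation from the text: the one-weight error surface E(w) = (w − 3)², the function names, and the constants η = 0.1 and the step count are all assumptions chosen for the example.

```python
# Minimal sketch of the gradient-descent update Dw = -eta * dE/dw,
# applied to a toy one-weight error surface E(w) = (w - 3)^2.
# The names (dE_dw, train, eta, n_steps) are illustrative only.

def dE_dw(w):
    """Derivative of E(w) = (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def train(w, eta=0.1, n_steps=100):
    for _ in range(n_steps):
        w = w - eta * dE_dw(w)  # step opposite to the gradient
    return w

w_final = train(w=0.0)  # approaches the minimum at w = 3
```

Because the derivative is positive when w > 3 and negative when w < 3, each step moves w toward the minimum, exactly as the sign argument above describes.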
The network is usually initialized with small random weights. Values are often selected uniformly from a range [-a, +a] where 0.1 < a < 2 typically. Random values are needed to break symmetry while small values are necessary to avoid immediate saturation of the sigmoid nonlinearities. Chapter 7 considers initialization methods in more detail.
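A sketch of this initialization follows. The layer sizes, the value a = 0.5 (a point inside the text's typical range 0.1 < a < 2), and the fixed seed are assumptions made for the example.

```python
# Sketch of small random weight initialization: values drawn
# uniformly from [-a, +a]. Random values break symmetry between
# units; small magnitudes keep sigmoid units out of saturation.
import random

def init_weights(n_in, n_out, a=0.5, seed=0):
    rng = random.Random(seed)
    # One row of fan-in weights per output unit.
    return [[rng.uniform(-a, a) for _ in range(n_in)]
            for _ in range(n_out)]

W = init_weights(n_in=4, n_out=3)
```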
There are two basic weight-update variations, batch-mode and on-line. In batch-mode, every pattern p is evaluated to obtain the single-pattern derivative terms ∂Ep/∂w; these are summed over all patterns to obtain the total derivative
∂E/∂w = Σp ∂Ep/∂w. (5.24)
The ∂Ep/∂w terms are obtained by application of the method of section 5.2 to each pattern p.
The basic steps are
1. For every pattern p in the training set,
(a) apply pattern p and forward propagate to obtain the network outputs, and
(b) calculate the pattern error Ep and back-propagate to obtain the single-pattern derivatives ∂Ep/∂w.
2. Add up all the single-pattern terms to get the total derivative ∂E/∂w.
3. Update the weights: Δw = −η ∂E/∂w.
Each such pass through the training set is called an epoch.
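One batch-mode epoch can be sketched as follows. A one-weight linear model y = w·x with squared error stands in for the network, and the data, η, and epoch count are illustrative assumptions, not values from the text.

```python
# Hedged sketch of one batch-mode epoch: accumulate the
# single-pattern derivatives dEp/dw over all patterns, then make
# a single weight update for the whole pass.

def batch_epoch(w, patterns, eta):
    total_grad = 0.0
    for x, t in patterns:                # every pattern p
        y = w * x                        # forward propagate
        total_grad += 2.0 * (y - t) * x  # dEp/dw for Ep = (y - t)^2
    return w - eta * total_grad          # one update per epoch

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # consistent with w = 2
w = 0.0
for epoch in range(200):                 # many epochs are needed
    w = batch_epoch(w, data, eta=0.01)
```

With small η each epoch reduces the error only slightly, which is why many epochs are needed, as noted above.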
The gradient is calculated exactly and weight changes are proportional to the gradient so batch-mode learning approximates gradient descent when the step size η is small (figure 5.4). In general, each weight update reduces the error by only a small amount so many epochs are needed to minimize the error.
An alternative to batch-mode is on-line or pattern-mode learning. In on-line learning, the weights are updated after each pattern presentation. Generally, a pattern p is chosen at random and presented to the network. The output is compared with the target for that pattern and the errors are back-propagated to obtain the single-pattern derivative ∂Ep/∂w. The weights are then updated immediately, using the gradient of the single-pattern error. Generally, the patterns are presented in a random, constantly changing order to avoid cyclic effects.
The steps are:
1. Choose a pattern p at random and apply it to the network.
2. Forward propagate to obtain the network outputs.
3. Calculate the pattern error Ep and back-propagate to obtain the single-pattern derivative ∂Ep/∂w.
4. Update the weights immediately: Δw = −η ∂Ep/∂w.
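This on-line loop can be sketched using the same toy one-weight model y = w·x; the data, η, seed, and update count are illustrative assumptions.

```python
# Hedged sketch of on-line (pattern-mode) learning: pick a pattern
# at random and update the weight immediately from dEp/dw alone,
# rather than summing over the whole training set.
import random

def online_train(patterns, eta=0.02, n_updates=2000, seed=1):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_updates):
        x, t = rng.choice(patterns)  # random, constantly changing order
        y = w * x                    # forward propagate
        grad_p = 2.0 * (y - t) * x   # single-pattern derivative dEp/dw
        w -= eta * grad_p            # update immediately
    return w

w = online_train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

Note that no per-pattern gradient is stored; each is used and discarded, which is the storage advantage mentioned below.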
An advantage of this approach is that there is no need to store and sum the individual ∂Ep/∂w contributions; each pattern derivative is evaluated, used immediately, and then discarded. This may make hardware implementation easier when resources are limited. Another possible advantage is that many more weight updates occur in a given amount of time. If the training set contains M patterns, for example, on-line learning would make M weight changes in the time that batch-mode learning makes only one.
A possible disadvantage (from an analysis standpoint, at least) is that this is no longer a simple approximation to gradient descent. Figure 5.5 illustrates the relationship between the true gradient and the single-pattern terms in a typical case. The single-pattern derivatives can be viewed as noisy estimates of the true gradient. As a group, they sum to the gradient, but each has a random deviation that need not be small. When the gradient is strong, the average single-pattern derivative has a positive projection on the gradient (because the gradient is the sum of the single-pattern terms), so the error usually decreases after most weight changes. Still, there may be terms with negative projections or large orthogonal deviations that cause the error to increase after some updates. On average, though, the weight change will at least move downhill even if it does not take the most direct path.
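A small numeric check makes the point concrete. Using the same toy model y = w·x with squared error, the single-pattern derivatives sum to the exact batch gradient, yet an individual term can point the opposite way; the particular data values and the point w = 1 are chosen only to exhibit this.

```python
# The single-pattern derivatives dEp/dw sum to the total gradient,
# but individual terms can deviate from it, even in sign.

def per_pattern_grads(w, patterns):
    """dEp/dw for Ep = (w*x - t)^2, one value per pattern."""
    return [2.0 * (w * x - t) * x for x, t in patterns]

data = [(1.0, 0.5), (2.0, 4.0), (3.0, 5.0)]
w = 1.0
grads = per_pattern_grads(w, data)  # [1.0, -8.0, -12.0]
total = sum(grads)                  # the exact batch gradient
```

Here the first pattern's derivative is positive while the total gradient is negative, so an update based on that pattern alone would briefly move uphill.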
On-line learning differs from pure gradient descent in that the sum (5.24) is never evaluated exactly; because the weights change after each pattern, the individual terms are evaluated at different points. The difference is minimal when η is very small: the weights change little between steps, so the effect after all patterns have been evaluated is approximately the same as if all terms had been evaluated at a single point and summed to make a single weight change.
Very small learning rates tend to make learning very slow, however, so larger values are often used and the stochastic elements become important. Instead of following a smooth trajectory down the gradient, the weight vector tends to jitter around the E(w) surface, mostly moving downhill, but occasionally jumping uphill (figure 5.6). To a first approximation, the magnitude of the jitter is proportional to the learning rate η. The randomness arises because training patterns are selected in a random, constantly changing order. Cyclic, fixed orders are generally avoided because of the possibility of convergence to a limit cycle (figure 5.7).
This randomness has advantages and disadvantages. On the plus side, it gives the algorithm some stochastic search properties. When pure gradient descent arrives at a local minimum, it is simply stuck. In on-line (per-pattern) learning, however, the weight state tends to jitter around its equilibrium value. Instead of sitting quietly at the minimum, it visits many nearby points and occasionally, if chance allows, may bounce out of a poor minimum and find a better solution. Thus, on-line learning may have a better chance of finding a global minimum than true gradient descent. On the minus side, the weight vector never settles to a stable value. Having found a good minimum, it may then wander off. Also on the minus side, when the jitter is very large, it may completely hide any deterministic gradient information so the system may be unable to follow subtle paths in the E(w) surface.
It is common, therefore, to adjust the learning rate as training progresses. The simplest scheme is to start with an intermediate value, let the system train to approximate convergence, and then gradually reduce the learning rate to zero to allow the system to settle to the minimum. The learning rate can also be adjusted dynamically depending on conditions encountered during training. It is desirable to maintain a balance between stochastic search and efficient progress down the error gradient. The learning rate should not be so large that any single weight update (or likely sequence of updates) can move the weight state to a completely new area of the weight space, but it should not be so small that the system merely approximates gradient descent. If a large majority of single-pattern vectors have a common direction (positive projection on the average vector, the gradient) then the learning rate can probably be increased. If the single-pattern vectors have no apparent common direction then the learning rate should be reduced. This will occur near a minimum, where the gradient goes to zero because the single-pattern vectors cancel. It may also occur at the bottom of a "ravine" in the error surface, where the single-pattern vectors often group into two bundles pointing in opposite directions across the long axis of the ravine. In these cases, smaller learning rates would allow the system to settle to the minimum. Section 9.8 describes the "search then converge" algorithm, an adaptive learning method that controls the learning rate automatically.
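The simple start-then-decay scheme above can be sketched as a schedule. The hyperbolic 1/(1 + t/τ) form and the constants below are assumptions chosen for illustration; they are not the text's formula, and the "search then converge" algorithm itself is described in section 9.8.

```python
# Illustrative learning-rate schedule: hold an intermediate rate
# early on (stochastic search), then decay toward zero so the
# weight state can settle into a minimum.

def learning_rate(t, eta0=0.5, tau=100.0):
    """Rate at update step t: ~eta0 for t << tau, ~eta0*tau/t for t >> tau."""
    return eta0 / (1.0 + t / tau)

rates = [learning_rate(t) for t in (0, 100, 1000)]
```

Early steps see a rate near η0, preserving the jitter that aids search; late steps see a shrinking rate, reducing the jitter so the system can converge.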
A side note about terminology: The label "on-line learning" may be confusing because it implies that learning may occur in the field during normal operation and that it is not necessary to take the system off-line for training. But on-line learning, like batch-mode learning, is normally done off-line during a separate training phase with controlled data sets. The label "pattern-mode learning" is sometimes used instead.