### Chapter 5 - Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II

## 5.3 Back-Propagation: The Weight Update Algorithm

Having obtained the derivatives, the next step is to update the weights so as to decrease the error. As noted earlier, the term back-propagation refers to (1) an efficient method to calculate the derivatives ∂E/∂w and (2) an optimization algorithm that uses those derivatives to adjust the weights to reduce the error. Once the derivatives are available, we have the choice of continuing with back-propagation, the optimization algorithm, or using one of many alternative optimization methods that may be better suited to the given problem.

Back-propagation (the optimization method) is basically equivalent to gradient descent. By definition, the gradient of E points in the direction that increases E the fastest. In order to minimize E, the weights are adjusted in the opposite direction. The weight update formula is

$$\Delta w = -\eta \frac{\partial E}{\partial w} \qquad (5.23)$$

where the learning rate η > 0 is a small positive constant. Sometimes η is also called the step size parameter. If the derivative is positive (so that increases in w cause increases in E), then the weight change is negative, and vice versa. This approaches pure gradient descent as η becomes infinitesimal. Very small η values mean very long learning times, though, so larger rates are usually used. Typical values are in the range 0.05 < η < 0.75. (This is just a rule of thumb, however; see section 6.1 for more discussion.)
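The sign behavior of rule (5.23) can be checked in a few lines. This is an illustrative sketch, not code from the text; the function name, weight, and derivative values are invented:

```python
def weight_update(w, dE_dw, eta=0.1):
    """One gradient-descent step, delta_w = -eta * dE/dw, as in (5.23)."""
    return w - eta * dE_dw

# Positive derivative (raising w raises E): the weight change is negative.
print(weight_update(0.5, 2.0))   # moves w from 0.5 down toward 0.3
# Negative derivative: the weight change is positive.
print(weight_update(0.5, -2.0))  # moves w from 0.5 up toward 0.7
```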

The network is usually initialized with small random weights. Values are often selected uniformly from a range [-a, +a] where 0.1 < a < 2 typically. Random values are needed to break symmetry while small values are necessary to avoid immediate saturation of the sigmoid nonlinearities. Chapter 7 considers initialization methods in more detail.
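A minimal initialization sketch follows; the layer shape and the choice a = 0.5 are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

a = 0.5                                # half-width of the init range [-a, +a]
rng = np.random.default_rng(0)         # seeded for reproducibility
W = rng.uniform(-a, a, size=(4, 3))    # e.g. 4 units with 3 inputs each

# Random values break the symmetry between units; keeping |w| <= a small
# avoids starting the sigmoids in their flat, saturated regions.
assert np.all(np.abs(W) <= a)
```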

### 5.3.1 Batch Learning

There are two basic weight-update variations, batch-mode and on-line. In batch-mode, every pattern p is evaluated to obtain the derivative terms ∂Ep/∂w; these are summed to obtain the total derivative

$$\frac{\partial E}{\partial w} = \sum_p \frac{\partial E_p}{\partial w} \qquad (5.24)$$
and only then are the weights updated. This follows from the derivative rule for sums. The individual ∂Ep/∂w terms are obtained by applying the method of section 5.2 to each pattern p.

The basic steps are

• For every pattern p in the training set,

1. apply pattern p and forward propagate to obtain network outputs, and

2. calculate the pattern error Ep and back-propagate to obtain the single-pattern derivatives ∂Ep/∂w.

• Add up all the single-pattern terms to get the total derivative.

• Update the weights.

• Repeat.

Each such pass through the training set is called an epoch.

The gradient is calculated exactly and weight changes are proportional to the gradient so batch-mode learning approximates gradient descent when the step size η is small (figure 5.4). In general, each weight update reduces the error by only a small amount so many epochs are needed to minimize the error.
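The batch procedure above can be sketched in code. For brevity this uses a single linear output unit with squared error in place of a full multilayer network, so the back-propagation step reduces to a one-layer derivative; the point is the structure — accumulate all ∂Ep/∂w terms, then make one update per epoch. All names and data are illustrative:

```python
import numpy as np

def batch_epoch(w, X, T, eta=0.05):
    """One batch-mode epoch: sum dEp/dw over all patterns, then update once."""
    grad = np.zeros_like(w)
    for x, t in zip(X, T):
        y = w @ x                # forward propagate pattern p
        grad += (y - t) * x      # dEp/dw for Ep = 0.5*(y - t)^2, summed as in (5.24)
    return w - eta * grad        # single weight update per epoch, rule (5.23)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(8, 2))    # 8 training patterns
T = X @ np.array([0.7, -0.3])              # targets from a known linear rule
w = rng.uniform(-0.1, 0.1, size=2)         # small random initial weights
for epoch in range(500):
    w = batch_epoch(w, X, T)
print(np.mean((X @ w - T) ** 2))           # training error after 500 epochs
```

As the text notes, each update shrinks the error only slightly, which is why hundreds of epochs are run here.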

### 5.3.2 On-Line Learning

An alternative to batch-mode is on-line or pattern-mode learning. In on-line learning, the weights are updated after each pattern presentation. Generally, a pattern p is chosen at random and presented to the network. The output is compared with the target for that pattern and the errors are back-propagated to obtain the single-pattern derivative ∂Ep/∂w. The weights are then updated immediately, using the gradient of the single-pattern error. Typically, the patterns are presented in a random, constantly changing order to avoid cyclic effects.

The steps are:

1. Pick a pattern p at random from the training set,

2. apply pattern p and forward propagate to obtain network outputs, and

3. calculate the pattern error Ep and back-propagate to obtain the single-pattern derivatives ∂Ep/∂w.

4. Update the weights immediately using the single-pattern derivative.

5. Repeat.
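The same linear stand-in unit used for the batch sketch illustrates the on-line variant; the only structural change is that each randomly chosen pattern triggers an immediate update. Again, all names and data are invented for illustration:

```python
import numpy as np

def online_step(w, x, t, eta=0.05):
    """Update the weights immediately from one pattern's derivative dEp/dw."""
    y = w @ x                       # forward propagate pattern p
    return w - eta * (y - t) * x    # immediate update from the single-pattern term

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(8, 2))    # 8 training patterns
T = X @ np.array([0.7, -0.3])              # targets from a known linear rule
w = rng.uniform(-0.1, 0.1, size=2)         # small random initial weights
for step in range(2000):
    p = rng.integers(len(X))               # random, constantly changing order
    w = online_step(w, X[p], T[p])
print(np.mean((X @ w - T) ** 2))           # training error after 2000 updates
```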

An advantage of this approach is that there is no need to store and sum the individual ∂Ep/∂w contributions; each pattern derivative is evaluated, used immediately, and then discarded. This may make hardware implementation easier when resources are limited. Another possible advantage is that many more weight updates occur in a given amount of time. If the training set contains M patterns, for example, on-line learning makes M weight changes in the time that batch-mode learning makes only one.

A possible disadvantage (from an analysis standpoint, at least) is that this is no longer a simple approximation to gradient descent. Figure 5.5 illustrates the relationship between the true gradient and the single-pattern terms in a typical case. The single-pattern derivatives can be viewed as noisy estimates of the true gradient. As a group, they sum to the gradient but each has a random deviation that need not be small. When the gradient is strong, the average single-pattern derivative has a positive projection on the gradient (because it is the sum of the single-pattern terms) so the error usually decreases after most weight changes. Still, there may be terms with negative projections or large orthogonal deviations which may cause the error to increase after some updates. On average though, the weight change will at least move downhill even if it doesn't take the most direct path.

On-line learning differs from pure gradient descent in that the sum (5.24) is never evaluated exactly; because the weights change after each pattern, the individual terms are evaluated at different points. The difference is minimal when η is very small: the weights change little between steps, so the net effect after all patterns have been evaluated is approximately the same as if every term had been evaluated at a single point and summed to make a single weight change.

Very small learning rates tend to make learning very slow, however, so larger values are often used and the stochastic elements become important. Instead of following a smooth trajectory down the gradient, the weight vector tends to jitter around the E(w) surface, mostly moving downhill, but occasionally jumping uphill (figure 5.6). To a first approximation, the magnitude of the jitter is proportional to the learning rate η. The randomness arises because training patterns are selected in a random, constantly changing order. Cyclic, fixed orders are generally avoided because of the possibility of convergence to a limit cycle (figure 5.7).

This randomness has advantages and disadvantages. On the plus side, it gives the algorithm some stochastic search properties. When pure gradient descent arrives at a local minimum, it is simply stuck. In on-line (per-pattern) learning, however, the weight state tends to jitter around its equilibrium value. Instead of sitting quietly at the minimum, it visits many nearby points and occasionally, if chance allows, may bounce out of a poor minimum and find a better solution. Thus, on-line learning may have a better chance of finding a global minimum than true gradient descent. On the minus side, the weight vector never settles to a stable value. Having found a good minimum, it may then wander off. Also on the minus side, when the jitter is very large, it may completely hide any deterministic gradient information so the system may be unable to follow subtle paths in the E(w) surface.