Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

9.3 Delta-Bar-Delta

Jacobs' delta-bar-delta algorithm [194] is one of the more often mentioned acceleration methods. Although some newer methods seem to perform better, it is well-known and many other methods are based on similar ideas. It is based on four heuristics:

  1. Every parameter should have its own learning rate. It is not reasonable for every parameter to have the same learning rate because of differences in scaling, variance, and so on in different parts of the network.

  2. Every learning rate should be allowed to vary over time because local properties of the error surface change as the weight vector moves over it. Learning rates that are appropriate in one area may not be appropriate in other areas.

  3. The learning rate can be increased when the partial derivative of the error has the same sign over several steps. This tends to mean that the error surface has a small curvature and continues to slope in the same direction for some distance so it should be safe to increase the step size.

  4. The learning rate should be decreased when the partial derivative changes sign several times in a row. This tends to mean that the weight vector is bouncing back and forth across a minimum and corresponds to high curvature in the error surface along that direction.

These heuristics lead to the following adjustment rule. Each weight w has its own learning rate η(t), which is adjusted after each epoch according to

\Delta\eta(t) =
\begin{cases}
k & \text{if } \bar{\delta}(t-1)\,\delta(t) > 0 \\
-\phi\,\eta(t) & \text{if } \bar{\delta}(t-1)\,\delta(t) < 0 \\
0 & \text{otherwise}
\end{cases} \qquad (9.2)

where δ(t) = ∂E(t)/∂w(t) is the current partial derivative of the error with respect to the weight and δ̄(t) is an exponential average of the current and past values of δ

\bar{\delta}(t) = (1-\theta)\,\delta(t) + \theta\,\bar{\delta}(t-1) \qquad (9.3)

(Note, this δ is not the δ used in back-propagation.) The learning rate is incremented by a constant k when δ and δ̄ have the same sign in consecutive iterations and it is decremented by a fraction φ of its current value when they have different signs. Note that the increase is linear while the decrease is exponential. The learning rate increases gradually when many consecutive steps all move in the same direction, but decreases quickly when conditions change.

As in normal back-propagation, the weight update is

\Delta w(t) = -\eta(t)\,\frac{\partial E(t)}{\partial w(t)} \qquad (9.4)

This is no longer equivalent to gradient descent on the error surface, however, because each weight has its own learning rate. In effect, the weights are updated based on partial derivatives plus estimates of curvature.

Typical parameter values can be obtained from the simulation results for several small problems reported by Jacobs [194]. Initial learning rates were η0 = 0.8 to 1, with k = 0.03 to 0.1 and φ = 0.1 to 0.3 depending on the problem. Harder problems seem to require smaller values of k and larger values of φ. This corresponds to a cautious policy: small increases in the learning rate when things are going well and large decreases when things go badly. The averaging parameter θ, 0 < θ < 1, does not seem to be critical; θ = 0.7 was used in all cases. Larger values, approaching 1, give longer averaging times.
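A minimal NumPy sketch of one delta-bar-delta epoch is given below. It is our own illustration rather than the authors' code: the function name delta_bar_delta_step and the flat-array layout are assumptions, and the default k, phi, and theta simply follow the typical values quoted above.

    import numpy as np

    def delta_bar_delta_step(w, grad, eta, delta_bar, k=0.05, phi=0.2, theta=0.7):
        # w, grad, eta, delta_bar: arrays of the same shape holding the weights,
        # the batch gradient dE/dw for this epoch (delta(t)), the per-weight
        # learning rates, and the smoothed past gradient delta_bar(t-1).
        same_sign = delta_bar * grad > 0
        opposite_sign = delta_bar * grad < 0
        eta = eta + k * same_sign                            # linear increase (9.2)
        eta = eta - phi * eta * opposite_sign                # exponential decrease (9.2)
        delta_bar = (1 - theta) * grad + theta * delta_bar   # smoothing (9.3)
        w = w - eta * grad                                   # weight update (9.4)
        return w, eta, delta_bar

At the start of training, eta would be filled with the initial rate η0 and delta_bar with zeros; the returned eta and delta_bar are carried over to the next epoch, and grad is the gradient accumulated over a full pass through the training set.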

These heuristics can fail in certain cases, however. The ideal situation would be to have a separate learning rate for each direction identified by an eigenvector of the local Hessian matrix; instead, the algorithm has separate learning rates only for the coordinate directions in the E(w) space. In the case of a ravine oriented at 45° to two weight axes, for example, these heuristics cause the learning rates of both weights to decrease when the best option would be for them to increase together. Because the method is based on local computations only, the two weight changes cannot be coordinated; when changing one weight, the behavior of the other weights is not considered.
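A small numerical illustration of this failure mode (our own construction, not from the book): for a quadratic ravine rotated 45° to the weight axes, a step size that overshoots in the steep cross-ravine direction makes both coordinate partial derivatives change sign on every iteration, so rule (9.2) would shrink both learning rates even though larger, coordinated steps along the ravine floor are what is needed.

    import numpy as np

    # E = 0.5*(a*u^2 + b*v^2) with u along the ravine floor (w1 = w2)
    # and v across it; a << b makes the ravine shallow along u and steep along v.
    a, b = 0.1, 10.0

    def grad(w):
        u = (w[0] + w[1]) / np.sqrt(2)          # along the ravine
        v = (w[0] - w[1]) / np.sqrt(2)          # across the ravine
        return np.array([a * u + b * v, a * u - b * v]) / np.sqrt(2)

    w, eta = np.array([3.0, 1.0]), 0.15         # eta*b > 1: overshoot across the ravine
    for t in range(6):
        g = grad(w)
        print(t, np.sign(g))                    # both sign components flip every step
        w = w - eta * g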

In one empirical test [9], delta-bar-delta was among the fastest methods to learn to classify correctly (with all outputs within a loose tolerance of the desired values) but it was slow to reduce the error to very small values. In [239], it was slower than standard back-propagation with a carefully selected learning rate. The time difference was relatively small, however, and the adaptive method would probably be faster if the time spent in tuning the learning rate for standard back-propagation were included.

According to some reports, delta-bar-delta seems to be more sensitive to parameters than Rprop or quickprop. That is, the default values (k, φ, θ) may work reasonably well on easy problems, but different parameters may be needed on hard problems and it may not be easy to find a good set.

Section 9.4 summarizes a similar method using multiplicative weight increases and momentum. Both are said to be implementations of heuristics proposed by Sutton [363]. Minai and Williams [266] describe an extended delta-bar-delta algorithm that adapts the momentum as well. There are more parameters to be tuned, however.

9.3.1 Justification

Justification for the seemingly ad hoc heuristic of basing the learning rate changes on the signs of successive partial derivatives can be found in [194] and [160]. Assuming a single output node y for simplicity, the mean squared error at epoch t is

E(t) = \tfrac{1}{2}\,\big\langle (d(t) - y(t))^2 \big\rangle \qquad (9.5)

where d(t) is the desired output. The brackets ⟨·⟩ denote the mean over the training set and are dropped in what follows. The derivative of the error with respect to the learning rate η_ij can be written

\frac{\partial E(t)}{\partial \eta_{ij}(t)} = \frac{\partial E(t)}{\partial a_i(t)}\,\frac{\partial a_i(t)}{\partial \eta_{ij}(t)} \qquad (9.6)

where a_i(t) = Σ_j w_ij(t) y_j(t) is the weighted-sum input to node i, y_i(t) = f(a_i(t)) is the node output, and f is the node nonlinearity, for example, the sigmoid function. Because

w_{ij}(t) = w_{ij}(t-1) - \eta_{ij}(t)\,\frac{\partial E(t-1)}{\partial w_{ij}(t-1)} \qquad (9.7)

we have

a_i(t) = \sum_j \left[ w_{ij}(t-1) - \eta_{ij}(t)\,\frac{\partial E(t-1)}{\partial w_{ij}(t-1)} \right] y_j(t) \qquad (9.8)

Differentiation with respect to ηij(t) gives

\frac{\partial a_i(t)}{\partial \eta_{ij}(t)} = -\frac{\partial E(t-1)}{\partial w_{ij}(t-1)}\, y_j(t) \qquad (9.9)

From the back-propagation derivation, equation 5.10, we know

\frac{\partial E(t)}{\partial w_{ij}(t)} = \frac{\partial E(t)}{\partial a_i(t)}\,\frac{\partial a_i(t)}{\partial w_{ij}(t)} \qquad (9.10)

and

\frac{\partial a_i(t)}{\partial w_{ij}(t)} = y_j(t) \qquad (9.11)

Combining these results allows (9.6) to be rewritten

\frac{\partial E(t)}{\partial \eta_{ij}(t)} = -\,\frac{\partial E(t)}{\partial w_{ij}(t)}\,\frac{\partial E(t-1)}{\partial w_{ij}(t-1)} \qquad (9.12)

This says that the derivative of the error with respect to the learning rate η_ij is the negative of the product of the present and previous derivatives of the error with respect to the weight w_ij. Rather than being an ad hoc heuristic, this is actually a well-founded way of doing gradient descent on the error with respect to the learning rate. The delta-bar-delta update rule (9.2) modifies this slightly by smoothing the past derivatives into δ̄.
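Equation (9.12) is easy to check numerically. The Python sketch below (our own verification, with illustrative values) compares a central-difference estimate of ∂E(t)/∂η_ij for a single sigmoid node against the product of gradients on the right-hand side of (9.12).

    import numpy as np

    f = lambda a: 1.0 / (1.0 + np.exp(-a))       # node nonlinearity (sigmoid)

    x = np.array([0.5, -1.2, 0.8])               # inputs y_j
    d = 0.9                                      # desired output
    w_prev = np.array([0.3, -0.4, 0.1])          # weights w(t-1)
    eta = np.array([0.2, 0.2, 0.2])              # per-weight learning rates

    def dE_dw(w):
        # gradient of E = 0.5*(d - f(w.x))^2 with respect to w
        y = f(w @ x)
        return -(d - y) * y * (1 - y) * x

    def E_after_update(eta_j, j):
        # error at epoch t when weight j is updated with learning rate eta_j (eq. 9.7)
        e = eta.copy()
        e[j] = eta_j
        w_new = w_prev - e * dE_dw(w_prev)
        y = f(w_new @ x)
        return 0.5 * (d - y) ** 2

    j, h = 0, 1e-6
    numeric = (E_after_update(eta[j] + h, j) - E_after_update(eta[j] - h, j)) / (2 * h)
    w_now = w_prev - eta * dE_dw(w_prev)
    analytic = -dE_dw(w_now)[j] * dE_dw(w_prev)[j]   # right-hand side of (9.12)
    print(numeric, analytic)                         # the two values agree closely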