
Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.10 Other Heuristics

9.10.1 Gradient Reuse

Hush and Salas [182] suggest stepping along the line of the computed gradient as long as the error continues to decrease. This is similar to Cauchy's method (section 10.5.2), but it does not search for the exact minimum on the line; it takes fixed-size steps along the line rather than, say, doing a bisection search. As in Cauchy's method, there is a saving because no new gradient calculation is needed for each successful step along the line. The step size is increased when the reuse rate is high (indicating that steps are too small) and decreased when the reuse rate is low (indicating that the step size is too large).
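The following Python sketch illustrates one possible reading of this heuristic. The function names (error_fn, grad_fn), the normalization of the search direction, and the growth/shrink factors and reuse threshold are illustrative assumptions rather than values from [182].

```python
import numpy as np

def gradient_reuse_step(w, error_fn, grad_fn, eta, max_reuse=10,
                        grow=1.1, shrink=0.7, target_reuse=3):
    """Take fixed-size steps along the negative gradient while the error
    keeps decreasing, then adapt the step size from the reuse count."""
    g = grad_fn(w)                          # one gradient evaluation
    direction = -g / (np.linalg.norm(g) + 1e-12)   # unit step direction (one possible choice)
    err = error_fn(w)
    reuses = 0
    while reuses < max_reuse:
        w_new = w + eta * direction         # reuse the same gradient
        err_new = error_fn(w_new)
        if err_new >= err:                  # stop when the error no longer decreases
            break
        w, err = w_new, err_new
        reuses += 1
    # high reuse count -> steps too small; low reuse count -> step size too large
    eta = eta * grow if reuses >= target_reuse else eta * shrink
    return w, eta
```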

9.10.2 Gradient Correlation

Franzini [126], Chan and Fallside [67] and Schreibman and Norris [336] describe gradient correlation methods that monitor the angle between successive gradient vectors to control the learning rate. An advantage of this approach is that a major change in gradient direction can be detected and the learning rate reduced before taking a step, thus reducing the need to retract bad steps.

The gradient correlation measures the cosine of the angle between successive values of the gradient g(t),

\[
\cos\theta(t) = \frac{\mathbf{g}(t)^{T}\,\mathbf{g}(t-1)}{\|\mathbf{g}(t)\|\;\|\mathbf{g}(t-1)\|}. \tag{9.25}
\]

When the vectors are nearly parallel, cos θ ≈ +1 and the learning rate can probably be increased. When cos θ < 0, the gradient has doubled back on itself to some extent and the learning rate should be decreased.
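In Python, the correlation of equation 9.25 can be computed directly from the stored previous gradient; the small constant added to the denominator is only a guard against division by zero.

```python
import numpy as np

def gradient_correlation(g_t, g_prev, eps=1e-12):
    """Cosine of the angle between successive gradient vectors, eq. (9.25).
    Near +1: gradients nearly parallel; below 0: direction has reversed."""
    return float(np.dot(g_t, g_prev) /
                 (np.linalg.norm(g_t) * np.linalg.norm(g_prev) + eps))
```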

In [126] the learning rate is adjusted according to

\[
\eta(t+1) =
\begin{cases}
\beta^{+}\,\eta(t), & \cos\theta(t) \ge 0,\\
\beta^{-}\,\eta(t), & \cos\theta(t) < 0,
\end{cases}
\tag{9.26}
\]

where values of β+ = 1.005 and β- = 0.8 are suggested. This tends to keep η near the largest value for which successive gradients remain nearly parallel and eliminates the oscillatory cross-stitching behavior in ravines of the error surface. In the single-problem benchmark [9], this method was slightly slower than standard back-propagation in learning to classify the training set, but when used with momentum it was the fastest method at reducing the error to near zero. Removal of the cos θ term from the η adjustment rule was suggested.
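A minimal sketch of an adjustment rule of this form, using the piecewise reading of equation 9.26 given above together with the suggested β values:

```python
def adjust_learning_rate(eta, cos_theta, beta_plus=1.005, beta_minus=0.8):
    """Grow eta slightly while successive gradients stay nearly parallel;
    shrink it when the gradient direction reverses (cos_theta < 0)."""
    return eta * beta_plus if cos_theta >= 0 else eta * beta_minus
```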

In [336], the learning rate switches between high and low values based on the correlation. It is reduced to its minimum value (0.01) and the momentum is set to 0 as soon as the correlation becomes negative. The momentum then returns to its normal value (0.9) gradually. Modifications may be needed to apply the idea in practice. In [9] it was slow to learn to classify correctly and could not further reduce the error to small values in the given amount of time, but generalization was said to be good.
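A sketch of this switching scheme follows; the high learning-rate value and the schedule by which momentum recovers are not specified in the text and are assumed here, while the 0.01, 0, and 0.9 values come from the description above.

```python
def switch_rates(cos_theta, momentum, eta_high=0.5, eta_min=0.01,
                 momentum_max=0.9, recovery=0.1):
    """Drop to the minimum learning rate and zero the momentum when the
    gradient correlation goes negative; otherwise use the high learning
    rate and let momentum climb gradually back toward its normal value."""
    if cos_theta < 0:
        return eta_min, 0.0
    momentum = momentum + recovery * (momentum_max - momentum)  # gradual return (schedule assumed)
    return eta_high, momentum
```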

9.10.3 Pattern Weighting Heuristics

A number of heuristics attempt to focus attention on the patterns with the worst errors. Often, this can be viewed as a modification of the error function to one which gives more emphasis to larger errors. If attention is focused only on the pattern with the largest error, the ideal result is to minimize the maximum error. Cater [65] gives each pattern a different weighting. Basically, the method identifies the pattern with the worst error and roughly doubles its learning rate in the next epoch.
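As a sketch, the per-pattern learning rates for the next epoch might be set as follows; the factor of 2 mirrors the "roughly doubles" description, and the remaining details are illustrative rather than taken from [65].

```python
import numpy as np

def per_pattern_rates(per_pattern_errors, base_eta, boost=2.0):
    """Give every pattern the base learning rate, except the pattern with
    the worst error, which gets roughly double the rate for the next epoch."""
    etas = np.full(len(per_pattern_errors), base_eta, dtype=float)
    etas[int(np.argmax(per_pattern_errors))] *= boost
    return etas
```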

The heuristic of "learn only if misclassified," used in [329] and later work, says that the actual output values do not really matter for classification problems as long as the classification is unambiguous. A tolerance band is defined and the error is considered to be zero for all outputs within the band. If the target is 0 and the output is 0.06, for example, the classification is obvious and there is no need to adjust the weights for this pattern.
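A sketch of the tolerance-band idea: output errors smaller than the band width are zeroed, so those outputs contribute no weight update. The band width of 0.1 is an assumed value.

```python
import numpy as np

def tolerance_band_error(outputs, targets, band=0.1):
    """Zero the error for outputs already within `band` of their 0/1 targets,
    so unambiguous classifications produce no weight change."""
    err = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    err[np.abs(err) < band] = 0.0
    return err
```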

Many methods like this can be considered modifications of the error function and will, in general, lead to solutions different from those of the mean-squared-error function. This may be a drawback if the goal is actually to minimize the mean squared error, but that is normally not the case for classification problems.