Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.4 Silva and Almeida

Delta-bar-delta is one of the better-known adaptive learning rate methods. Silva and Almeida [346] proposed a related method that uses multiplicative learning rate increases and a momentum-like smoothed gradient. Both are said to be implementations of heuristics proposed by Sutton [363]. The method is similar to that of Vogl et al., but with a separate learning rate for each weight.

The weight update rule is

\[
\Delta w_{ij}(t) = -\eta_{ij}(t)\,\frac{\partial E}{\partial w_{ij}}(t)
\tag{9.13}
\]

where $\eta_{ij}(t)$ is the learning rate for weight $w_{ij}$ at epoch $t$. The learning rate is adapted at each epoch according to

\[
\eta_{ij}(t) =
\begin{cases}
u\,\eta_{ij}(t-1) & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t)\,\dfrac{\partial E}{\partial w_{ij}}(t-1) > 0,\\[6pt]
d\,\eta_{ij}(t-1) & \text{otherwise,}
\end{cases}
\tag{9.14}
\]

where constants u > 1 and 0 < d < 1 control the rate of increases and decreases. Typical values are 1.1 < u < 1.3 and d slightly below 1/u, for example, d = 0.7. This gives a slight preference to learning rate decreases, making the system more stable.
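As a concrete illustration, the following is a minimal NumPy sketch of the update rules (9.13) and (9.14) for a single weight array; the function name, the default values u = 1.2 and d = 0.7, and the vectorized form are illustrative choices, not part of the published algorithm.

```python
import numpy as np

def silva_almeida_step(w, grad, prev_grad, eta, u=1.2, d=0.7):
    """One epoch of the per-weight update sketched in (9.13)-(9.14).

    w, grad, prev_grad, and eta are arrays of the same shape: the
    weights, the current gradient dE/dw, the gradient from the previous
    epoch, and the per-weight learning rates.  u > 1 and 0 < d < 1 are
    the multiplicative increase and decrease factors.
    """
    # (9.14): grow the rate where the gradient kept its sign,
    # shrink it where the sign flipped (a symptom of overshooting).
    same_sign = grad * prev_grad > 0
    eta = np.where(same_sign, u * eta, d * eta)

    # (9.13): gradient descent with an individual rate per weight.
    w = w - eta * grad
    return w, eta
```

Here prev_grad is the gradient saved from the previous epoch; on the first epoch it can simply be set equal to the current gradient and the per-weight rates to a small constant.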

In contrast to Jacobs' delta-bar-delta method, where the learning rate increases incrementally (additively), here both increases and decreases are multiplicative. This allows the learning rate to grow faster and may speed convergence, but it can also lead to instability. If the learning rate becomes too large, the error may jump abruptly (e.g., when the system oversteps a minimum and "climbs up a cliff"). To avoid instability, the bad weight change is retracted. In most cases, reapplying the learning rate update rule (9.14) with the gradient evaluated at the rejected point reduces the learning rate enough to avoid the bad step on following iterations; if not, the learning rate may need to be decreased directly. A benchmarking study [320] suggests that if the algorithm fails to find an error decrease after five consecutive iterations, all the learning rate parameters should be halved.
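This safeguard might be wrapped around the basic step roughly as follows. The sketch reuses silva_almeida_step from above and assumes a routine (here called loss_and_grad) that returns the total error and gradient over the training set; the structure, names, and retry limit are illustrative.

```python
def safeguarded_epoch(w, eta, prev_grad, loss_and_grad,
                      u=1.2, d=0.7, max_retries=5):
    """Apply (9.13)-(9.14), retracting steps that increase the error.

    loss_and_grad(w) is assumed to return (E, dE/dw) over the training
    set.  After max_retries consecutive failures, all per-weight rates
    are halved, following the benchmarking suggestion cited above.
    """
    e0, grad = loss_and_grad(w)
    for _ in range(max_retries):
        w_new, eta_new = silva_almeida_step(w, grad, prev_grad, eta, u, d)
        e1, grad_at_new = loss_and_grad(w_new)
        if e1 <= e0:
            # Accept: the gradient used for this step becomes prev_grad
            # for the next epoch.
            return w_new, eta_new, grad
        # Retract the bad step: keep the old weights, but reapply (9.14)
        # against the gradient seen at the rejected point, which usually
        # shrinks the offending rates enough to avoid repeating the step.
        prev_grad, eta = grad_at_new, eta_new
    # No improvement after max_retries attempts: halve every rate.
    return w, 0.5 * eta, grad
```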

Because the learning rate can increase quickly, choosing an initial rate that is too small costs little. Ideally, the algorithm would also correct for an overly large initial learning rate, but sigmoid saturation and instability may cause problems, so it is probably best to start with a small value and let the algorithm increase it as needed.

Performance seems to deteriorate in obliquely oriented ravines in the error surface. In order to better handle these cases, a modified weight update rule was proposed

\[
\Delta w_{ij}(t) = -\eta_{ij}(t)\,v_{ij}(t)
\tag{9.15}
\]

where the 'smoothed gradient' is

\[
v_{ij}(t) = (1-\alpha)\,\frac{\partial E}{\partial w_{ij}}(t) + \alpha\,v_{ij}(t-1)
\tag{9.16}
\]

and 0 ≤ α < 1 functions like the momentum parameter.
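Under the same assumptions as the earlier sketches, the smoothed-gradient variant of (9.15) and (9.16) changes only the direction of the step; the value alpha = 0.5 is a placeholder.

```python
import numpy as np

def silva_almeida_smoothed_step(w, grad, prev_grad, v, eta,
                                u=1.2, d=0.7, alpha=0.5):
    """Variant of the earlier sketch using the smoothed gradient (9.16).

    v carries the smoothed gradient from the previous epoch; alpha
    plays the role of a momentum parameter (0 <= alpha < 1).
    """
    # (9.14): same multiplicative rate adaptation as before.
    same_sign = grad * prev_grad > 0
    eta = np.where(same_sign, u * eta, d * eta)

    # (9.16): exponentially smoothed gradient.
    v = (1.0 - alpha) * grad + alpha * v

    # (9.15): step along the smoothed gradient rather than the raw one.
    w = w - eta * v
    return w, eta, v
```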

It has been reported [9] that methods like this one, which increase the learning rate multiplicatively, can be faster than methods that increase it additively, but they are less stable and their parameters may be harder to tune.