Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

9.6 Rprop

Rprop [315, 314] stands for "resilient propagation." The main difference between it and most other heuristic back-propagation variations is that its learning rate adjustments and weight changes depend only on the signs of the gradient terms, not on their magnitudes. The argument is that the gradient magnitude depends on the scaling of the error function and can change greatly from one step to the next; on a complicated nonlinear error surface, the magnitude is essentially unpredictable a priori, and there is no reason, in general, why the step size should be proportional to it. In fact, it can be argued that the step size should be inversely proportional to the magnitude, in order to take large steps where the gradient is small and small, careful steps where the gradient is large [363].

Rprop is a batch update method; the weights and step sizes are changed once per epoch. Each weight wij has its own step size, or update-value, Δij, which varies with time t according to

$$
\Delta_{ij}(t) =
\begin{cases}
\eta^{+} \cdot \Delta_{ij}(t-1), & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t-1) \cdot \dfrac{\partial E}{\partial w_{ij}}(t) > 0 \\[6pt]
\eta^{-} \cdot \Delta_{ij}(t-1), & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t-1) \cdot \dfrac{\partial E}{\partial w_{ij}}(t) < 0 \\[6pt]
\Delta_{ij}(t-1), & \text{otherwise}
\end{cases}
\tag{9.17}
$$

where 0 < η- < 1 < η+. A change in sign of the partial derivative corresponding to weight wij indicates that the last update was too big and the system has jumped over a minimum, so the update-value Δij is decreased by the factor η-. Consecutive derivatives with the same sign indicate that the system is moving steadily in one direction, so the update-value is increased by the factor η+ in order to accelerate convergence in shallow regions.
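
For concreteness, the adaptation rule (9.17) can be written in a few lines of array code. The following NumPy sketch is ours, not code from [315]; the function and variable names are illustrative, and all arguments are assumed to be arrays shaped like the weight matrix:

```python
import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5   # the default eta+ and eta- discussed below

def adapt_update_values(delta, grad, prev_grad):
    """Eq. (9.17): grow each update-value where the derivative kept its
    sign, shrink it where the sign flipped, leave it alone otherwise."""
    same = grad * prev_grad                               # sign of the product decides
    delta = np.where(same > 0, delta * ETA_PLUS, delta)   # steady direction: grow
    delta = np.where(same < 0, delta * ETA_MINUS, delta)  # overshot a minimum: shrink
    return delta
```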

The weights are changed according to

$$
\Delta w_{ij}(t) =
\begin{cases}
-\Delta_{ij}(t), & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t) > 0 \\[6pt]
+\Delta_{ij}(t), & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t) < 0 \\[6pt]
0, & \text{otherwise}
\end{cases}
\qquad
w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)
\tag{9.18}
$$

Note that the change depends only on the sign of the partial derivative and is independent of its magnitude. If the derivative is positive, the weight is decremented by Δij(t); if the derivative is negative, the weight is incremented by Δij(t).
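
In the same array notation, (9.18) collapses to a one-liner; np.sign returns 0 for a zero derivative, which reproduces the "otherwise" case (again a sketch of ours, not code from the original paper):

```python
import numpy as np

def weight_change(grad, delta):
    """Eq. (9.18): step against the sign of each derivative by the
    corresponding update-value; np.sign(0) = 0 gives the zero case."""
    return -np.sign(grad) * delta
```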

There is one exception. If the partial derivative changes sign (indicating that the previous step was too large and a minimum was missed), the previous weight-update is retracted

$$
\Delta w_{ij}(t) = -\Delta w_{ij}(t-1),
\qquad \text{if } \dfrac{\partial E}{\partial w_{ij}}(t-1) \cdot \dfrac{\partial E}{\partial w_{ij}}(t) < 0
\tag{9.19}
$$

Because this would cause another sign change on the next step, leading Δij(t) to be decreased again, the update-value is not adapted on the next step. In software, this can be achieved by storing ∂E/∂wij(t) = 0 in place of the actual derivative, which prevents the change in the next step.
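
Putting (9.17)-(9.19) together, one epoch of Rprop might look like the following sketch. This is our own arrangement, not code from [315]; it assumes the caller supplies the batch gradient, uses the Δmin/Δmax limits discussed below, and implements the derivative-zeroing trick in the last step:

```python
import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5
DELTA_MIN, DELTA_MAX = 1e-6, 50.0

def rprop_epoch(w, grad, state):
    """One batch Rprop update. `state` carries (prev_grad, delta, prev_dw)
    between epochs; all entries are arrays shaped like the weights w."""
    prev_grad, delta, prev_dw = state
    same = grad * prev_grad

    # (9.17), with the Delta_min/Delta_max limits applied
    delta = np.where(same > 0, np.minimum(delta * ETA_PLUS, DELTA_MAX), delta)
    delta = np.where(same < 0, np.maximum(delta * ETA_MINUS, DELTA_MIN), delta)

    # (9.18): sign-only step ...
    dw = -np.sign(grad) * delta
    # ... except (9.19): retract the previous step where the sign flipped
    dw = np.where(same < 0, -prev_dw, dw)

    # store a zero derivative where we backtracked, so the update-value
    # is left alone on the next epoch (the trick described in the text)
    stored_grad = np.where(same < 0, 0.0, grad)
    return w + dw, (stored_grad, delta, dw)
```

A training loop would call this once per epoch with the current batch gradient; the initial state (update-values at Δ0, previous gradient and previous step at zero) follows from the next paragraphs.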

All update-values are initialized to a constant, Δij(0) = Δ0, which determines the size of the first weight change. A reasonable value is Δ0 = 0.1. The best choice is somewhat affected by the size of the initial weights, but does not seem to be critical for simple problems. It is probably better to err in favor of choosing too small a value, because an overly large value could lead to immediate node saturation. In [314], Δ0 = 0.001 was used for the two-spirals problem, but values between 10⁻⁵ and 0.01 gave similar results.

The update-values are restricted to the range Δmin = 10⁻⁶ to Δmax = 50 to avoid floating-point underflow and overflow problems. Limiting Δmax to smaller values, for example, 1, may give smoother decreases in the error at the cost of slower convergence. In [314], Δmax = 0.1 was used for the two-spirals problem.

The value η- = 0.5 was chosen based on the reasoning that when the system overshoots a minimum, the minimum will be halfway between the current and previous weights, on average, so the step size should be reduced to half its previous value.

The value η+ = 1.2 is a compromise. It should be large enough to allow fast growth in flat regions of the error function, but not so large that the system has to immediately reduce the update-value in the next step. The value 1.2 seems to work well on many problems and usually is not critical.

These default values seem to work well for most problems; in most cases, no changes are needed. In [315], only Δmax = 0.001 was changed for the two-spirals problem, in order to avoid early saturation of the sigmoids. In most cases, Δ0 is the only other parameter that needs to be changed, and its value is not critical as long as it is not too large.
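
The defaults above amount to very little setup code. This sketch (function name ours) initializes the state used by the rprop_epoch sketch earlier:

```python
import numpy as np

def rprop_init(w, delta_0=0.1):
    """Default starting state: update-values at Delta_0, previous gradient
    and previous step at zero, so the first epoch takes a plain sign step
    of size Delta_0."""
    zeros = np.zeros_like(w)
    return zeros, np.full_like(w, delta_0), zeros
```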

Although it is not mentioned in the derivation, momentum can be used with beneficial effects on many problems. As usual, very high values of momentum may lead to instability.

In empirical comparisons, Rprop seems to be one of the faster and more reliable heuristic methods over a wide range of problems. There are, of course, cases where other methods do better, but Rprop is often a good choice for initial tests. For certain classification problems, where the error criterion is satisfied as soon as all outputs are within a tolerance (e.g., 0.1) of their target values, it can be faster than second-order gradient methods such as conjugate gradient or Levenberg-Marquardt. This is problem dependent, however.

The success of Rprop can be explained, in part, by two factors. First, one reason for the slow convergence of gradient descent is that the gradient vanishes at a minimum so the step size becomes smaller and smaller as it nears the minimum. The error tends to decrease exponentially: fast at first, but slower later on. With Rprop, the step size does not depend on the magnitude of the gradient so learning does not slow to a crawl in the final stages.

Second, another problem with back-propagation in layered networks is that the derivatives tend to be attenuated as they propagate back from the output layer toward the inputs (see section 6.1.8). Each layer inserts a sigmoid-derivative factor that is less than 1 (at most 0.25 for sigmoid nodes, at most 1 for tanh nodes), with the result that |∂E/∂w| tends to be very small for weights far from the outputs, and learning there is correspondingly slow. Deep networks with many layers have been avoided for this reason, because almost no learning occurs in the initial layers. Heuristic methods for setting different learning rates for each layer have been investigated, but they are difficult to tune by hand, and a fixed learning rate is not necessarily appropriate anyway. Rprop seems to work better than some other adaptive learning rate techniques in this case because the learning rate adjustments and weight updates depend only on the signs of the derivatives, not their magnitudes. Appropriate values can be found for each layer, so early layers learn faster than they would otherwise and deep networks are not as difficult to train.
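
As a rough numerical illustration of the attenuation (our example, not from the book): the logistic sigmoid's derivative is at most 0.25, so a chain of L sigmoid layers multiplies the backpropagated derivative by a factor of at most 0.25^L, ignoring the weights:

```python
# Worst-case attenuation bound from stacking sigmoid-derivative factors:
# the maximum of s'(x) = s(x) * (1 - s(x)) is 0.25, attained at x = 0.
for layers in (1, 2, 4, 8):
    print(layers, "layers -> factor <=", 0.25 ** layers)
# 1 layers -> factor <= 0.25
# 2 layers -> factor <= 0.0625
# 4 layers -> factor <= 0.00390625
# 8 layers -> factor <= 1.52587890625e-05
```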