
Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.7 Quickprop

Fahlman's Quickprop [121] differs from most of the other methods mentioned here in that it is not an adaptive learning rate technique. Like back-propagation, it is a local method; each weight w is considered separately.

It is "based on 2 risky assumptions":

The weight update rule is dominated by a quadratic term

    Δw(t) = [S(t) / (S(t - 1) - S(t))] Δw(t - 1)    (9.20)

where S(t) = ∂E/∂w(t) is the slope of the error with respect to this weight at step t. Call the S(t)/(S(t - 1) - S(t)) factor β. The numerator is the derivative of the error with respect to the weight, and (S(t - 1) - S(t))/Δw(t - 1) is a finite-difference approximation of the second derivative. Together these approximate Newton's method for minimizing a one-dimensional function f(x): Δx = -f'(x)/f''(x). Sutton [363] suggested a similar update term.
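As a concrete illustration, the following sketch (in Python; the function and variable names are chosen here for clarity and are not taken from any particular implementation) computes the quadratic step of (9.20) for a single weight from its current slope, previous slope, and previous change. The full algorithm, discussed next, guards the cases in which the denominator vanishes or the step becomes too large.

    def quickprop_quadratic_step(slope, prev_slope, prev_dw):
        # slope      : S(t),   the current derivative of E with respect to this weight
        # prev_slope : S(t-1), the derivative at the previous step
        # prev_dw    : the previous weight change, delta_w(t-1)
        #
        # beta * prev_dw reproduces eq. (9.20), a secant approximation of
        # Newton's step delta_x = -f'(x)/f''(x) applied to this one weight.
        beta = slope / (prev_slope - slope)
        return beta * prev_dw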

Three cases occur:

  1. If the current slope has the same sign but is somewhat smaller in magnitude than the previous one, then β > 0 and the weight will change again in the same direction. The size of the change will depend on how much the slope was reduced by the previous step.

  2. If the current slope has a different sign from the previous slope, then the weight has crossed over the minimum and is now on the opposite side of the valley. Since β < 0, the next step will backtrack, landing somewhere between the current and previous positions.

  3. The third case occurs when the current slope has the same sign as the previous slope, but is the same size or larger in magnitude. This indicates that the first "risky assumption" was not met and could occur where the function is not well-approximated by a parabola or where the assumed parabola opens downward.

To avoid taking an infinite step or a backward uphill move in case 3, a "maximum growth factor" parameter μ is introduced. No weight change is allowed to be larger than μ times the previous weight change. A value of μ = 1.75 is recommended; chaotic behavior may result if μ is too large.

For cases 1 and 3, an additional term -ηS(t) representing simple gradient descent is added to (9.20) to bootstrap the process when the previous change Δw(t - 1) = 0. It is ignored in case 2 when the current slope is nonzero and differs in sign from the previous one since the quadratic term handles this case well.
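Putting the three cases, the growth limit μ, and the gradient-descent term together, a minimal per-weight sketch might look as follows (Python; the function name, the default η = 0.1, and the exact ordering of the checks are illustrative assumptions rather than Fahlman's reference implementation):

    def quickprop_update(slope, prev_slope, prev_dw, eta=0.1, mu=1.75):
        # slope, prev_slope : S(t) and S(t-1) for this weight
        # prev_dw           : the previous weight change, delta_w(t-1)
        # eta               : fixed learning rate for the gradient-descent term
        # mu                : maximum growth factor (1.75 recommended)
        dw = 0.0
        same_sign = slope * prev_slope > 0.0

        if prev_dw != 0.0:
            if same_sign and abs(slope) >= abs(prev_slope):
                # Case 3: the parabola assumption failed; take the largest
                # allowed step, mu times the previous change, in the same direction.
                dw = mu * prev_dw
            elif prev_slope != slope:
                # Cases 1 and 2: the quadratic step of (9.20), limited so that
                # no change exceeds mu times the previous one.
                dw = slope / (prev_slope - slope) * prev_dw
                limit = mu * abs(prev_dw)
                if abs(dw) > limit:
                    dw = limit if dw > 0.0 else -limit
            # else: the slope did not change at all; leave dw = 0 and rely on
            #       the gradient-descent term below.

        # Cases 1 and 3, and the very first step (prev_dw == 0): add the simple
        # gradient-descent term -eta * S(t) so the weight keeps moving. It is
        # omitted in case 2, where the slope has changed sign.
        if prev_dw == 0.0 or same_sign:
            dw -= eta * slope

        return dw

In use, each weight would carry its own prev_slope and prev_dw from one step to the next, and the routine would be applied to every weight independently after the batch gradient has been accumulated, in keeping with the local, per-weight character of the method.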

In addition to these weight update rules, several other heuristics are sometimes used.

In one of these, the output error derivative -(d - y) is replaced by

-arctanh(d - y).

Strictly speaking, this is not an error function, as it modifies the calculated derivative rather than the error itself. The arctanh goes to ±∞ as its argument approaches ±1, which greatly magnifies the error for output units that are far from their target values. It also tends to cancel the vanishing derivative for nodes that are saturated at the wrong value, but this case is already handled by the sigmoid-prime term. To avoid numerical problems, a value of 17 (-17) is used for inputs greater than 0.9999999 (less than -0.9999999). This assumes the errors are in (-1, +1); simple scale changes will be needed for tanh nonlinearities and other cases. This heuristic is somewhat nonstandard and is not used in most cases.
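Read literally, the clipping rule above can be sketched as follows (Python; the function name is invented, and the sign convention assumes the returned value plays the role of the derivative -arctanh(d - y) described in the text):

    import math

    def arctanh_output_derivative(d, y):
        # Replaces the usual output error derivative -(d - y) by -arctanh(d - y).
        # Beyond +/-0.9999999 the arctanh is replaced by +/-17 to avoid numerical
        # problems; the error d - y is assumed to lie in (-1, +1).
        e = d - y
        if e > 0.9999999:
            return -17.0
        if e < -0.9999999:
            return 17.0
        return -math.atanh(e)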

In empirical comparisons, quickprop is often one of the faster, more reliable methods and outperforms most other heuristic variations of back-propagation on a wide range of problems. Only Rprop seems to be consistently better; it is perhaps somewhat more reliable, has fewer parameters to tune, and seems to be less sensitive to their values.

Quickprop does have a fixed learning rate parameter η that needs to be chosen to suit the problem. It might be possible to use adaptive methods to control this, but no methods have been described.