
Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.7 Quickprop

Fahlman's Quickprop [121] differs from most of the other methods mentioned here in that it is not an adaptive learning rate technique. Like back-propagation, it is a local method; each weight w is considered separately.

It is "based on 2 risky assumptions":

The weight update rule is dominated by a quadratic term

    Δw(t) = [S(t) / (S(t - 1) - S(t))] Δw(t - 1)    (9.20)

where S(t) = ∂E/∂w(t) is the slope of the error with respect to this weight at step t. Call the S(t)/(S(t - 1) - S(t)) factor β. The numerator is the derivative of the error with respect to the weight, and (S(t - 1) - S(t))/Δw(t - 1) is a finite-difference approximation of the second derivative. Together these approximate Newton's method for minimizing a one-dimensional function f(x): Δx = -f'(x)/f''(x). Sutton [363] suggested a similar update term.
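As a concrete illustration, the following sketch (in Python; the function and variable names are chosen here for clarity and are not taken from any particular implementation) computes the quadratic step of (9.20) for a single weight from its current slope, previous slope, and previous change. The full algorithm, discussed next, guards the cases in which the denominator vanishes or the step becomes too large.

    def quickprop_quadratic_step(slope, prev_slope, prev_dw):
        # slope      : S(t),   the current derivative of E with respect to this weight
        # prev_slope : S(t-1), the derivative at the previous step
        # prev_dw    : the previous weight change, delta_w(t-1)
        #
        # beta * prev_dw reproduces eq. (9.20), a secant approximation of
        # Newton's step delta_x = -f'(x)/f''(x) applied to this one weight.
        beta = slope / (prev_slope - slope)
        return beta * prev_dw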

Three cases occur:

  1. If the current slope has the same sign but is somewhat smaller in magnitude than the previous one, then β > 0 and the weight will change again in the same direction. The size of the change will depend on how much the slope was reduced by the previous step.

  2. If the current slope has a different sign from the previous slope, then the weight has crossed over the minimum and is now on the opposite side of the valley. Since β < 0, the next step will backtrack, landing somewhere between the current and previous positions.

  3. The third case occurs when the current slope has the same sign as the previous slope, but is the same size or larger in magnitude. This indicates that the first "risky assumption" was not met and could occur where the function is not well-approximated by a parabola or where the assumed parabola opens downward.

To avoid taking an infinite step or a backward uphill move in case 3, a "maximum growth factor" parameter μ is introduced. No weight change is allowed to be larger than μ times the previous weight change. A value of μ = 1.75 is recommended; chaotic behavior may result if μ is too large.

For cases 1 and 3, an additional term -ηS(t) representing simple gradient descent is added to (9.20) to bootstrap the process when the previous change Δw(t - 1) = 0. It is ignored in case 2 when the current slope is nonzero and differs in sign from the previous one since the quadratic term handles this case well.
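Putting the three cases, the growth limit μ, and the gradient-descent term together, a minimal per-weight sketch might look as follows (Python; the function name, the default η = 0.1, and the exact ordering of the checks are illustrative assumptions rather than Fahlman's reference implementation):

    def quickprop_update(slope, prev_slope, prev_dw, eta=0.1, mu=1.75):
        # slope, prev_slope : S(t) and S(t-1) for this weight
        # prev_dw           : the previous weight change, delta_w(t-1)
        # eta               : fixed learning rate for the gradient-descent term
        # mu                : maximum growth factor (1.75 recommended)
        dw = 0.0
        same_sign = slope * prev_slope > 0.0

        if prev_dw != 0.0:
            if same_sign and abs(slope) >= abs(prev_slope):
                # Case 3: the parabola assumption failed; take the largest
                # allowed step, mu times the previous change, in the same direction.
                dw = mu * prev_dw
            elif prev_slope != slope:
                # Cases 1 and 2: the quadratic step of (9.20), limited so that
                # no change exceeds mu times the previous one.
                dw = slope / (prev_slope - slope) * prev_dw
                limit = mu * abs(prev_dw)
                if abs(dw) > limit:
                    dw = limit if dw > 0.0 else -limit
            # else: the slope did not change at all; leave dw = 0 and rely on
            #       the gradient-descent term below.

        # Cases 1 and 3, and the very first step (prev_dw == 0): add the simple
        # gradient-descent term -eta * S(t) so the weight keeps moving. It is
        # omitted in case 2, where the slope has changed sign.
        if prev_dw == 0.0 or same_sign:
            dw -= eta * slope

        return dw

In use, each weight would carry its own prev_slope and prev_dw from one step to the next, and the routine would be applied to every weight independently after the batch gradient has been accumulated, in keeping with the local, per-weight character of the method.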

In addition to these weight update rules, several other heuristics are sometimes used.

In one of these, the output error derivative -(d - y) is replaced by

-arctanh(d - y).

Strictly speaking, this is not an error function, as it modifies the calculated derivative rather than the error itself. The arctanh goes to ±∞ as its argument approaches ±1, which greatly magnifies the error for output units that are far from their target values. It also tends to cancel the vanishing derivative for nodes that are saturated at the wrong value, but this case is already handled by the sigmoid-prime term. To avoid numerical problems, a value of 17 (-17) is used for inputs greater than 0.9999999 (less than -0.9999999). This assumes the errors are in (-1, +1); simple scale changes will be needed for tanh nonlinearities and other cases. This heuristic is somewhat nonstandard and is not used in most cases.
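Read literally, the clipping rule above can be sketched as follows (Python; the function name is invented, and the sign convention assumes the returned value plays the role of the derivative -arctanh(d - y) described in the text):

    import math

    def arctanh_output_derivative(d, y):
        # Replaces the usual output error derivative -(d - y) by -arctanh(d - y).
        # Beyond +/-0.9999999 the arctanh is replaced by +/-17 to avoid numerical
        # problems; the error d - y is assumed to lie in (-1, +1).
        e = d - y
        if e > 0.9999999:
            return -17.0
        if e < -0.9999999:
            return 17.0
        return -math.atanh(e)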

In empirical comparisons, quickprop is often one of the faster, more reliable methods and outperforms most other heuristic variations of back-propagation on a wide range of problems. Only Rprop seems to be consistently better; it is perhaps somewhat more reliable, has fewer parameters to tune, and seems to be less sensitive to their values.

Quickprop does have a fixed learning rate parameter η that needs to be chosen to suit the problem. It might be possible to use adaptive methods to control this, but no methods have been described.