Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.8 Search Then Converge

Most of the other methods mentioned in this chapter are designed for batch-mode learning. The following describes an adaptive learning rate method for on-line learning.

As noted in section 5.3.2, the weight trajectory in on-line learning is stochastic and jitters around the error surface. This randomness helps the system explore more of the weight space and makes it more likely to find a good minimum, but it also keeps the weights from settling to a solution, so the asymptotic error may be relatively high. The standard solution is to reduce the learning rate gradually as learning progresses.

The classic schedule used in stochastic approximation [318] is η(t) = c/t, where c is a constant. This guarantees asymptotic convergence and is optimal for c greater than some threshold c*, which depends on the problem [97]. This schedule has problems, however: convergence is slow when c is small, but if c is increased too much then excessively large parameter changes may occur at small t.
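As a hedged illustration (none of this code is from the book), the classic schedule can be dropped into an on-line gradient descent loop. The toy problem, the seed, and the choice c = 0.5 are all assumptions made here for the example:

```python
import random

def eta_classic(t, c):
    """Classic stochastic-approximation schedule: eta(t) = c/t, t >= 1."""
    return c / t

# Toy on-line problem (illustrative): estimate the mean of a noisy stream
# by stochastic gradient descent on the per-sample loss (w - x)^2.
random.seed(0)
w = 0.0
for t in range(1, 10001):
    x = 5.0 + random.gauss(0.0, 1.0)   # noisy sample; true mean is 5
    grad = 2.0 * (w - x)               # gradient of (w - x)^2 at this sample
    w -= eta_classic(t, c=0.5) * grad  # on-line update with eta(t) = c/t
print(round(w, 2))                     # settles near 5.0
```

With c = 0.5 the update reduces exactly to the running sample mean, which is why the estimate settles; a much smaller c would slow convergence, while a much larger c would produce wild early steps, as noted above.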

Darken and Moody [97] proposed the "search then converge" schedule

η(t) = η₀ / (1 + t/τ)        (9.21)

This avoids the unstable behavior at small t, yet still has the desired asymptotic behavior c/t for t >> τ. For t << τ, η(t) ≈ η₀ and the system behaves like normal on-line learning with a constant learning rate. It is hoped that by the time t ≈ τ the system will have found, and be hovering around, a good minimum. At t ≈ τ, η(t) begins decreasing to allow the weights to settle to the solution. For t >> τ, η(t) ≈ c/t, where c = τη₀, and the learning rate approaches the optimum stochastic approximation schedule.
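A minimal sketch of schedule (9.21); η₀ and τ below are arbitrary example values, chosen only to show the two regimes:

```python
def eta_stc(t, eta0, tau):
    """'Search then converge' schedule, eq. (9.21): eta(t) = eta0/(1 + t/tau).
    Roughly eta0 for t << tau (search); roughly eta0*tau/t = c/t for t >> tau."""
    return eta0 / (1.0 + t / tau)

# The two regimes, with illustrative eta0 = 0.1 and tau = 1000:
print(eta_stc(10, 0.1, 1000.0))      # ~0.1: search phase
print(eta_stc(100000, 0.1, 1000.0))  # ~0.001 = (0.1*1000)/100000: converge phase
```

The schedule [98]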

η(t) = η₀ (1 + (c/η₀)(t/τ)) / (1 + (c/η₀)(t/τ) + τ(t/τ)²)        (9.22)

has similar behavior, but decreases η(t) faster at intermediate values of t.
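Under the reconstruction of (9.22) given above, a comparable sketch follows; the constants are again illustrative, and c is deliberately set below τη₀ to make the faster intermediate decrease visible:

```python
def eta_stc2(t, eta0, c, tau):
    """Eq. (9.22) as reconstructed above:
    eta0*(1 + (c/eta0)(t/tau)) / (1 + (c/eta0)(t/tau) + tau*(t/tau)**2).
    Behaves like eta0 for small t and like c/t for t >> tau."""
    r = (c / eta0) * (t / tau)
    return eta0 * (1.0 + r) / (1.0 + r + tau * (t / tau) ** 2)

# Compare against eq. (9.21) at intermediate t (all constants illustrative):
eta0, c, tau = 0.1, 10.0, 1000.0
for t in (100.0, 1000.0, 10000.0):
    print(t, eta0 / (1.0 + t / tau), eta_stc2(t, eta0, c, tau))
```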

A defect of these schedules is that they require the user to choose the parameter c. The optimal value is c* = 1/(2α), where α is the smallest eigenvalue of the Hessian evaluated at the minimum [98]. c* is usually unknown, however, because the minimum has not been found yet. In theory, the Hessian could be estimated and its eigenvalues calculated, but this is computationally intensive and may not be possible in an on-line learning situation. (The main advantages of on-line learning are its computational simplicity and small storage requirements.)
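For a purely hypothetical numerical illustration of c* = 1/(2α): given an estimate of the Hessian at the minimum (assumed here; as noted above, it is usually unavailable on-line), α and c* follow directly:

```python
import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # stand-in Hessian at the minimum (assumed)
alpha = np.linalg.eigvalsh(H).min()  # smallest eigenvalue
c_star = 1.0 / (2.0 * alpha)         # threshold c* = 1/(2*alpha)
print(alpha, c_star)                 # ~0.79, ~0.63
```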

Darken and Moody [98] propose a way to estimate on-line whether c > c* by observing the trajectory of the weight vector. The idea is that when c is too small, successive weight update vectors will be highly correlated: convergence is slow because the weight changes are small, so the gradient changes little from one step to the next and successive weight updates tend to point in similar directions.

An estimate of the drift is

d_k(t) = ⟨δ_k(t)⟩_T / [⟨(δ_k(t) − ⟨δ_k(t)⟩_T)²⟩_T]^(1/2)        (9.23)

D(t) = Σ_k d_k²(t)        (9.24)

where δ_k(t) is the change in the kth component of the weight vector at time t, and the brackets ⟨·⟩_T denote the average over the last T weight changes. In [98], T = at with a << 1. The numerator of (9.23) is the average parameter change; the denominator is the standard deviation of the weight changes, which becomes small when weight updates are highly correlated over time. D(t) grows like a power of t when c is too small but remains finite otherwise.
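A sketch of the drift statistic follows. The per-component ratio and the combination D(t) = Σ_k d_k²(t) are our reading of (9.23)-(9.24), and the two synthetic update streams exist only to contrast correlated and uncorrelated updates:

```python
import math
import random

def drift(deltas):
    """Drift estimate per eqs. (9.23)-(9.24) as reconstructed above.
    `deltas` holds the last T weight-change vectors; <.>_T is the average
    over them. d_k = (mean change of component k) / (its std deviation);
    D sums d_k**2 over components (our assumption for the combination)."""
    T = len(deltas)
    D = 0.0
    for k in range(len(deltas[0])):
        dk = [d[k] for d in deltas]
        mean = sum(dk) / T
        var = sum((x - mean) ** 2 for x in dk) / T
        if var > 0.0:
            D += (mean / math.sqrt(var)) ** 2
    return D

# Correlated updates (c too small) vs. noisy ones hovering around a minimum:
random.seed(1)
correlated = [[0.01 + random.gauss(0.0, 1e-4)] for _ in range(100)]
hovering   = [[random.gauss(0.0, 1e-2)] for _ in range(100)]
print(drift(correlated))  # large: updates keep pointing the same way
print(drift(hovering))    # small: mean change is tiny relative to the spread
```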