
Appendix A - Linear Regression

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

A.3 The LMS Algorithm

The Widrow-Hoff learning rule [402], also called the LMS (least mean squares) algorithm or the delta rule, is essentially an iterative on-line implementation of linear regression. Like gradient descent, it avoids storing R, but the LMS method reduces storage requirements even further. Gradient descent still requires O(N) storage to accumulate error terms for the entire training set in order to approximate the true gradient before making a weight change. The LMS method avoids this by making weight changes as soon as errors occur. In the limit of very small learning rates, the result is the same. Like linear regression, it minimizes the mean squared error of a linear fit, so it shares many properties with linear regression and succeeds or fails in the same situations. An extensive treatment of the LMS algorithm is provided by Widrow and Stearns [405].

As before, the half mean squared error is $E = \tfrac{1}{2}\,\mathrm{E}[e^2]$ and the gradient is

$$\frac{\partial E}{\partial \mathbf{w}} = -\mathrm{E}[e\,\mathbf{x}].$$

The LMS algorithm does steepest descent using an estimate of the gradient based only on the error on the current pattern

$$\hat{\nabla}_k = \frac{\partial}{\partial \mathbf{w}}\left(\tfrac{1}{2}\,e_k^2\right) \tag{A.23}$$
$$\hat{\nabla}_k = -e_k\,\mathbf{x}_k. \tag{A.24}$$

That is, it does on-line rather than batch learning. The update rule is then

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\,\hat{\nabla}_k \tag{A.25}$$
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \eta\,e_k\,\mathbf{x}_k \tag{A.26}$$

where $0 < \eta < 2/\lambda_{\max}$ for stability. Here, subscripts index pattern presentations rather than vector elements.
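
A minimal sketch of the resulting procedure, assuming NumPy and using illustrative names of my own (lms_fit, eta, epochs), is:

```python
import numpy as np

def lms_fit(X, d, eta, epochs=50, rng=None):
    """Sketch of the LMS (Widrow-Hoff) update rule, equations (A.25)-(A.26).

    X      -- (N, n) array of input patterns x_k
    d      -- (N,) array of desired outputs
    eta    -- learning rate; should satisfy 0 < eta < 2/lambda_max
    epochs -- number of passes through the training set
    (names and defaults are illustrative, not from the text)
    """
    rng = np.random.default_rng() if rng is None else rng
    N, n = X.shape
    w = np.zeros(n)                      # initial weight vector
    for _ in range(epochs):
        for k in rng.permutation(N):     # one pattern presentation at a time
            e_k = d[k] - w @ X[k]        # error on the current pattern
            w = w + eta * e_k * X[k]     # (A.26): w <- w + eta * e_k * x_k
    return w
```

Because the weights are updated at each presentation, only the current weight vector and the current pattern need to be held in memory; no error accumulator over the whole training set is required.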

Because $\lambda_{\max}$ is unknown unless R is analyzed, an estimate must be used. R is nonnegative-definite, so all eigenvalues are nonnegative, $\lambda_i \ge 0$, and $\lambda_{\max}$ can be bounded by

$$\lambda_{\max} \le \mathrm{trace}(\mathbf{R}) \tag{A.27}$$

where $\mathrm{trace}(\mathbf{R}) = \sum_i r_{ii} = \sum_i \mathrm{E}[x_i^2]$. Because this places an upper bound on $\lambda_{\max}$ that may be too high, the resulting η value may be smaller than necessary. Adaptive learning rate methods, perhaps initialized this way, may be able to improve training speed by adjusting the rate based on observed training performance. As an aside, this provides justification for centering input variables: because $\mathrm{E}[x_i^2] = \sigma_i^2 + \mu_i^2$, where $\sigma_i^2$ and $\mu_i$ are the variance and mean of input $x_i$, zero-mean inputs produce lower estimates of $\lambda_{\max}$ and allow larger step sizes to be used.
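
To make the bound concrete, the following sketch (again assuming NumPy; the helper name lms_rate_bound and the specific choice $\eta < 2/\mathrm{trace}(\mathbf{R})$ are my own) estimates trace(R) from a sample of patterns and compares the resulting step-size limit for raw and for centered inputs:

```python
import numpy as np

def lms_rate_bound(X):
    """Conservative learning-rate bound based on (A.27).

    trace(R) = sum_i E[x_i^2] is estimated by the mean squared norm of the
    sample patterns; since lambda_max <= trace(R), any eta below
    2/trace(R) also satisfies eta < 2/lambda_max.
    (helper name and the 2/trace(R) choice are illustrative only)
    """
    trace_R = np.mean(np.sum(X ** 2, axis=1))   # estimate of sum_i E[x_i^2]
    return 2.0 / trace_R

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=1.0, size=(1000, 3))  # inputs with large means
X_centered = X_raw - X_raw.mean(axis=0)                 # zero-mean inputs

print(lms_rate_bound(X_raw))       # small allowed step size: E[x_i^2] = sigma_i^2 + mu_i^2
print(lms_rate_bound(X_centered))  # larger allowed step size: E[x_i^2] = sigma_i^2
```

With the means removed, the estimated trace(R) drops from roughly $\sum_i(\sigma_i^2 + \mu_i^2)$ to $\sum_i \sigma_i^2$, so the permitted learning rate grows accordingly.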