
Appendix A - Linear Regression

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

A.3 The LMS Algorithm

The Widrow-Hoff learning rule [402], also called the LMS (least mean squares) algorithm or the delta rule, is essentially an iterative on-line implementation of linear regression. Like gradient descent, it avoids storing R, but the LMS method reduces storage requirements even further. Gradient descent still requires O(N) storage to accumulate error terms for the entire training set in order to approximate the true gradient before making a weight change. The LMS method avoids this by making weight changes as soon as errors occur. In the limit of very small learning rates, the result is the same. Like linear regression, it minimizes the mean squared error of a linear fit, so it shares many properties with linear regression and succeeds or fails in the same situations. An extensive treatment of the LMS algorithm is provided by Widrow and Stearns [405].

As before, the half mean squared error is $E = \tfrac{1}{2}\,\mathrm{E}[e^2]$ and the gradient is

$$\frac{\partial E}{\partial \mathbf{w}} = -\mathrm{E}[e\,\mathbf{x}].$$

The LMS algorithm does steepest descent using an estimate of the gradient based only on the error on the current pattern

$$\hat{\nabla}_k = \frac{\partial}{\partial \mathbf{w}}\left(\tfrac{1}{2}\,e_k^2\right) \tag{A.23}$$
$$\hat{\nabla}_k = -e_k\,\mathbf{x}_k. \tag{A.24}$$

That is, it does on-line rather than batch learning. The update rule is then

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\,\hat{\nabla}_k \tag{A.25}$$
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \eta\,e_k\,\mathbf{x}_k \tag{A.26}$$

where $0 < \eta < 2/\lambda_{\max}$ for stability. Here, subscripts index pattern presentations rather than vector elements.
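
A minimal sketch of the resulting procedure, assuming NumPy and using illustrative names of my own (lms_fit, eta, epochs), is:

```python
import numpy as np

def lms_fit(X, d, eta, epochs=50, rng=None):
    """Sketch of the LMS (Widrow-Hoff) update rule, equations (A.25)-(A.26).

    X      -- (N, n) array of input patterns x_k
    d      -- (N,) array of desired outputs
    eta    -- learning rate; should satisfy 0 < eta < 2/lambda_max
    epochs -- number of passes through the training set
    (names and defaults are illustrative, not from the text)
    """
    rng = np.random.default_rng() if rng is None else rng
    N, n = X.shape
    w = np.zeros(n)                      # initial weight vector
    for _ in range(epochs):
        for k in rng.permutation(N):     # one pattern presentation at a time
            e_k = d[k] - w @ X[k]        # error on the current pattern
            w = w + eta * e_k * X[k]     # (A.26): w <- w + eta * e_k * x_k
    return w
```

Because the weights are updated at each presentation, only the current weight vector and the current pattern need to be held in memory; no error accumulator over the whole training set is required.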

Because $\lambda_{\max}$ is unknown unless R is analyzed, an estimate must be used. R is nonnegative-definite, so all eigenvalues are nonnegative, $\lambda_i \ge 0$, and $\lambda_{\max}$ can be bounded by

$$\lambda_{\max} \le \mathrm{trace}(\mathbf{R}) \tag{A.27}$$

where $\mathrm{trace}(\mathbf{R}) = \sum_i r_{ii} = \sum_i \mathrm{E}[x_i^2]$. Because this places an upper bound on $\lambda_{\max}$ that may be too high, the resulting η value may be smaller than necessary. Adaptive learning rate methods, perhaps initialized this way, may be able to improve training speed by adjusting the rate based on observed training performance. As an aside, this provides justification for centering input variables: because $\mathrm{E}[x_i^2] = \sigma_i^2 + \mu_i^2$, where $\sigma_i^2$ and $\mu_i$ are the variance and mean of input $x_i$, zero-mean inputs produce lower estimates of $\lambda_{\max}$ and allow larger step sizes to be used.
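
To make the bound concrete, the following sketch (again assuming NumPy; the helper name lms_rate_bound and the specific choice $\eta < 2/\mathrm{trace}(\mathbf{R})$ are my own) estimates trace(R) from a sample of patterns and compares the resulting step-size limit for raw and for centered inputs:

```python
import numpy as np

def lms_rate_bound(X):
    """Conservative learning-rate bound based on (A.27).

    trace(R) = sum_i E[x_i^2] is estimated by the mean squared norm of the
    sample patterns; since lambda_max <= trace(R), any eta below
    2/trace(R) also satisfies eta < 2/lambda_max.
    (helper name and the 2/trace(R) choice are illustrative only)
    """
    trace_R = np.mean(np.sum(X ** 2, axis=1))   # estimate of sum_i E[x_i^2]
    return 2.0 / trace_R

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=1.0, size=(1000, 3))  # inputs with large means
X_centered = X_raw - X_raw.mean(axis=0)                 # zero-mean inputs

print(lms_rate_bound(X_raw))       # small allowed step size: E[x_i^2] = sigma_i^2 + mu_i^2
print(lms_rate_bound(X_centered))  # larger allowed step size: E[x_i^2] = sigma_i^2
```

With the means removed, the estimated trace(R) drops from roughly $\sum_i(\sigma_i^2 + \mu_i^2)$ to $\sum_i \sigma_i^2$, so the permitted learning rate grows accordingly.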