
Chapter 3 - Single-Layer Networks

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

3.5 Adalines and the Widrow-Hoff Learning Rule

Adaline networks (short for adaptive linear neuron) were described by Widrow et al. in the 1960s [402], [399], [404], [401], [400], [357]. The structure is a single-layer network with a step function as the nonlinearity. Although the perceptron was analyzed assuming binary inputs, Adaline inputs may be continuous. Weights are adjusted with the Widrow-Hoff learning rule to minimize the difference between the output and an externally supplied target.

The Widrow-Hoff learning rule, also called the LMS algorithm or the delta rule, is basically an iterative implementation of linear regression (see appendix A); both minimize the mean squared error of a linear fit. Ideally the solutions are the same, so the rule shares many properties with linear regression and succeeds or fails in similar situations.

An input pattern x is selected at random from the training set, applied to the input, and the output is calculated by equations 3.12 and 3.13. The weights are then updated by

Δw = η (t − u) x / (xᵀx)        (3.34)

where η is a small positive constant, the learning rate. Note that this rule minimizes the difference between the target and the weighted input sum u, not the output y = f(u). Weight adjustments are repeated with new patterns until the total error (summed over all training patterns) is minimized.
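As an illustration, the following NumPy sketch implements this procedure with the normalized step of equation 3.34. The function name, the fixed number of passes, and the OR example at the end are illustrative choices, not code from the text.

```python
import numpy as np

def train_adaline(X, t, eta=0.5, epochs=100, rng=None):
    """Widrow-Hoff (LMS / delta rule) training of a single Adaline.

    X : (P, d) array of input patterns; include a constant 1 column for the bias.
    t : (P,) array of targets.
    The weights are adjusted toward the weighted sum u, not the
    thresholded output y = f(u).
    """
    rng = np.random.default_rng(rng)
    P, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for p in rng.permutation(P):              # patterns in random order
            x = X[p]
            u = w @ x                             # weighted input sum
            w += eta * (t[p] - u) * x / (x @ x)   # normalized step (eq. 3.34)
    return w

# Example: the OR function with bipolar targets (bias in the first column).
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([-1., 1., 1., 1.])
w = train_adaline(X, t, eta=0.1, epochs=200, rng=0)
y = np.sign(X @ w)        # thresholded outputs; should reproduce the targets
```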

With the input pattern held fixed, each weight update removes a fraction η of the error on that pattern. Stability [406] requires 0 < η < 2, and generally 0.1 < η < 1.0 in practice. For η = 1 the error on the present pattern is completely corrected in one cycle; for η > 1, it is overcorrected. Because the unit is linear in the weights, the error function is quadratic and has only one global minimum; there are no local minima.
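A few lines of NumPy make this geometric decay visible; the pattern, target, and step size below are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])   # one fixed input pattern (bias included)
w = np.zeros(3)
t, eta = 1.0, 0.5                # target and learning rate

# Repeatedly applying the update to the same pattern removes a fraction
# eta of the error each time, so the error shrinks by (1 - eta) per step.
for _ in range(5):
    u = w @ x
    print(t - u)                 # 1.0, 0.5, 0.25, ...
    w += eta * (t - u) * x / (x @ x)
```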

If the input patterns are linearly independent, the weights will converge to unique values. If not, the corrections will be oscillatory and η should decrease over time to allow the weights to settle. One possible schedule is η = 1/k, where k indexes the iterations [212].
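A sketch of the same training loop with the η = 1/k schedule (again an illustrative implementation rather than code from the text):

```python
import numpy as np

def train_adaline_decay(X, t, epochs=100, rng=None):
    """LMS training with a decaying learning rate eta = 1/k, where k counts updates."""
    rng = np.random.default_rng(rng)
    P, d = X.shape
    w = np.zeros(d)
    k = 0
    for _ in range(epochs):
        for p in rng.permutation(P):
            k += 1
            x = X[p]
            w += (1.0 / k) * (t[p] - w @ x) * x / (x @ x)   # step size shrinks over time
    return w
```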

Vectors a and b are orthogonal if aᵀb = 0. A set of vectors is orthonormal if any two distinct vectors in the set are orthogonal and each has unit length. For orthonormal input patterns, the optimal weights to which the rule converges iteratively can be computed directly [359]

w = Σ_p t_p x_p        (3.35)

Here p indexes training patterns, not vector elements; x_p is the p-th input pattern, a column vector, and t_p is the p-th output target. This is essentially Hebbian learning. Because the patterns are orthogonal, stored patterns do not interfere with each other, so recall is perfect.
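The direct solution is easy to verify numerically. In the sketch below the orthonormal patterns are obtained from a QR factorization of a random matrix; the dimensions and targets are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four orthonormal patterns in R^6 (rows of X), from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((6, 4)))
X = Q.T
t = np.array([1.0, -1.0, 1.0, 1.0])      # arbitrary targets

# Direct Hebbian solution (eq. 3.35): w = sum over p of t_p x_p.
w = X.T @ t

# Recall is perfect: the weighted sum reproduces every stored target.
assert np.allclose(X @ w, t)
```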

Orthonormality is not, however, a requirement for perfect recall of the stored patterns. The correct output will be produced in response to every stored pattern as long as the patterns are merely linearly independent. Unlike the perceptron algorithm, which does not converge when the patterns are not linearly separable, the delta rule still converges to the optimal mean-squared-error solution. When the patterns are linearly separable, the minimum-MSE solution may separate the classes, but this is not guaranteed; there are linearly separable data sets that an MSE solution does not separate correctly.
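The contrast can be seen on a small problem that is not linearly separable; the data set and step size below are made up for illustration. The iterative rule settles near the batch least-squares solution, which is the minimum-MSE solution it seeks, while the perceptron rule would never stop updating on such data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D problem (bias in the first column) that no single threshold can
# separate: the +1 pattern at x = 2 is surrounded by -1 patterns.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
t = np.array([-1., -1., 1., -1.])

# Iterative delta rule with a small fixed learning rate ...
w = np.zeros(2)
for _ in range(2000):
    for p in rng.permutation(len(X)):
        x = X[p]
        w += 0.02 * (t[p] - w @ x) * x / (x @ x)

# ... hovers near the batch minimum-MSE (least-squares) solution.
w_mse = np.linalg.lstsq(X, t, rcond=None)[0]
print(w, w_mse)          # the two weight vectors should roughly agree
```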

3.5.1 Adaline Capacity

An Adaline can store and recall perfectly up to N linearly independent patterns. This is limited by the dimension of the input, since N patterns can be linearly independent only if N ≤ d, where d is the dimension of the patterns (the number of inputs including the bias unit).
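This limit is easy to check numerically. The sketch below, with arbitrary dimensions and random patterns, solves for the weights directly rather than iterating the delta rule; for linearly independent patterns the delta rule converges to the same solution.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                  # number of inputs, including the bias
N = 5                                  # up to d patterns can be stored exactly

X = rng.standard_normal((N, d))        # random patterns are almost surely independent
assert np.linalg.matrix_rank(X) == N

t = rng.choice([-1.0, 1.0], size=N)    # arbitrary targets
w = np.linalg.lstsq(X, t, rcond=None)[0]

assert np.allclose(X @ w, t)           # exact recall of all N stored patterns
```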

For random input patterns and targets, the capacity approaches 2d [401]. This is an average capacity for random inputs and targets and follows from the results discussed in section 3.3. But as few as four patterns can form a set that is not linearly separable and therefore unlearnable by the Adaline. In a sense, random data is the hardest to learn. In most real problems, however, patterns and targets are not random; the target usually depends on the inputs in a systematic way. When the patterns are linearly separable (or nearly so), the system is able to correctly classify many more than 2d patterns [401]. The case of two well-separated clusters is a simple example.

In simulations, it is reported that approximately 5d training passes are required to achieve a 20% error level [402], where d is the number of bits per pattern. This generalizes to approximately d/ε training passes, where ε is the acceptable error level. For linearly separable problems, it is sufficient to train only on the border patterns since all patterns farther from the border will be classified correctly if the border patterns are [401].