
Chapter 3 - Single-Layer Networks

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

3.4 Learning Rules for Single-Layer Networks

A single-layer network can be trained by many methods. Almost every optimization technique has been applied to neural network training at some time or another. In addition to the general purpose optimization algorithms (some are reviewed in chapter 10) there are methods specialized for single-layer classifiers.

Rosenblatt's original perceptron learning algorithm (described in section 3.4.1) can be used for binary input-output problems that are known to be linearly separable. Statistical procedures can be applied in many cases. Linear discriminant analysis may be used for classification problems where the data form Gaussian clusters. The mathematics of single-layer perceptrons is similar to that of linear regression. Indeed, when f is linear, the output is a simple linear combination of the inputs and optimal weights can be calculated directly. Logistic regression (e.g., [178]) is very similar to learning binary functions with a single-layer network of sigmoid units using a cross-entropy error function.

When the node nonlinearity is smooth and the error function is differentiable, various gradient-based optimization methods including back-propagation can be used. More sophisticated gradient based techniques such as Newton's method, Levenberg-Marquardt, or conjugate gradient descent may be especially effective for single-layer networks because their error surface does not depart too much from the basic quadratic error function for which these methods are specialized.

3.4.1 The Perceptron Learning Algorithm

The perceptron learning algorithm is suitable for learning linearly separable binary functions of binary inputs and is guaranteed to find a solution if the classes are linearly separable. Unfortunately, if the classes are not linearly separable, the algorithm may not even converge and is unlikely to produce the best solution when it does. (Section 3.4.3 describes a possible way of handling this problem.) Because of these problems and the general limitations of single-layer networks, the method is rarely used except in special circumstances.

Because the outputs are binary, linear threshold units are used. Each unit computes the weighted sum u of its N inputs x_j, j = 1, …, N, and generates a binary output y. A node threshold is absorbed into the weight vector by assuming the presence of a constant value bias unit, x_bias = 1. Input, output, and target values are assumed to be ±1.

u = w · x = Σ_j w_j x_j,    y = +1 if u ≥ 0, −1 otherwise.    (3.12)
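A linear threshold unit of this kind can be sketched in a few lines of NumPy (the function name `ltu_output` is illustrative, not from the text):

```python
import numpy as np

def ltu_output(w, x):
    """Linear threshold unit: +1 if w.x >= 0, else -1.

    The node threshold is assumed to be absorbed into w via a
    constant bias input x_bias = 1, as described in the text.
    """
    u = np.dot(w, x)              # weighted sum u = w.x
    return 1 if u >= 0 else -1
```

For example, with w = (1, −1) the unit outputs +1 for x = (1, 1), where the weighted sum is zero, and −1 for x = (−1, 1).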

The weights can be updated by a number of simple learning rules [166]. During training, input patterns x are presented and the outputs y(x) are compared to the targets t(x). Weights are adapted by

w_j(k+1) = w_j(k) + Δw_j(k)    (3.13)

Δw_j = η (t − y) x_j    (3.14)

where η is a small positive constant controlling the learning rate. Typically 0 < η < 1. Because t, y ∈ {−1, +1}, the following are equivalent:

Δw_j = η (t − y) x_j    (3.15)

and

Δw_j = 2η t x_j if y ≠ t,   Δw_j = 0 if y = t.    (3.16)

In both cases, no change occurs when the output classification is correct. Otherwise, each element of the weight vector is incremented by 2η when the output is less than the target or decremented by 2η when the output is greater.
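The update rule just described can be sketched as a complete training loop. This is a minimal illustration, not the book's code; `perceptron_train` and its parameters are assumed names:

```python
import numpy as np

def perceptron_train(X, t, eta=0.5, max_epochs=100):
    """Train a linear threshold unit with the perceptron rule
    delta_w = eta * (t - y) * x.  Inputs and targets are +/-1;
    a constant bias input of +1 is appended to each pattern.
    Converges only if the classes are linearly separable.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb threshold
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xb, t):
            y = 1.0 if np.dot(w, x) >= 0 else -1.0
            if y != target:
                w += eta * (target - y) * x   # each element moves by 2*eta
                errors += 1
        if errors == 0:                       # perfect classification
            break
    return w
```

On a linearly separable problem such as logical AND in the ±1 encoding, the loop typically stops after a handful of epochs once no pattern is misclassified.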

For improved reliability, it may be desirable that a unit activate only when the weighted sum u = w · x is greater than a threshold Nk, where 0 ≤ k < 1. The following rule may be used [325]:

Δw_j = η Θ(Nk − t u) t x_j    (3.17)

where Θ(u) is the unit step function, Θ(u) = 1 for u ≥ 0 and Θ(u) = 0 for u < 0.

Note that as the classification accuracy improves, the system makes fewer errors so weight changes become less and less frequent. The effective learning rate therefore slows down so convergence to perfect classification may take a long time.
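A single update of this thresholded rule might be sketched as follows; `margin_update` and its argument names are illustrative, with the unit step Θ implemented inline:

```python
import numpy as np

def margin_update(w, x, t, eta, k):
    """One weight update of the thresholded perceptron rule:
    the weights change only when the margin test t*(w.x) > N*k
    fails, where N is the input dimension and 0 <= k < 1."""
    N = len(x)
    u = np.dot(w, x)
    theta = 1.0 if N * k - t * u >= 0 else 0.0   # unit step
    return w + eta * theta * t * x
```

With zero initial weights every pattern fails the margin test, so the first presentation always produces an update; once t·u exceeds Nk for every pattern, the weights stop changing.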

Figure 3.9: The separation problem. (a) A weight vector must be found that separates the positive samples (open circles) from the negative samples (filled circles). (b) The transformation z = tx reflects the negative examples across the origin and changes the problem to one of finding a weight vector such that w · z > 0 for each transformed pattern z. D(w) is the minimum distance from any point z to the separating hyperplane defined by w. (c) The difficulty of a problem is determined by how much w can be varied and still meet the separating criterion. Easy problems can be satisfied by many w vectors.

3.4.2 Perceptron Convergence Proof

With input, output, and target values of ±1, and using learning rule (3.17), a test of correct classification is that [166]

t (w · x) > Nk    (3.18)

(w · x = Σ_i w_i x_i denotes the inner product of w and x). As shown in figure 3.9(b), the transformation z = tx reflects the negative examples across the origin and changes the problem to one of finding a weight vector such that w · z > 0 for every pattern z. The points z will be classified correctly if

z^μ · w > Nk for every pattern μ.    (3.19)

The Nk term adds the additional requirement that all z must be at least a distance Nk/||w|| from the origin and provides a margin for noise and fault tolerance.

The difficulty of a problem is determined by how much w can be varied and still meet the separating criterion. Easy problems can be satisfied by many hyperplanes while hard problems require very precise placement of the separator (figure 3.9(c)). Let D(w) be the minimum distance from any point z to the separating hyperplane defined by w:

D(w) = min_μ (z^μ · w) / ||w||    (3.20)

where the superscript μ indexes the points. D(w) is positive when all the points are on the positive side of the hyperplane, in which case (3.19) can be satisfied by making ||w|| large enough to ensure the safety margin Nk. There is some wopt for which Dmax = D(wopt) is maximized; this defines the optimal perceptron.
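The definition of D(w) translates directly into NumPy. In this sketch `margin_D` is an illustrative name and `Z` holds the transformed patterns z^μ as rows:

```python
import numpy as np

def margin_D(w, Z):
    """D(w): the minimum signed distance from the transformed
    patterns z^mu (rows of Z) to the hyperplane w.z = 0.
    Positive only when every pattern is on the positive side."""
    return np.min(Z @ w) / np.linalg.norm(w)
```

For Z = [[1, 0], [0, 1], [1, 1]] and w = (1, 1), the minimum projection is 1 and ||w|| = √2, so D(w) = 1/√2 ≈ 0.707.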

The following proof of convergence is provided in [166]. Additional results may be found in [284], a good summary of early work in perceptron-like networks. At each step, a pattern is chosen and the weights are updated only if equation 3.19 is not satisfied. Let Mμ denote the number of times that pattern μ has been used to update the weights at some point in the learning process. Then at that time

w = η Σ_μ M^μ z^μ    (3.21)

if the initial weights are 0.

Assume that a solution vector w* exists. Because w* is a solution, D(w*) > 0. The proof computes bounds on ||w|| and on the overlap w · w*. These are used to show that w · w*/||w|| would get arbitrarily large if M = Σ_μ M^μ kept increasing. This is impossible, however, because w* is fixed, so updating must stop after some finite M.

Using (3.21) and (3.20),

w · w* = η Σ_μ M^μ z^μ · w*    (3.22)
    ≥ η M min_μ z^μ · w*    (3.23)
    = η M D(w*) ||w*||.    (3.24)

D(w*)||w*|| is a constant, so w · w* grows like M.

An upper bound on ||w|| is obtained by considering the change in length of w at a single update by pattern α:

Δ||w||² = ||w + η z^α||² − ||w||²    (3.25)
    = 2η w · z^α + η² ||z^α||²    (3.26)
    ≤ 2η N k + η² N    (3.27)
    = η N (2k + η).    (3.28)

The inequality comes from the fact that (3.19) is not satisfied when a weight update occurs, so w · z^α ≤ Nk. Because z_i^α = ±1, ||z^α||² = N. The upper bound results from summing the increments for M steps:

||w||² ≤ M η N (2k + η).    (3.29)

Thus ||w|| grows no faster than √M.

Next, consider the normalized scalar product

(w · w*)² / (||w||² ||w*||²)    (3.30)

which is the square of the cosine of the angle between w and w* and cannot be greater than one. By (3.24) and (3.29),

(w · w*)² / (||w||² ||w*||²) ≥ (η M D(w*) ||w*||)² / (||w*||² M η N (2k + η)) = M η D(w*)² / (N (2k + η))    (3.31)

which gives an upper bound on the number of weight updates (using the best possible w*)

M ≤ N (2k + η) / (η D²_max)    (3.32)

where D_max = max_w D(w) is D(w) maximized over all possible w. Thus M has an upper limit that is an unknown, but finite, constant. Because there can be no more than this number of updates, the weights must eventually converge to a final unchanging value.

This upper bound is proportional to N, the number of inputs, but does not depend on the number of patterns. In reality, however, in order to find the patterns that require weight updates, we have to cycle through the patterns, a process that takes time proportional to the number of patterns on the average. Also, Dmax usually decreases as the number of patterns grows, resulting in a larger M. M also grows linearly with k, because for larger k, a larger ||w|| is needed to satisfy (3.19) along a good weight vector.

Equation 3.32 cannot be used to calculate the time required for convergence because it requires knowledge of a solution vector. Hampson and Volper [151] studied average case upper and lower bounds for M. The number of weight updates required is bounded by

M ≤ |X|²_max ||w*||² / (min_μ t^μ w* · x^μ)²    (3.33)

where |X|²_max is the largest squared magnitude of any input vector (N + 1 for N bivalent input units) and w* is any solution vector. The worst case upper bound is exponential in N. An average case upper bound may be O(4^N) for large N. The worst case lower bound is M ≥ 2^N. The average case lower bound is O(1.4^N) for larger N [151]. It is reported that M grows linearly with the number of irrelevant inputs that are uncorrelated with the target outputs.

3.4.3 The Pocket Algorithm

Although the perceptron algorithm is guaranteed to learn pattern sets that are linearly separable, it does not converge when the training data are not linearly separable. Most of the time the weight vector stays in regions that give small numbers of errors, but not always. It is possible for the system to visit a near-optimal set of weights and then wander away to a very bad set; the total error may be small at one moment and suddenly become large the next, so there is no guarantee that a weight vector obtained late in the training process will be near optimal.

Gallant's "pocket algorithm" [133] keeps a copy of the best weight vector obtained so far. Two sets of weights are maintained: a working set and the pocket set. Training examples are chosen randomly, and whenever the working weights achieve a run of consecutive correct classifications longer than the best run achieved by the pocket weights, the working weights replace the pocket weights. The quality of the pocket weights thus tends to improve over time and converge to an optimum set.
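A minimal sketch of this scheme, assuming the standard perceptron update for the working weights (names such as `pocket_train` are illustrative, not Gallant's):

```python
import numpy as np

def pocket_train(X, t, eta=0.5, n_iters=1000, seed=0):
    """Pocket-algorithm sketch: train working weights w with the
    perceptron rule on randomly chosen examples; whenever w's run
    of consecutive correct classifications exceeds the pocket's
    best run, copy w into the pocket."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias input
    w = np.zeros(Xb.shape[1])
    pocket_w, run, best_run = w.copy(), 0, 0
    for _ in range(n_iters):
        i = rng.integers(len(Xb))                  # random example
        x, target = Xb[i], t[i]
        y = 1.0 if np.dot(w, x) >= 0 else -1.0
        if y == target:
            run += 1
            if run > best_run:                     # new best run
                best_run, pocket_w = run, w.copy()
        else:
            run = 0
            w = w + eta * (target - y) * x         # perceptron update
    return pocket_w
```

On a linearly separable problem the pocket weights eventually coincide with a separating vector; on a non-separable problem they tend toward a vector with few errors rather than wandering like the working weights.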

3.4.4 Rosenblatt's Perceptron

Single-layer sigmoidal networks are often called perceptrons even though this isn't strictly accurate. The original perceptron described by Rosenblatt [324], [325], [46] actually consisted of a family of networks, most having more than one layer and some having recurrent connections. In most cases, however, only the final layer of weights to the output units was modifiable. Connections in preceding layers had fixed values, either randomly chosen or set by some systematic procedure. The output nodes were often connected in a winner-take-all fashion. The label probably stuck because Minsky and Papert [268] reduced the structure to a single layer for the purpose of analyzing what could be learned by the modifiable units. The well-known result was that single-layer networks are adequate only for linearly separable functions.

Unlike a single-layer network, Rosenblatt's perceptron was capable of correctly classifying patterns that were not linearly separable if the preceding units had appropriate responses. But because only the final layer of weights was modifiable, it could not learn certain functions if this required adjustment of the responses of the preceding units.

Rosenblatt's perceptron is historically significant because it was one of the first biologically plausible models specified precisely enough to simulate and complex enough to show interesting behavior [46]. It was responsible for much of the initial interest in neural networks during the 1960s.

Now the term "perceptron" is commonly used to refer to any feedforward network of nodes with responses like f(w^T x), where f is a sigmoid-like "squashing function." Single-layer networks are often simply called perceptrons and networks with more than one layer are typically called multilayer perceptrons.