
Chapter 5 - Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 


Overview

Back-propagation is, by far, the most commonly used method for training multilayer feedforward networks. The term back-propagation refers to two different things. First, it describes a method to calculate the derivatives of the network training error with respect to the weights by a clever application of the derivative chain-rule. Second, it describes a training algorithm, basically equivalent to gradient descent optimization, for using those derivatives to adjust the weights to minimize the error.

The algorithm was popularized by Rumelhart, Hinton, and Williams [329], [330], although earlier work had been done by Werbos [390], Parker [295], and Le Cun ([89]; summarized in [90]). Together with the Hopfield network, it was responsible for much of the renewed interest in neural networks in the mid-1980s. Before back-propagation, most networks used nondifferentiable hard-limiting binary nonlinearities such as step functions and there were no well-known general methods for training multilayer networks. The breakthrough was perhaps not so much the application of the chain-rule, but the demonstration that layered networks of differentiable nonlinearities could perform useful nontrivial calculations and that they offer (in some implementations) attractive features such as fast response, fault tolerance, the ability to "learn" from examples, and some ability to generalize beyond the training data.

As a training algorithm, the purpose of back-propagation is to adjust the network weights so the network produces the desired output in response to every input pattern in a predetermined set of training patterns. It is a supervised algorithm in the sense that, for every input pattern, there is an externally specified "correct" output which acts as a target for the network to imitate. Any difference between the network output and the target is treated as an error to be minimized. A "teacher" must decide which patterns to include in the training set and specify the correct output for each. It is an off-line algorithm in the sense that training and normal operation occur at different times. In the usual case, training could be considered part of the "manufacturing" process wherein the network is trained once for a particular function, then frozen and put into operation. Normally, no further learning occurs after the initial training phase.

To train a network, it is necessary to have a set of input patterns and corresponding desired outputs, plus an error function (cost function) that measures the "cost" of differences between network outputs and desired values. The basic steps are these.

  1. Present a training pattern and propagate it through the network to obtain the outputs.

  2. Compare the outputs with the desired values and calculate the error.

  3. Calculate the derivatives ∂E/∂w_ij of the error with respect to the weights.

  4. Adjust the weights to minimize the error.

  5. Repeat until the error is acceptably small or time is exhausted.
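The five steps above can be sketched as a short training loop. This is a minimal illustration for a network with one sigmoid hidden layer trained by batch gradient descent; the layer sizes, learning rate, and toy OR-function training set are illustrative choices, not taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # input patterns
D = np.array([[0.], [1.], [1.], [1.]])                  # desired outputs (OR)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)  # input -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)  # hidden -> output
eta = 0.5  # learning-rate step size for gradient descent

for epoch in range(10000):
    # 1. present the training patterns and propagate them forward
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # 2. compare outputs with targets: sum-of-squares error
    E = np.sum((D - Y) ** 2)
    if epoch == 0:
        E0 = E  # error before any training, kept for reference
    # 3. derivatives of E with respect to the weights, via the chain rule
    gY = -2.0 * (D - Y) * Y * (1.0 - Y)   # delta at the output layer
    gH = (gY @ W2.T) * H * (1.0 - H)      # delta back-propagated to the hidden layer
    # 4. adjust the weights down the error gradient
    W2 -= eta * (H.T @ gY); b2 -= eta * gY.sum(axis=0)
    W1 -= eta * (X.T @ gH); b1 -= eta * gH.sum(axis=0)
    # 5. stop when the error is acceptably small (or the epochs run out)
    if E < 1e-3:
        break
```

Here the weights are updated once per pass through all the patterns (batch mode); updating after every individual pattern is the other common variant.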

A common choice of error function is the sum-of-squares error, below.

E_SSE = Σ_p Σ_i (d_pi − y_pi)²     (5.1)

Here p indexes the patterns in the training set, i indexes the output nodes, and d_pi and y_pi are, respectively, the target and actual network output for the ith output node on the pth pattern. The mean-squared error

E_MSE = (1 / NP) Σ_p Σ_i (d_pi − y_pi)² = E_SSE / NP     (5.2)

normalizes E_SSE for the number of training patterns P and network outputs N. Advantages of the SSE and MSE functions include easy differentiability and the fact that the cost depends only on the magnitude of the error. In particular, a deviation of a given magnitude has the same cost independent of the input pattern and independent of errors on other outputs. For classification problems, logarithmic or cross-entropy error functions (section 2.1) are sometimes used. For real-world applications, the cost function may be specialized to assign different costs to different sorts of deviations; similar errors on different input patterns may have different costs and the cost of an error on one output could depend on the errors on other outputs.
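The two error measures translate directly into code. In this sketch, d and y are P × N arrays of targets and network outputs (P patterns, N output nodes); the function names and the small numeric example are illustrative.

```python
import numpy as np

def sse(d, y):
    """Sum-of-squares error, eq. (5.1): summed over patterns and outputs."""
    return np.sum((d - y) ** 2)

def mse(d, y):
    """Mean-squared error, eq. (5.2): SSE normalized by N * P."""
    P, N = d.shape
    return sse(d, y) / (N * P)

# Two patterns (P = 2), two output nodes (N = 2)
d = np.array([[1.0, 0.0], [0.0, 1.0]])  # targets
y = np.array([[0.8, 0.1], [0.2, 0.9]])  # network outputs

print(round(sse(d, y), 6))  # 0.04 + 0.01 + 0.04 + 0.01 = 0.1
print(round(mse(d, y), 6))  # 0.1 / (2 * 2) = 0.025
```

Because the cost depends only on the squared deviations, swapping any two entries of the error matrix d − y leaves both measures unchanged.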