
Chapter 5 - Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

5.2 Back-Propagation: The Derivative Calculation

Having obtained the outputs and calculated the error, the next step is to calculate the derivative of the error with respect to the weights. First we note that ESSE = Σp Ep is just the sum of the individual pattern errors, so the total derivative is just the sum of the per-pattern derivatives

∂ESSE/∂wij = Σp ∂Ep/∂wij        (5.8)
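
Equation 5.8 simply says that the batch gradient is accumulated pattern by pattern. As a minimal sketch (the flat weight vector, the callable, and all names are our assumptions, not the book's), any per-pattern gradient routine, such as the one sketched later in this section, can be summed this way:

```python
import numpy as np

def batch_gradient(per_pattern_grad, weights, patterns, targets):
    """Total SSE gradient as the sum of per-pattern gradients (equation 5.8).

    per_pattern_grad : callable(weights, x, d) -> gradient array for one pattern
    weights          : 1-D array holding all network weights
    """
    total = np.zeros_like(weights, dtype=float)
    for x, d in zip(patterns, targets):
        total += per_pattern_grad(weights, x, d)
    return total
```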

The thing that makes back-propagation (the derivative calculation) efficient is how the operation is decomposed and the ordering of the steps. The derivative can be written

∂Ep/∂wij = Σk (∂Ep/∂ak)(∂ak/∂wij)        (5.9)

where the index k runs over all nodes and ak is the weighted-sum input for node k obtained in equation 5.3. It is convenient to first calculate a value δi for each node i

δi ≡ ∂Ep/∂ai        (5.10)
   = (∂Ep/∂yi)(∂yi/∂ai)        (5.11)

which measures the contribution of ai to the error on the current pattern. For simplicity, pattern indexes p are omitted on yi, ai, and other variables below.

For output nodes, ∂Ep/∂ak is obtained directly

δk = ∂Ep/∂ak = (∂Ep/∂yk)(∂yk/∂ak)        (5.12)

The first term is obtained from equation 5.7,

∂Ep/∂yk = -(dk - yk)        (5.13)

(This is the SSE result; different expressions will be obtained for other error functions.) The second term

∂yk/∂ak = f'(ak)        (5.14)

is just the slope f'k ≡ f'(ak) of the node nonlinearity at its current activation value. The sigmoid is convenient to use because f' is a simple function of the node output: f'(ak) = y(1 - y), where y = f(ak). The tanh function is also convenient: f'(ak) = 1 - y².
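
As a small illustrative sketch (the function names are ours, not the book's), both slopes can be evaluated from the stored node outputs alone, so no extra function evaluation is needed in the backward pass:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid y = f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_slope(y):
    """f'(a) for the sigmoid, written in terms of the output y = f(a)."""
    return y * (1.0 - y)

def tanh_slope(y):
    """f'(a) for tanh, written in terms of the output y = tanh(a)."""
    return 1.0 - y ** 2
```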

For hidden nodes, δi is obtained indirectly. Hidden nodes can influence the error only through their effect on the nodes k to which they send output connections, so

∂Ep/∂ai = Σk (∂Ep/∂ak)(∂ak/∂ai)        (5.15)

But the first factor is just the δk of node k, so

δi = Σk δk (∂ak/∂ai)        (5.16)

The second factor is obtained by noting that if node i connects directly to node k, then ∂ak/∂ai = f'i wki; otherwise it is zero. So we end up with

δi = f'i Σk wki δk        (5.17)

(for hidden nodes). In other words, δi is a weighted sum of the δk values of nodes k to which it has connections wki (see figure 5.3).

Because δk must be calculated before δi, i < k, the process starts at the output nodes and works backward toward the inputs, hence the name "back-propagation." First δ values are calculated for the output nodes, then values are calculated for nodes that send connections to the outputs, then values are calculated for nodes two steps removed from the outputs, and so forth.

To summarize so far,

δi = -f'i (di - yi)        (output nodes)
δi = f'i Σk wki δk         (hidden nodes)        (5.18)

For output nodes, δi depends only on the error di - yi and the local slope f'i of the node activation function. For hidden nodes, δi is a weighted sum of the δs of all the nodes it connects to, times its own slope f'i. Because of the way the nodes are indexed, all delta values can be updated in a single sweep through the nodes in reverse order. In layered networks, delta values are first evaluated at the output nodes based on the current pattern errors, the last hidden layer is then evaluated based on the output delta values, the second-to-last hidden layer is updated based on the values of the last hidden layer, and so on backwards to the input layer. Normally it is not necessary to calculate delta values for the input layer, so the process usually stops with the first hidden layer.
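
The backward sweep can be sketched in a few lines for a fully connected layered network of sigmoid units. This is an illustrative reading of equations 5.13, 5.14, and 5.17, not code from the book; the data layout (a list of weight matrices and stored layer outputs) and all names are our assumptions.

```python
import numpy as np

def backward_deltas(weights, outputs, target):
    """Backward sweep: deltas[l][i] = dEp/da_i for every node of a layered sigmoid net.

    weights[l] : matrix feeding layer l+1 from layer l, shape (n_{l+1}, n_l)
    outputs[l] : stored node outputs y of layer l (outputs[0] is the input pattern)
    target     : desired output vector d for the current pattern
    """
    slope = lambda y: y * (1.0 - y)          # sigmoid slope from the output
    L = len(outputs) - 1                     # index of the output layer
    deltas = [None] * (L + 1)

    # Output layer: delta_k = -(d_k - y_k) * f'(a_k)
    deltas[L] = -(target - outputs[L]) * slope(outputs[L])

    # Hidden layers, swept in reverse order: delta_i = f'(a_i) * sum_k w_ki delta_k
    for l in range(L - 1, 0, -1):
        deltas[l] = slope(outputs[l]) * (weights[l].T @ deltas[l + 1])

    return deltas
```

The loop stops at the first hidden layer, matching the usual practice of not computing input-layer deltas.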

Figure 5.3: Backward propagation. Node deltas are calculated in the backward pass, starting at the output nodes and proceeding backwards. At each hidden node i, δi is calculated as a weighted linear combination of the values δk, k > i, of the nodes k to which node i sends outputs. Note that the δ values travel backward against the normal direction of the connecting links.

Having obtained the node deltas, it is an easy step to find the partial derivatives ∂Ep/∂wij with respect to the weights. The second term in (5.9) is ∂ak/∂wij. Because ak is a simple linear sum, this is zero if k ≠ i; otherwise

∂ai/∂wij = yj        (5.19)

where yj is the output activation of node j. Finally, from (5.9), the derivative of pattern error Ep with respect to weight wij is then

∂Ep/∂wij = δi yj        (5.20)

the product of the delta value at node i and the output of node j.
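
Putting equations 5.18 and 5.20 together, a hedged end-to-end sketch for a one-hidden-layer sigmoid network (our own construction and naming; biases are omitted for brevity) looks like this:

```python
import numpy as np

def pattern_gradient(W1, W2, x, d):
    """Per-pattern SSE gradient for a network y = f(W2 f(W1 x)) with sigmoid f.

    W1 : (n_hidden, n_in)  weights into the hidden layer
    W2 : (n_out, n_hidden) weights into the output layer
    x  : input pattern,  d : desired output pattern
    Returns (dEp/dW1, dEp/dW2) for Ep = 0.5 * sum_k (d_k - y_k)**2.
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid
    fp = lambda y: y * (1.0 - y)             # its slope, from the output

    # Forward pass
    y1 = f(W1 @ x)                           # hidden outputs
    y2 = f(W2 @ y1)                          # network outputs

    # Backward sweep (equation 5.18)
    delta2 = -(d - y2) * fp(y2)              # output-node deltas
    delta1 = fp(y1) * (W2.T @ delta2)        # hidden-node deltas

    # Weight derivatives, dEp/dw_ij = delta_i * y_j (equation 5.20)
    return np.outer(delta1, x), np.outer(delta2, y1)
```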

Derivatives with Respect to the Inputs Normally, delta values are not calculated for the input nodes because they are not needed to adjust the weights. Notice, however, that for input nodes δi = ∂E/∂ai is the derivative of the error with respect to the input. Also, if the network has a single output y, then setting ∂E/∂y = 1 and back-propagating gives the derivative of the output with respect to the input, ∂y/∂x. There are, of course, useful applications for these derivatives. They can be used, for example, in inverse problems where we seek input values that produce a particular output.
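
For instance, under the assumption of a single linear output unit (an illustrative setup of our own, not the book's example), the derivative of the output with respect to the inputs is obtained by seeding the backward pass with ∂E/∂y = 1 and propagating one step past the first hidden layer, down to the inputs:

```python
import numpy as np

def output_input_derivative(W1, W2, x):
    """dy/dx for a net with sigmoid hidden units and a single linear output y = W2 f(W1 x)."""
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    fp = lambda y: y * (1.0 - y)

    y1 = f(W1 @ x)                           # hidden outputs
    delta_out = np.ones(W2.shape[0])         # seed: dE/dy = 1 at the (single) output
    delta_hidden = fp(y1) * (W2.T @ delta_out)
    # One extra backward step gives the input deltas: dy/dx_j = sum_i w_ij * delta_i
    return W1.T @ delta_hidden
```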

A Finite-Difference Approximation As an alternative, the derivative can be estimated by a finite-difference approximation [44: 147]

∂E/∂wij ≈ [E(wij + ε) - E(wij - ε)] / (2ε)        (5.21)
where ε << 1 is a small offset. This is a weight perturbation technique [191], [192], [125]. Each weight must be perturbed twice and the error must be reevaluated for each perturbation. In a serial implementation, the error measurement takes O(W) time, so the total time required scales like O(W²), where W is the number of weights in the network. This is slower than back-propagation, which takes O(W) time to find the derivative, but it is robust and simple to implement.
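
A sketch of the weight-perturbation estimate, assuming the network error is available as a scalar function of a flat weight vector (the callable, the flat-vector layout, and the value of ε are our assumptions):

```python
import numpy as np

def fd_weight_gradient(error_fn, weights, eps=1e-5):
    """Central-difference estimate of dE/dw for every weight (equation 5.21).

    error_fn : callable returning the scalar error E for a given weight vector
    weights  : 1-D array holding all W weights of the network
    Each weight is perturbed twice, so the cost is 2W error evaluations;
    at O(W) per evaluation the total time scales like O(W**2).
    """
    grad = np.zeros_like(weights, dtype=float)
    for i in range(weights.size):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (error_fn(w_plus) - error_fn(w_minus)) / (2.0 * eps)
    return grad
```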

The method described perturbs each weight separately. A node perturbation technique is more efficient [44]. (The Madaline III learning rule [403] is similar.) Recall from equation 5.10 that δi is the derivative of the error with respect to the node input sum, ai. Instead of perturbing the weights, the ai values of the hidden nodes are perturbed to obtain an approximation of the node deltas

δi ≈ [Ep(ai + ε) - Ep(ai - ε)] / (2ε)        (5.22)

Then equation 5.20 is used to calculate the gradient with respect to the weights. Each node must be perturbed twice and the error reevaluated. If there are H hidden nodes and W weights, this takes O(2HW) steps. Calculation of Ep takes O(W) time, so the overall time scales like O(HW). Depending on the relative sizes of H and W, this can be much faster than the O(W²) time of the previous method, but it is still longer than the O(W) time of back-propagation. Summed weight perturbation [125] is a hybrid of weight and node perturbation methods.
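
A sketch of node perturbation for the same illustrative one-hidden-layer sigmoid network used above (all names and the choice of ε are ours): the hidden-node input sums are perturbed to estimate the deltas (equation 5.22), which are then combined with the incoming activations via equation 5.20.

```python
import numpy as np

def node_perturbation_gradient(W1, W2, x, d, eps=1e-5):
    """Estimate dEp/dW1 by perturbing the hidden-node input sums (equation 5.22)."""
    f = lambda a: 1.0 / (1.0 + np.exp(-a))

    def error_with_hidden_sums(a1):
        """Feedforward SSE error Ep when the hidden input sums are forced to a1."""
        y2 = f(W2 @ f(a1))
        return 0.5 * np.sum((d - y2) ** 2)

    a1 = W1 @ x                              # unperturbed hidden input sums
    deltas = np.zeros_like(a1, dtype=float)
    for i in range(a1.size):
        a_plus, a_minus = a1.copy(), a1.copy()
        a_plus[i] += eps
        a_minus[i] -= eps
        deltas[i] = (error_with_hidden_sums(a_plus)
                     - error_with_hidden_sums(a_minus)) / (2.0 * eps)

    # Equation 5.20, with y_j = x_j for the first-layer weights
    return np.outer(deltas, x)
```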

Although finite-difference techniques are slower than back-propagation, they are simple to implement and useful for verifying the correctness of more efficient calculations. They are also useful for training systems where analytic derivative calculation is not possible. Electronic circuit implementations, for example, may lack dedicated circuitry to do the back-propagation calculations. Finite-difference methods make it possible to estimate the derivatives using only feedforward calculations. In addition, they automatically account for nonideal circuit effects and faults that may occur in real systems.