
Chapter 5 - Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

5.2 Back-Propagation: The Derivative Calculation

Having obtained the outputs and calculated the error, the next step is to calculate the derivative of the error with respect to the weights. First we note that ESSE = Σp Ep is just the sum of the individual pattern errors, so the total derivative is just the sum of the per-pattern derivatives

∂ESSE/∂wij = Σp ∂Ep/∂wij        (5.8)
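
Equation 5.8 simply says that the batch gradient is accumulated pattern by pattern. As a minimal sketch (the flat weight vector, the callable, and all names are our assumptions, not the book's), any per-pattern gradient routine, such as the one sketched later in this section, can be summed this way:

```python
import numpy as np

def batch_gradient(per_pattern_grad, weights, patterns, targets):
    """Total SSE gradient as the sum of per-pattern gradients (equation 5.8).

    per_pattern_grad : callable(weights, x, d) -> gradient array for one pattern
    weights          : 1-D array holding all network weights
    """
    total = np.zeros_like(weights, dtype=float)
    for x, d in zip(patterns, targets):
        total += per_pattern_grad(weights, x, d)
    return total
```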

The thing that makes back-propagation (the derivative calculation) efficient is how the operation is decomposed and the ordering of the steps. The derivative can be written

∂Ep/∂wij = Σk (∂Ep/∂ak)(∂ak/∂wij)        (5.9)

where the index k runs over all nodes and ak is the weighted-sum input for node k obtained in equation 5.3. It is convenient to first calculate a value δi for each node i

δi ≡ ∂Ep/∂ai        (5.10)
   = (∂Ep/∂yi)(∂yi/∂ai)        (5.11)

which measures the contribution of ai to the error on the current pattern. For simplicity, pattern indexes p are omitted on yi, ai, and other variables below.

For output nodes, ∂Ep/∂ak is obtained directly

δk = ∂Ep/∂ak = (∂Ep/∂yk)(∂yk/∂ak)        (5.12)

The first term is obtained from equation 5.7,

∂Ep/∂yk = -(dk - yk)        (5.13)

(This is the SSE result; different expressions will be obtained for other error functions.) The second term

∂yk/∂ak = f'(ak)        (5.14)

is just the slope f'k ≡ f'(ak) of the node nonlinearity at its current activation value. The sigmoid is convenient to use because f' is a simple function of the node output: f'(ak) = y(1 - y), where y = f(ak). The tanh function is also convenient: f'(ak) = 1 - y².
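
As a small illustrative sketch (the function names are ours, not the book's), both slopes can be evaluated from the stored node outputs alone, so no extra function evaluation is needed in the backward pass:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid y = f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_slope(y):
    """f'(a) for the sigmoid, written in terms of the output y = f(a)."""
    return y * (1.0 - y)

def tanh_slope(y):
    """f'(a) for tanh, written in terms of the output y = tanh(a)."""
    return 1.0 - y ** 2
```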

For hidden nodes, δi is obtained indirectly. Hidden nodes can influence the error only through their effect on the nodes k to which they send output connections, so

∂Ep/∂ai = Σk (∂Ep/∂ak)(∂ak/∂ai)        (5.15)

But the first factor is just the δk of node k, so

δi = Σk δk (∂ak/∂ai)        (5.16)

The second factor is obtained by noting that if node i connects directly to node k, then ∂ak/∂ai = f'i wki; otherwise it is zero. So we end up with

δi = f'i Σk wki δk        (5.17)

(for hidden nodes). In other words, δi is a weighted sum of the δk values of nodes k to which it has connections wki (see figure 5.3).

Because δk must be calculated before δi, i < k, the process starts at the output nodes and works backward toward the inputs, hence the name "back-propagation." First δ values are calculated for the output nodes, then values are calculated for nodes that send connections to the outputs, then values are calculated for nodes two steps removed from the outputs, and so forth.

To summarize so far,

δi = -f'i (di - yi)        (output nodes)
δi = f'i Σk wki δk         (hidden nodes)        (5.18)

For output nodes, δi depends only on the error di - yi and the local slope f'i of the node activation function. For hidden nodes, δi is a weighted sum of the δs of all the nodes it connects to, times its own slope f'i. Because of the way the nodes are indexed, all delta values can be updated in a single sweep through the nodes in reverse order. In layered networks, delta values are first evaluated at the output nodes based on the current pattern errors, the last hidden layer is then evaluated based on the output delta values, the second-to-last hidden layer is updated based on the values of the last hidden layer, and so on backwards to the input layer. Normally it is not necessary to calculate delta values for the input layer, so the process usually stops with the first hidden layer.
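
The backward sweep can be sketched in a few lines for a fully connected layered network of sigmoid units. This is an illustrative reading of equations 5.13, 5.14, and 5.17, not code from the book; the data layout (a list of weight matrices and stored layer outputs) and all names are our assumptions.

```python
import numpy as np

def backward_deltas(weights, outputs, target):
    """Backward sweep: deltas[l][i] = dEp/da_i for every node of a layered sigmoid net.

    weights[l] : matrix feeding layer l+1 from layer l, shape (n_{l+1}, n_l)
    outputs[l] : stored node outputs y of layer l (outputs[0] is the input pattern)
    target     : desired output vector d for the current pattern
    """
    slope = lambda y: y * (1.0 - y)          # sigmoid slope from the output
    L = len(outputs) - 1                     # index of the output layer
    deltas = [None] * (L + 1)

    # Output layer: delta_k = -(d_k - y_k) * f'(a_k)
    deltas[L] = -(target - outputs[L]) * slope(outputs[L])

    # Hidden layers, swept in reverse order: delta_i = f'(a_i) * sum_k w_ki delta_k
    for l in range(L - 1, 0, -1):
        deltas[l] = slope(outputs[l]) * (weights[l].T @ deltas[l + 1])

    return deltas
```

The loop stops at the first hidden layer, matching the usual practice of not computing input-layer deltas.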

Figure 5.3: Backward propagation. Node deltas are calculated in the backward pass, starting at the output nodes and proceeding backwards. At each hidden node i, δi is calculated as a weighted linear combination of the values δk, k > i, of the nodes k to which node i sends outputs. Note that the δ values travel backward against the normal direction of the connecting links.

Having obtained the node deltas, it is an easy step to find the partial derivatives ∂Ep/∂wij with respect to the weights. The second term in (5.9) is ∂ak/∂wij. Because ak is a simple linear sum, this is zero if k ≠ i; otherwise

∂ai/∂wij = yj        (5.19)

where yj is the output activation of node j. Finally, from (5.9), the derivative of pattern error Ep with respect to weight wij is then

∂Ep/∂wij = δi yj        (5.20)

the product of the delta value at node i and the output of node j.
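
Putting equations 5.18 and 5.20 together, a hedged end-to-end sketch for a one-hidden-layer sigmoid network (our own construction and naming; biases are omitted for brevity) looks like this:

```python
import numpy as np

def pattern_gradient(W1, W2, x, d):
    """Per-pattern SSE gradient for a network y = f(W2 f(W1 x)) with sigmoid f.

    W1 : (n_hidden, n_in)  weights into the hidden layer
    W2 : (n_out, n_hidden) weights into the output layer
    x  : input pattern,  d : desired output pattern
    Returns (dEp/dW1, dEp/dW2) for Ep = 0.5 * sum_k (d_k - y_k)**2.
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid
    fp = lambda y: y * (1.0 - y)             # its slope, from the output

    # Forward pass
    y1 = f(W1 @ x)                           # hidden outputs
    y2 = f(W2 @ y1)                          # network outputs

    # Backward sweep (equation 5.18)
    delta2 = -(d - y2) * fp(y2)              # output-node deltas
    delta1 = fp(y1) * (W2.T @ delta2)        # hidden-node deltas

    # Weight derivatives, dEp/dw_ij = delta_i * y_j (equation 5.20)
    return np.outer(delta1, x), np.outer(delta2, y1)
```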

Derivatives with Respect to the Inputs Normally, delta values are not calculated for the input nodes because they are not needed to adjust the weights. Notice, however, that for input nodes δi = ∂E/∂ai is the derivative of the error with respect to the input. Also, if the network has a single output y, then setting ∂E/∂y = 1 and back-propagating gives the derivative of the output with respect to the input, ∂y/∂x. There are, of course, useful applications for these derivatives. They can be used, for example, in inverse problems where we seek input values that produce a particular output.
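
For instance, under the assumption of a single linear output unit (an illustrative setup of our own, not the book's example), the derivative of the output with respect to the inputs is obtained by seeding the backward pass with ∂E/∂y = 1 and propagating one step past the first hidden layer, down to the inputs:

```python
import numpy as np

def output_input_derivative(W1, W2, x):
    """dy/dx for a net with sigmoid hidden units and a single linear output y = W2 f(W1 x)."""
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    fp = lambda y: y * (1.0 - y)

    y1 = f(W1 @ x)                           # hidden outputs
    delta_out = np.ones(W2.shape[0])         # seed: dE/dy = 1 at the (single) output
    delta_hidden = fp(y1) * (W2.T @ delta_out)
    # One extra backward step gives the input deltas: dy/dx_j = sum_i w_ij * delta_i
    return W1.T @ delta_hidden
```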

A Finite-Difference Approximation As an alternative, the derivative can be estimated by a finite-difference approximation [44: 147]

∂E/∂wij ≈ [E(wij + ε) - E(wij - ε)] / (2ε)        (5.21)
where ε << 1 is a small offset. This is a weight perturbation technique [191], [192], [125]. Each weight must be perturbed twice and the error must be reevaluated for each perturbation. In a serial implementation, the error measurement takes O(W) time, so the total time required scales like O(W²), where W is the number of weights in the network. This is slower than back-propagation, which takes O(W) time to find the derivative, but it is robust and simple to implement.
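
A sketch of the weight-perturbation estimate, assuming the network error is available as a scalar function of a flat weight vector (the callable, the flat-vector layout, and the value of ε are our assumptions):

```python
import numpy as np

def fd_weight_gradient(error_fn, weights, eps=1e-5):
    """Central-difference estimate of dE/dw for every weight (equation 5.21).

    error_fn : callable returning the scalar error E for a given weight vector
    weights  : 1-D array holding all W weights of the network
    Each weight is perturbed twice, so the cost is 2W error evaluations;
    at O(W) per evaluation the total time scales like O(W**2).
    """
    grad = np.zeros_like(weights, dtype=float)
    for i in range(weights.size):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (error_fn(w_plus) - error_fn(w_minus)) / (2.0 * eps)
    return grad
```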

The method described perturbs each weight separately. A node perturbation technique is more efficient [44]. (The Madaline III learning rule [403] is similar.) Recall from equation 5.10 that δi is the derivative of the error with respect to the node input sum, ai. Instead of perturbing the weights, the ai values of the hidden nodes are perturbed to obtain an approximation of the node deltas

δi ≈ [Ep(ai + ε) - Ep(ai - ε)] / (2ε)        (5.22)

Then equation 5.20 is used to calculate the gradient with respect to the weights. Each node must be perturbed twice and the error reevaluated. If there are H hidden nodes and W weights, this takes O(2HW) steps. Calculation of Ep takes O(W) time, so the overall time scales like O(HW). Depending on the relative sizes of H and W, this can be much faster than the O(W²) time of the previous method, but it is still longer than the O(W) time of back-propagation. Summed weight perturbation [125] is a hybrid of weight and node perturbation methods.
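
A sketch of node perturbation for the same illustrative one-hidden-layer sigmoid network used above (all names and the choice of ε are ours): the hidden-node input sums are perturbed to estimate the deltas (equation 5.22), which are then combined with the incoming activations via equation 5.20.

```python
import numpy as np

def node_perturbation_gradient(W1, W2, x, d, eps=1e-5):
    """Estimate dEp/dW1 by perturbing the hidden-node input sums (equation 5.22)."""
    f = lambda a: 1.0 / (1.0 + np.exp(-a))

    def error_with_hidden_sums(a1):
        """Feedforward SSE error Ep when the hidden input sums are forced to a1."""
        y2 = f(W2 @ f(a1))
        return 0.5 * np.sum((d - y2) ** 2)

    a1 = W1 @ x                              # unperturbed hidden input sums
    deltas = np.zeros_like(a1, dtype=float)
    for i in range(a1.size):
        a_plus, a_minus = a1.copy(), a1.copy()
        a_plus[i] += eps
        a_minus[i] -= eps
        deltas[i] = (error_with_hidden_sums(a_plus)
                     - error_with_hidden_sums(a_minus)) / (2.0 * eps)

    # Equation 5.20, with y_j = x_j for the first-layer weights
    return np.outer(deltas, x)
```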

Although finite-difference techniques are slower than back-propagation, they are simple to implement and useful for verifying the correctness of more efficient calculations. They are also useful for training systems where analytic derivative calculation is not possible. Electronic circuit implementations, for example, may lack dedicated circuitry to do the back-propagation calculations. Finite-difference methods make it possible to estimate the derivatives using only feedforward calculations. In addition, they automatically account for nonideal circuit effects and faults that may occur in real systems.