Chapter 5 - Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

5.1 Preliminaries

Back-propagation can be applied to any feedforward network with differentiable activation functions. In particular, it is not necessary that it have a layered structure. An arbitrary feedforward network will be assumed in the following.

Feedforward Indexing For simplicity, assume the nodes are indexed so that i > j implies that node i follows node j in terms of dependency. That is, the state of node i may depend, perhaps indirectly, on the state of node j, but node j < i does not depend on node i. Such an index order is possible in any feedforward network, though it will not be unique in general. The advantage of this format is that it works in any feedforward network, including those with irregular structure and short-cut (layer-skipping) connections. In simulations, it also avoids the need to handle each layer separately and to keep track of layer indexes. Of course, this indexing scheme is compatible with standard layered structures.

Because the dependencies are transmitted by the connection weights, connections are allowed from nodes with low indexes to nodes with higher indexes, but not vice versa. If wij denotes the weight to node i from node j, then any forward link wij with j < i is allowed, but backward links are prohibited: wji ≡ 0 for j < i. Figure 5.1 illustrates a possibility. Normally, the system inputs and the bias node will have low indexes since they potentially affect all other nodes, and outputs will have high indexes.
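As a concrete sketch of this indexing constraint (our illustration, not the book's), one convenient representation is a single weight matrix that is strictly lower triangular: entry wij may be nonzero only when j < i. The names below (check_feedforward, W) are hypothetical.

```python
import numpy as np

def check_feedforward(W):
    """Verify that a weight matrix respects feedforward indexing:
    w[i][j] may be nonzero only for j < i, so no node receives
    input from a node with an equal or higher index."""
    n = W.shape[0]
    for i in range(n):
        for j in range(i, n):          # j >= i: self or backward links
            if W[i, j] != 0.0:
                return False
    return True

# Example: 5 nodes -- node 0 is the bias, node 1 the external input,
# nodes 2-4 follow, with a short-cut link from the input straight to node 4.
W = np.zeros((5, 5))
W[2, 0] = W[2, 1] = 0.5       # node 2 fed by bias and input
W[3, 0] = W[3, 2] = -0.3      # node 3 fed by bias and node 2
W[4, 1] = W[4, 3] = 0.8       # node 4 fed by input (short-cut) and node 3
print(check_feedforward(W))   # True
```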

Figure 5.1: Feedforward indexing in an unlayered network. The nodes in a feedforward network can always be indexed so that i > j if the state of node i depends (perhaps indirectly) on the state of node j. Arbitrary connections are allowed from nodes with low indexes to nodes with higher indexes, but not vice versa; i > j implies wji ≡ 0. This network has no particular function, but it illustrates short-cut connections, apparently lateral (but still feedforward) connections, and the fact that outputs can be taken from internal nodes.

5.1.1 Forward Propagation

In the forward pass, the network computes an output based on its current inputs. Each node i computes a weighted sum ai of its inputs and passes this through a nonlinearity to obtain the node output yi (see figure 5.2)

a_i = \sum_{j < i} w_{ij} y_j        (5.3)
y_i = f(a_i)        (5.4)

Normally f is a bounded monotonic function such as the tanh or sigmoid. Arbitrary differentiable functions can be used, but sigmoid-like "squashing" functions are standard. The index j in the sum runs over all indexes j < i of nodes that could send input to node i.
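As a small illustrative sketch (ours, not the book's), the logistic sigmoid and tanh squashing functions, together with the derivatives that the backward pass will later need, can be written as:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: a bounded, monotonic, differentiable squashing function."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative of the sigmoid in terms of its output: f'(a) = f(a) * (1 - f(a))."""
    y = sigmoid(a)
    return y * (1.0 - y)

def tanh_prime(a):
    """Derivative of tanh (tanh itself is np.tanh): f'(a) = 1 - tanh(a)^2."""
    return 1.0 - np.tanh(a) ** 2
```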

Figure 5.2: Forward propagation. In the forward pass, the input pattern is propagated through the network to obtain the output. Each node computes a weighted sum of its inputs and passes this through a nonlinearity, typically a sigmoid or tanh function.

If there is no connection to node i from node j, the weight wij is taken to be 0. As usual, it is assumed that there is a bias node with constant activation, ybias = 1, to avoid the need for special handling of the bias weights.

Every node is evaluated in order, starting with the first hidden node and continuing to the last output node. In layered networks, the first hidden layer is updated based on the external inputs, the second hidden layer is updated based on the outputs of the first hidden layer, and so on to the output layer which is updated based on the outputs of the last hidden layer. In software simulations, it is sufficient to evaluate the nodes in order by node index. Because node i does not depend on any nodes k > i, all inputs to node i will be valid when it is evaluated. At the end of the sweep, the system outputs will be available at the output nodes.
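A minimal sketch of this sweep under feedforward indexing, assuming the strictly lower-triangular weight matrix from the earlier sketch, a bias node at index 0 with constant output 1, the external inputs at the next indexes, and a sigmoid nonlinearity; the function and variable names are ours, not the book's.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, x, n_inputs):
    """One forward sweep in feedforward index order.
    Node 0 is the bias (y = 1), nodes 1..n_inputs hold the external inputs,
    and each later node i computes a_i = sum_{j<i} w_ij * y_j, y_i = f(a_i)."""
    n = W.shape[0]
    y = np.zeros(n)
    y[0] = 1.0                    # bias node, constant activation
    y[1:1 + n_inputs] = x         # clamp the external inputs
    for i in range(1 + n_inputs, n):
        a_i = W[i, :i] @ y[:i]    # weighted sum over all earlier nodes
        y[i] = sigmoid(a_i)       # squashing nonlinearity
    return y                      # outputs are read from the high-index nodes

# Usage with the 5-node weight matrix W sketched earlier (one external input,
# output taken from node 4):
#   y = forward(W, x=np.array([0.7]), n_inputs=1)
#   output = y[4]
```

Because node i depends only on nodes with lower indexes, a single left-to-right pass over the node indexes suffices, whether or not the network is layered.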

5.1.2 Error Calculation

Unless the network is perfectly trained, the network outputs will differ somewhat from the desired outputs. The significance of these differences is measured by an error (or cost) function E. In the following, we use the sum-of-squared-errors (SSE) function

E = \frac{1}{2} \sum_{p} \sum_{i} (d_{pi} - y_{pi})^2        (5.5)

where p indexes the patterns in the training set, i indexes the output nodes, and dpi and ypi are, respectively, the desired target and actual network output for the ith output node on the pth pattern. The ½ factor suppresses a factor of 2 later on. One of the reasons SSE is convenient is that errors on different patterns and different outputs are independent; the overall error is just the sum of the individual squared errors

E = \sum_{p} E_p        (5.6)
E_p = \frac{1}{2} \sum_{i} (d_{pi} - y_{pi})^2        (5.7)
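As a worked sketch of equations 5.5 through 5.7 (our code, with illustrative names), assuming D and Y are arrays of desired and actual outputs indexed by pattern p along rows and output node i along columns:

```python
import numpy as np

def sse(D, Y):
    """Total sum-of-squared-errors over all patterns and outputs (eq. 5.5):
    E = 1/2 * sum_p sum_i (d_pi - y_pi)^2."""
    return 0.5 * np.sum((D - Y) ** 2)

def sse_per_pattern(D, Y):
    """Per-pattern errors E_p (eq. 5.7); summing them gives E (eq. 5.6)."""
    return 0.5 * np.sum((D - Y) ** 2, axis=1)

# Because the squared errors are independent across patterns and outputs,
# the total error is just the sum of the per-pattern errors:
#   sse(D, Y) == sse_per_pattern(D, Y).sum()   (up to floating-point rounding)
```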