Chapter 12 - Constructive Methods

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

12.1 Dynamic Node Creation

For networks trained by back-propagation and similar methods, the most common procedure is to add new units when the error E reaches a (nonzero) plateau and stops decreasing, presumably because the network is trapped in a local minimum. The triggering condition used by Ash [11] (summarized in [277]) is

    |E(t) − E(t − δ)| / E(t₀) < ΔT,    t ≥ t₀ + δ        (12.1)

where t₀ is the time at which the previous new node was added, δ is the interval over which the slope is measured, and ΔT is the trigger value (figure 12.1). The requirement t ≥ t₀ + δ allows the network some time to adapt to the addition of the previous node before another is added.

Figure 12.1: In many constructive algorithms, new units are added when the rate of improvement of the training error becomes small. Normalization by the magnitude of the error E(t₀) obtained after the previous unit was added makes the triggering condition less dependent on the size of the error.
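
As a concrete illustration, the triggering test of equation 12.1 might be coded as below. This is a minimal sketch that assumes only that the training error is recorded once per epoch; the function and argument names are illustrative rather than part of Ash's original algorithm.

    def should_add_node(error_history, t, t0, delta, trigger):
        """Trigger test of equation 12.1 (illustrative sketch).

        error_history -- training error E recorded once per epoch
        t             -- current epoch
        t0            -- epoch at which the previous node was added
        delta         -- interval over which the slope is measured
        trigger       -- trigger value (Delta_T in equation 12.1)
        """
        if t < t0 + delta:
            return False  # still adapting to the previously added node
        slope = abs(error_history[t] - error_history[t - delta])
        return slope / error_history[t0] < trigger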

The decision to stop is made by observing the size of the largest errors. The reasoning is that with cost functions like MSE, the error can often be reduced by making small improvements on many easy cases while ignoring large errors on a few difficult cases. As a result, the worst error may increase even as the average error decreases. Both errors tend to jump discontinuously when new units are added. The average error is normally small in later stages of learning so discontinuities in it might not be obvious, but the worst error is usually relatively large and often drops significantly when the critical number of units is reached. (This assumes that the data is consistent, however. If two training patterns with identical inputs have different targets, then no number of units will be able to reduce the worst error to zero.)
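
Both quantities are easy to track during training. The following sketch (using NumPy, with illustrative names) computes the average and the worst per-pattern squared error from the network outputs and targets for a batch of patterns.

    import numpy as np

    def error_summary(outputs, targets):
        """Mean and worst per-pattern squared error (illustrative sketch)."""
        per_pattern = np.sum((outputs - targets) ** 2, axis=1)  # one error per pattern
        return per_pattern.mean(), per_pattern.max()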

Hirose, Yamashita, and Hijiya [170] describe a similar method for adding nodes to a single-hidden-layer network when training stalls. The total error is evaluated periodically, for example, after every 100 weight updates. If it has not decreased significantly, for example, by at least 1%, then a new node is added. Note that these values probably depend on problem size and difficulty. For networks with many weights or problems with many training patterns, it may be necessary to wait longer before deciding that the network is stalled. An algorithm that does not wait long enough could add nodes continuously.
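
A periodic stall test in this spirit might look as follows. The 100-update period and 1% threshold are the example values quoted above and, as noted, both likely need adjustment to the size and difficulty of the problem.

    def stalled(errors, period=100, min_improvement=0.01):
        """Stall test in the spirit of Hirose et al. (illustrative sketch).

        errors -- total training error recorded once per weight update
        Returns True if the error has not dropped by at least
        min_improvement (e.g. 1%) since the check one period ago.
        """
        if len(errors) <= period:
            return False  # not enough history to compare against
        previous, current = errors[-1 - period], errors[-1]
        return current > previous * (1.0 - min_improvement)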

New nodes are added with small random initial weights. This perturbs the solution and usually causes the error to increase. The error normally decreases in subsequent training, however, which prevents the algorithm from immediately adding another new node in the next cycle. The perturbation could be avoided by initializing the new node with zero-valued weights, but zero-valued weights tend to remain near zero under back-propagation, so E(t) would remain flat and a new node would be added at the next opportunity, leading to a proliferation of redundant nodes with very small weights.
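
For a single-hidden-layer network stored as two weight matrices, adding such a unit amounts to appending one row of small random input weights and one column of small random output weights. The sketch below assumes this representation and uses NumPy; the names and the scale value are illustrative.

    import numpy as np

    def add_hidden_unit(W_in, W_out, scale=0.01, rng=None):
        """Append one hidden unit with small random weights (illustrative sketch).

        W_in  -- (n_hidden, n_inputs) input-to-hidden weights
        W_out -- (n_outputs, n_hidden) hidden-to-output weights
        Small random values (rather than zeros) are used so the new unit
        is not stranded in the flat region near zero under back-propagation.
        """
        rng = np.random.default_rng() if rng is None else rng
        new_row = scale * rng.standard_normal((1, W_in.shape[1]))
        new_col = scale * rng.standard_normal((W_out.shape[0], 1))
        return np.vstack([W_in, new_row]), np.hstack([W_out, new_col])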

Since this procedure can only add nodes, the network may become very large. This is generally undesirable so the training phase is followed by a pruning phase which removes unnecessary nodes. Any reasonable pruning algorithm would probably work; the following method is used by Hirose, Yamashita, and Hijiya [170]. One node is removed at a time and the network is retrained; if it converges, then another hidden node is removed and so on. At some point, the network will not converge after removal of a node; the node is restored and the process halted. In the simulations described, the most recently added hidden nodes were removed first.
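
The pruning phase reduces to a simple loop. The sketch below leaves the removal, restoration, and retraining steps as hypothetical callables supplied by the surrounding training code, since these depend on how the network is represented; only the control flow of the procedure described above is shown.

    def prune_hidden_units(network, remove_newest, restore, retrain_converges):
        """Post-training pruning loop (illustrative sketch).

        remove_newest(network)      -- remove the most recently added hidden
                                       unit and return it (None if none left)
        restore(network, unit)      -- put a removed unit back
        retrain_converges(network)  -- retrain; return True if it converges
        """
        while True:
            unit = remove_newest(network)
            if unit is None:
                return network            # nothing left to remove
            if not retrain_converges(network):
                restore(network, unit)    # removal went too far; undo and stop
                return network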