12.1 Dynamic Node Creation
For networks trained by back-propagation and similar
methods, the most common procedure is to add new units when the error E reaches a (nonzero) plateau and stops decreasing,
presumably because the network is trapped in a local minimum. The triggering condition used by Ash
[11] (summarized in [277]) is
$$\frac{|E(t) - E(t-\delta)|}{E(t_o)} < \Delta_T, \qquad t \geq t_o + \delta \tag{12.1}$$
where $t_o$ is the time at which the previous new node was added, $\delta$ is an interval over which the slope is measured, and $\Delta_T$ is the trigger value (figure 12.1). The requirement $t \geq t_o + \delta$ allows the network some time to adapt to the addition of the previous node before another is added.
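Stated procedurally, the trigger test amounts to something like the following sketch; the function name, the argument names, and the stored error history are illustrative assumptions, not part of Ash's published algorithm.

```python
def should_add_node(error_history, t, t_o, delta, trigger):
    """Return True if the trigger condition of equation 12.1 is satisfied.

    error_history -- sequence of error values E(0), E(1), ..., E(t)
    t_o           -- step at which the previous node was added
    delta         -- interval over which the slope is measured
    trigger       -- the trigger value Delta_T
    """
    # Give the network time to adapt to the previously added node.
    if t < t_o + delta:
        return False
    # Flatness of the error curve over the last delta steps, normalized
    # by the error at the time the previous node was added.
    slope = abs(error_history[t] - error_history[t - delta]) / error_history[t_o]
    return slope < trigger
```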
The decision to stop adding nodes is made by observing the size of the largest
errors. The reasoning is that with cost functions like MSE, the error can often
be reduced by making small improvements on many easy cases while ignoring large
errors on a few difficult cases. As a result, the worst error may increase even
as the average error decreases. Both errors tend to jump discontinuously when
new units are added. The average error is normally small in later stages of
learning so discontinuities in it might not be obvious, but the worst error is
usually relatively large and often drops significantly when the critical number
of units is reached. (This assumes that the data is consistent, however. If two
training patterns with identical inputs have different targets, then no number
of units will be able to reduce the worst error to zero.)
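A minimal sketch of such a stopping test is given below; the worst-error tolerance and the assumption that the network outputs have been collected into arrays are choices made for the example, not part of the original method.

```python
import numpy as np

def average_and_worst_error(outputs, targets):
    """Return (mean squared error, largest absolute error) over the data."""
    diffs = np.asarray(outputs) - np.asarray(targets)
    return np.mean(diffs ** 2), np.max(np.abs(diffs))

def should_stop(outputs, targets, worst_tolerance):
    # Stop adding nodes only when the worst-case error, not just the
    # average, has fallen below the chosen tolerance.
    _, worst = average_and_worst_error(outputs, targets)
    return worst < worst_tolerance
```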
Hirose, Yamashita, and Hijiya [170] describe a similar method for adding nodes to a
single-hidden-layer network when training stalls. The total error is evaluated
periodically, for example, after every 100 weight updates. If it has not
decreased significantly, for example, by at least 1%, then a new node is added.
Note that these values probably depend on problem size and difficulty. For
networks with many weights or problems with many training patterns, it may be
necessary to wait longer before deciding that the network is stalled. An
algorithm that does not wait long enough could add nodes continuously.
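The stall test might be embedded in a training loop along the following lines. The check interval (100 updates) and the 1% improvement threshold come from the text; the network interface (total_error, weight_update, hidden_count, add_hidden_node), the target_error convergence test, and the max_nodes limit are assumptions made for this sketch.

```python
CHECK_INTERVAL = 100      # weight updates between error checks
MIN_IMPROVEMENT = 0.01    # required fractional decrease in total error (1%)

def train_with_node_addition(net, data, target_error, max_nodes):
    """Train, adding a hidden node whenever progress stalls."""
    previous_error = net.total_error(data)
    updates = 0
    while True:
        net.weight_update(data)            # one back-propagation update
        updates += 1
        if updates % CHECK_INTERVAL != 0:
            continue
        current_error = net.total_error(data)
        if current_error <= target_error:  # training has converged
            break
        # If the error has not dropped by at least 1% since the last check,
        # training is considered stalled and a new node is added (unless the
        # network has already reached its size limit).
        if (current_error > previous_error * (1.0 - MIN_IMPROVEMENT)
                and net.hidden_count() < max_nodes):
            net.add_hidden_node()
        previous_error = current_error
```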
New nodes are added with small random initial weights. This
perturbs the solution and usually causes the error to increase. The error
normally decreases in subsequent training, however, which prevents the algorithm
from immediately adding another new node in the next cycle. The perturbation
could be avoided by initializing the new node with zero-valued weights, but
zero-valued weights tend to remain near zero under back-propagation so E(t) would
remain flat and a new node would be added at the next opportunity, leading to a
proliferation of redundant nodes with very small weights.
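For a single-hidden-layer network whose weights are stored as two matrices, adding a unit might look like the sketch below; the matrix layout and the 0.01 scale of the random initial weights are assumptions for illustration, not values from the text.

```python
import numpy as np

def add_hidden_node(W1, W2, scale=0.01, rng=None):
    """Append one hidden unit to a single-hidden-layer network.

    W1 -- (n_hidden, n_inputs) input-to-hidden weight matrix
    W2 -- (n_outputs, n_hidden) hidden-to-output weight matrix
    Returns enlarged copies of W1 and W2.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_inputs = W1.shape[1]
    n_outputs = W2.shape[0]
    # Small random weights perturb the existing solution only slightly;
    # exactly zero weights would tend to stay near zero under back-propagation.
    new_row = rng.normal(0.0, scale, size=(1, n_inputs))
    new_col = rng.normal(0.0, scale, size=(n_outputs, 1))
    return np.vstack([W1, new_row]), np.hstack([W2, new_col])
```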
Since this procedure can only add nodes, the network may
become very large. This is generally undesirable, so the training phase is
followed by a pruning phase which removes unnecessary nodes. Any reasonable
pruning algorithm would probably work; the following method is used by Hirose,
Yamashita, and Hijiya [170]. One node is removed at a time and the network is
retrained; if it converges, then another hidden node is removed and so on. At
some point, the network will not converge after removal of a node; the node is
restored and the process halted. In the simulations described, the most recently
added hidden nodes were removed first.
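A compact sketch of this prune-and-retrain loop follows; the network interface (hidden_count, remove_last_hidden_node, and a retrain method that returns True on convergence) is assumed for the example rather than taken from the original description.

```python
import copy

def prune(net, data):
    """Remove hidden nodes one at a time, most recently added first,
    stopping when removal of another node prevents convergence."""
    while net.hidden_count() > 1:
        candidate = copy.deepcopy(net)
        candidate.remove_last_hidden_node()
        if candidate.retrain(data):   # assumed True if retraining converges
            net = candidate           # keep the smaller network
        else:
            break                     # discard the candidate; the previous
                                      # network is left intact
    return net
```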