13.5 Discussion
Pruning algorithms have been proposed as a way to exploit
the learning advantages of larger systems while avoiding overfitting problems.
Many of the methods listed either calculate the sensitivity of the error to the
removal of elements or add terms to the error function that favor smaller
networks.
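To make the distinction concrete (the notation here is generic rather than that of any one method), a sensitivity method ranks elements by an estimate of how much the error would change if an element were removed, for example
$$
S_{ij} = E(\text{with } w_{ij} \text{ set to zero}) - E(\text{with } w_{ij} \text{ intact}),
$$
while a penalty-term method minimizes an augmented objective of the form
$$
E = E_0 + \lambda \sum_{ij} c(w_{ij}),
$$
where $E_0$ is the usual training error, $c(\cdot)$ is a cost favoring small or zero weights, and $\lambda$ controls the trade-off between fit and network size.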
A disadvantage of most of the sensitivity methods is that they do
not detect correlated elements. The sensitivities are estimated under the
assumption that $w_{ij}$ is the only weight to be deleted (or node $i$ is the only
node to be deleted). After the first element is removed, the remaining
sensitivities are not necessarily valid for the smaller network. An extreme
example is two nodes whose effects cancel at the output. As a pair they have no
effect on the output, but each has a strong effect individually, so neither will
be removed. Partially correlated nodes are a less extreme but more common
example. The optimal brain surgeon method, which uses the full Hessian, is
better able to deal with correlated weights.
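As a reminder of why the full Hessian helps (using the standard notation for these methods, with $\mathbf{H}$ the Hessian of the error with respect to the weights and $\mathbf{e}_q$ the unit vector selecting weight $q$), optimal brain damage scores weight $q$ from the diagonal of $\mathbf{H}$ alone, while optimal brain surgeon uses the full inverse Hessian and also prescribes an adjustment of the surviving weights:
$$
S_q^{\mathrm{OBD}} \approx \tfrac{1}{2} H_{qq} w_q^2, \qquad
S_q^{\mathrm{OBS}} = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}}, \qquad
\delta\mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q.
$$
Because the off-diagonal terms of $\mathbf{H}$ enter through $\mathbf{H}^{-1}$, correlated weights influence one another's saliencies, and the $\delta\mathbf{w}$ update lets the remaining weights compensate for the deletion.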
In ordinary training, there is the question of when to stop training. With pruning algorithms, there is the similar question of when to stop
pruning. If separate training and validation sets are available, the choice may
be clear; if not, it may be somewhat arbitrary. Sensitivity methods delete
elements with the smallest sensitivities first, and there may be a natural
stopping point where the sensitivity jumps suddenly. Penalty-term methods
control the amount of pruning by balancing the scaling factors of the error
terms. This choice may be tricky, however, if it must be made before training
begins, so some methods control these parameters dynamically. A compensating
advantage of the penalty-term methods is that training and pruning are
effectively done in parallel, so the network can adapt to minimize errors
introduced by pruning.
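The stopping question can be illustrated with a minimal sketch (not an algorithm from this chapter: a linear model stands in for the network, squared weight magnitude stands in for the saliency, and the data sizes and names are invented for illustration). The loop below removes the least salient weight, refits the survivors, and stops as soon as error on a held-out validation set rises.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 candidate inputs, only the first 5 actually matter.
n_in = 20
true_w = np.zeros(n_in)
true_w[:5] = rng.normal(size=5)
X_tr, X_va = rng.normal(size=(100, n_in)), rng.normal(size=(100, n_in))
y_tr = X_tr @ true_w + 0.1 * rng.normal(size=100)
y_va = X_va @ true_w + 0.1 * rng.normal(size=100)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# "Large" trained model: fit all weights by least squares.
w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
active = np.ones(n_in, dtype=bool)
best_w, best_val = w.copy(), mse(w, X_va, y_va)

while active.sum() > 1:
    saliency = np.where(active, w ** 2, np.inf)   # magnitude as a stand-in for sensitivity
    active[np.argmin(saliency)] = False           # delete the least salient weight
    w = np.zeros(n_in)                            # refit survivors so they can compensate
    w[active] = np.linalg.lstsq(X_tr[:, active], y_tr, rcond=None)[0]
    val = mse(w, X_va, y_va)
    if val > best_val:                            # validation error rises: stop pruning
        break
    best_w, best_val = w.copy(), val

print("weights kept:", int(np.count_nonzero(best_w)), "validation MSE:", round(best_val, 4))

When no separate validation set is available, the same loop might instead watch for a sudden jump in the smallest remaining saliency, as described above.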
Although pruning and penalty-term methods may often be faster than searching for and training a minimum-size network, they do not necessarily reduce training times: larger networks may take longer to train simply because of their size, and the pruning itself takes time. The goal, however, is improved generalization rather than training speed.
It should be noted that pruning alone is not sufficient to guarantee good generalization. Figure 14.11, for example, illustrates near-minimal networks that probably generalize differently. (Except for some possibly unneeded bias connections, the 2/2/1 networks are minimal for the 2-bit XOR problem unless short-cut connections are used.) No connections can be removed, so pruning is not an option.
It should also be noted that there is no guarantee that
pruning will always lead to a network that generalizes better. It may be that
the larger network allows an internal representation that is unavailable to a
smaller network, and it may not be possible to change from one form to the other
by simply removing elements. Overfitting might be so severe that it cannot be
corrected by pruning alone. It may also turn out that pruning simply removes
redundant elements but leaves the overall response essentially unchanged. That
is, elements may be removed because they have little effect on the output;
removing them does not change the basic solution and does not lead to better
generalization. If the pruning algorithm removes elements without adjusting
remaining weights, the response of the pruned network is unlikely to be smoother
than the original—especially when small weights are more
likely to be removed than large ones. Figure
13.5 illustrates a case where the pruned network is likely to generalize
worse than the original.