Chapter 13 - Pruning Algorithms

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

13.5 Discussion

Pruning algorithms have been proposed as a way to exploit the learning advantages of larger systems while avoiding overfitting problems. Many of the methods listed either calculate the sensitivity of the error to the removal of elements or add terms to the error function that favor smaller networks.
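To make the distinction concrete, here is a minimal sketch (ours, not code from the book) of one representative from each family: a diagonal-Hessian saliency of the Optimal Brain Damage form, which estimates how much the error would rise if a single weight were set to zero, and a training objective with a simple quadratic penalty term whose scale factor lam is an assumed example value.

```python
# Illustrative sketch (not from the text) of the two pruning families above.
import numpy as np

def obd_saliency(weights, hessian_diag):
    """Sensitivity-style estimate: predicted error increase for deleting each
    weight individually, s_i = 0.5 * h_ii * w_i**2 (diagonal-Hessian form)."""
    return 0.5 * hessian_diag * weights ** 2

def penalized_error(train_error, weights, lam=1e-3):
    """Penalty-term style: training error plus a quadratic term that favors
    smaller weights; lam is an assumed scale factor, not a recommended value."""
    return train_error + lam * np.sum(weights ** 2)
```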

Figure 13.5: A pruning problem. The original network is underconstrained for the 2-input XOR problem and chooses a correct but unexpected solution. (The training points are the four corners of the square.) A naive pruning algorithm is able to remove many redundant elements, but the resulting network is unlikely to generalize better than the original.

A disadvantage of most of the sensitivity methods is that they do not detect correlated elements. The sensitivities are estimated under the assumption that wij is the only weight to be deleted (or that node i is the only node to be deleted). After the first element is removed, the remaining sensitivities are not necessarily valid for the smaller network. An extreme example is two nodes whose effects cancel at the output. As a pair they have no effect on the output, but each has a strong effect individually, so neither will be removed. Partially correlated nodes are a less extreme but more common example. The optimal brain surgeon method, which uses the full Hessian, is better able to deal with correlated weights.
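The canceling-pair case is easy to reproduce numerically. In the toy example below (an illustration, not taken from the book), two weights feed the output through identical activations and sum to zero; deleting either one alone raises the sum-of-squares error sharply, yet deleting both together leaves the error unchanged, which is exactly the situation a per-element sensitivity estimate misses.

```python
# Toy canceling-pair example (assumed data, not from the book).
import numpy as np

x = np.ones(10)            # two hidden units whose activations happen to be identical
t = np.zeros(10)           # targets
w = np.array([5.0, -5.0])  # the pair cancels: w[0]*x + w[1]*x == 0

def sse(w):
    y = w[0] * x + w[1] * x
    return 0.5 * np.sum((y - t) ** 2)

base = sse(w)                                  # 0.0 at this solution
drop_w0 = sse(np.array([0.0, w[1]])) - base    # large increase (125.0)
drop_w1 = sse(np.array([w[0], 0.0])) - base    # large increase (125.0)
drop_both = sse(np.array([0.0, 0.0])) - base   # no increase at all (0.0)

print(drop_w0, drop_w1, drop_both)
```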

In the original problem, there is the question of when to stop training. With pruning algorithms, there is the similar question of when to stop pruning. If separate training and validation sets are available, the choice may be clear; if not, it may be somewhat arbitrary. Sensitivity methods delete elements with the smallest sensitivities first and there may be a natural stopping point where the sensitivity jumps suddenly. Penalty-term methods control the amount of pruning by balancing the scaling factors of the error terms. This choice may be tricky, however, if it must be made before training begins, so some methods control these parameters dynamically. A compensating advantage of the penalty-term methods is that training and pruning are effectively done in parallel so the network can adapt to minimize errors introduced by pruning.
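As a concrete (and assumed, rather than prescribed) form of the validation-based stopping rule, the sketch below removes elements in order of increasing magnitude, standing in for any sensitivity ranking, and stops once the validation error has risen on more than a small number of consecutive removals. The evaluate function and the flat weight vector are hypothetical stand-ins for whatever network representation is in use.

```python
# Hedged sketch of a "when to stop pruning" rule using a validation set.
import numpy as np

def prune_until_validation_worsens(weights, evaluate, patience=1):
    """weights: 1-D array of connection weights (0 = already removed).
    evaluate(weights) -> validation error of the network with these weights
    (hypothetical helper supplied by the caller)."""
    best_err = evaluate(weights)
    best_weights = weights.copy()
    misses = 0
    order = np.argsort(np.abs(weights))   # smallest "sensitivity" first
    for idx in order:
        weights[idx] = 0.0                # remove one element
        err = evaluate(weights)
        if err <= best_err:
            best_err, best_weights = err, weights.copy()
            misses = 0
        else:
            misses += 1
            if misses > patience:         # validation error keeps rising: stop
                break
    return best_weights
```

In practice the ordering would come from the sensitivity estimates themselves, and some retraining between removals may be needed; penalty-term methods avoid this bookkeeping by folding the pressure toward small networks into training itself, as noted above.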

Although pruning and penalty-term methods may often be faster than searching for and training a minimum-size network, they do not necessarily reduce training times: larger networks may take longer to train simply because of their size, and pruning itself takes time. The goal, however, is improved generalization rather than training speed.

It should be noted that pruning alone is not a sufficient tool to guarantee good generalization. Figure 14.11, for example, illustrates near-minimal networks that probably generalize differently. (Except for some possibly unneeded bias connections, the 2/2/1 networks are minimal for the 2-bit XOR problem unless short-cut connections are used.) No connections can be removed, so pruning is not an option.

It should also be noted that there is no guarantee that pruning will always lead to a network that generalizes better. It may be that the larger network allows an internal representation that is unavailable to a smaller network, and it may not be possible to change from one form to the other by simply removing elements. Overfitting might be so severe that it cannot be corrected by pruning alone. It may also turn out that pruning simply removes redundant elements but leaves the overall response essentially unchanged. That is, elements may be removed because they have little effect on the output; removing them does not change the basic solution and does not lead to better generalization. If the pruning algorithm removes elements without adjusting the remaining weights, the response of the pruned network is unlikely to be smoother than the original, especially when small weights are more likely to be removed than large ones. Figure 13.5 illustrates a case where the pruned network is likely to generalize worse than the original.