 
 | 
 | |||||
|  | |||||
 
  A rule of thumb for obtaining good generalization is to use the smallest system that will fit the data. Unfortunately, it usually is not obvious what size is best so a common approach is to train a series of networks of various sizes and choose the smallest one that will learn the data. The problem is that this can be time consuming if many networks must be trained before an acceptable one is found. Even if the optimum size were known in advance, the smallest network just complex enough to fit the data may be sensitive to initial conditions and learning parameters. It may be hard to tell if the network is too small to learn the data, if it is simply learning very slowly, or if it is stuck in a local minima due to an unfortunate set of initial conditions.
The idea behind pruning algorithms is to train an oversized network and then remove the unnecessary parts. The excess degrees of freedom of the large initial network allow it to learn reasonably quickly with less sensitivity to initial conditions while the reduced complexity of the trimmed system favors improved generalization. A side benefit is that the small resulting networks have other advantages such as economy and speed of operation. It may also be easier to interpret the logic behind their decisions as the network has less opportunity to spread functions over many nodes. (Some work has been done using pruning to help extract rules from a trained net, e.g., [374].)
The following sections summarize some pruning algorithms for feedforward networks like the multilayer perceptron, but the idea can also be applied to other systems such as associative networks [381] or tree structures [335].
Example Figure 13.1 illustrates the benefit of pruning. Figure 13.1(a) shows the boundary formed by an intentionally overtrained 2/50/10/1 network (after about 1800 epochs of RProp training). There are 671 weights in the network, but only 31 data points, so the network is very underconstrained. Although the data are nearly linearly separable (with some overlap near the boundary), the classification boundary found by the network is very nonlinear and would probably not generalize well on additional data from the same function. Figure 13.1(b) shows the same network after pruning by a simple algorithm. 659 of the original 671 weights and 56 of the 64 original nodes are removed. The network is reduced to 2/2/2/1 with 12 weights and the boundary is much smoother. Fig. 13.1(c) shows the response after 500 more epochs of training. The tuning smooths the response further and rotates the decision surface so the second input is almost ignored. It would probably be pruned in additional iterations. The simple algorithm used here had no way to remove hidden layers. A more sophisticated method could reduce the network to a one-dimensional solution with just 2 weights. (This illustrates pruning as a method of feature selection. If certain inputs are irrelevant to the problem, the algorithm should remove their connections to the network.)
1. Substantial parts of this chapter were published as [308].