Overview
A rule of thumb for obtaining good generalization is to use
the smallest system that will fit the data. Unfortunately, it is usually not
obvious what size is best, so a common approach is to train a series of networks
of various sizes and choose the smallest one that will learn the data. The
problem is that this can be time-consuming if many networks must be trained
before an acceptable one is found. Even if the optimum size were known in
advance, the smallest network just complex enough to fit the data may be
sensitive to initial conditions and learning parameters. It may be hard to tell
whether the network is too small to learn the data, whether it is simply learning
very slowly, or whether it is stuck in a local minimum due to an unfortunate set
of initial conditions.
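As a concrete illustration of this search-by-trial approach, the sketch below trains a few candidate sizes on toy data and keeps the smallest network that reaches a training-accuracy target. It uses scikit-learn's MLPClassifier purely for convenience; the size grid, the 0.97 threshold, and the toy data are illustrative assumptions, not recommendations.

```python
# Minimal sketch of "train several sizes, keep the smallest that fits",
# using scikit-learn's MLPClassifier on toy two-class data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)   # nearly separable classes

best = None
for n_hidden in (1, 2, 4, 8, 16, 32):            # candidate sizes, smallest first
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        max_iter=2000, random_state=0).fit(X, y)
    if net.score(X, y) >= 0.97:                  # "learns the data" criterion (arbitrary)
        best = net
        break                                    # stop at the smallest adequate network
print(best.hidden_layer_sizes if best else "no candidate fit the data")
```

Each candidate is trained from scratch, which is exactly the cost the text warns about when many sizes must be tried.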
The idea behind pruning algorithms is to train an oversized
network and then remove the unnecessary parts. The excess degrees of freedom of
the large initial network allow it to learn reasonably quickly with less
sensitivity to initial conditions, while the reduced complexity of the trimmed
system favors improved generalization. A side benefit is that the resulting small
networks are economical and fast in operation. It may also be easier to interpret
the logic behind their decisions because the network has less opportunity to
spread functions over many nodes. (Some work has
been done using pruning to help extract rules from a trained net, e.g., [374].)
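The sketch below illustrates the train-then-prune idea, assuming simple magnitude pruning: weights whose absolute value falls below a threshold are removed, and hidden units left with no incoming or outgoing connections can then be dropped. This is only one of many possible criteria; the sections that follow describe more principled sensitivity- and penalty-based methods. The weight shapes and threshold are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(weight_matrices, threshold):
    """Zero out every weight with |w| < threshold; return the kept-weight masks."""
    masks = []
    for W in weight_matrices:
        mask = np.abs(W) >= threshold
        W *= mask                              # removed weights are frozen at zero
        masks.append(mask)
    return masks

def prunable_units(W_in, W_out):
    """Hidden units with no remaining incoming or outgoing weights.

    W_in has shape (n_prev, n_hidden); W_out has shape (n_hidden, n_next).
    """
    dead_in = ~np.any(W_in != 0, axis=0)       # all incoming weights removed
    dead_out = ~np.any(W_out != 0, axis=1)     # all outgoing weights removed
    return np.flatnonzero(dead_in | dead_out)

# Example with random weights for the first two layers of a 2/50/10/1 network.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 50)), rng.normal(size=(50, 10))
magnitude_prune([W1, W2], threshold=1.0)
print(prunable_units(W1, W2))                  # first-layer hidden units that can be dropped
```

In practice, pruning and retraining usually alternate, since removing weights perturbs the function computed by those that remain.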
The following sections summarize some pruning algorithms for
feedforward networks like the multilayer perceptron, but the idea can also be
applied to other systems such as associative networks [381] or tree structures [335].
Example
Figure
13.1 illustrates the benefit of pruning. Figure
13.1(a) shows the boundary formed by an intentionally overtrained 2/50/10/1
network (after about 1800 epochs of RProp training). There are 671 weights in
the network, but only 31 data points, so the network is very underconstrained.
Although the data are nearly linearly separable (with some overlap near the
boundary), the classification boundary found by the network is very nonlinear
and would probably not generalize well on additional data from the same
function. Figure
13.1(b) shows the same network after pruning by a simple algorithm, which removes
659 of the original 671 weights and 56 of the 64 original nodes. The
network is reduced to 2/2/2/1 with 12 weights and the boundary is much smoother.
Figure
13.1(c) shows the response after 500 more epochs of training. The tuning
smooths the response further and rotates the decision surface so the second
input is almost ignored. It would probably be pruned in additional iterations.
The simple algorithm used here had no way to remove hidden layers. A more
sophisticated method could reduce the network to a one-dimensional solution with
just 2 weights. (This illustrates pruning as a method of feature selection. If
certain inputs are irrelevant to the problem, the algorithm should remove their
connections to the network.)
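The weight count quoted above is consistent with a fully connected 2/50/10/1 network in which each unit's bias is counted as a weight (an assumption on our part about the counting convention); the short check below reproduces the 671 figure.

```python
# Parameter count of a fully connected 2/50/10/1 network, counting biases as weights.
layer_sizes = [2, 50, 10, 1]
connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))   # 2*50 + 50*10 + 10*1 = 610
biases = sum(layer_sizes[1:])                                            # 50 + 10 + 1 = 61
print(connections + biases)                                              # 671
```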
1. Substantial parts of this chapter were published as [308].