Chapter 13 - Pruning Algorithms

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

13.4 Other Methods

13.4.1 Interactive Pruning

Sietsma and Dow [345], [344] describe an interactive method in which the designer inspects a trained network and decides which nodes to remove. A network of linear threshold elements is considered. Several heuristics are used to identify noncontributing units:

  • If a unit has a constant output over all training patterns, then it is not participating in the solution and can be removed. (Thresholds of units in following layers may need to be adjusted to compensate for its constant output.)

  • If a number of units have highly correlated responses (e.g., identical or opposite), then they are redundant and can be combined into a single unit. All their output weights should be added together so the combined unit has the same effect on following units.

Units are unlikely to be exactly correlated (or have exactly constant output if sigmoid nodes are used), so application of the heuristics calls for some judgment.
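
The sketch below shows how the first two heuristics might be automated once the hidden-unit activations for all training patterns have been collected in a matrix. The function name and tolerance values are illustrative assumptions, not details from the original paper.

```python
import numpy as np

def find_prunable_units(H, const_tol=1e-3, corr_tol=0.98):
    """Apply the two first-stage heuristics to an activation matrix H of
    shape (n_patterns, n_units).  Tolerances are illustrative; as noted
    above, some judgment is needed in practice.

    Returns (constant_units, correlated_pairs), where each entry of
    correlated_pairs is (i, j, sign): unit j duplicates (+1) or mirrors (-1)
    unit i and can be merged into it.
    """
    n_units = H.shape[1]

    # Heuristic 1: units whose output is (nearly) constant over all patterns.
    constant_units = [j for j in range(n_units) if H[:, j].std() < const_tol]

    # Heuristic 2: pairs of non-constant units with highly correlated
    # (or anti-correlated) responses; one unit of each pair is redundant.
    active = [j for j in range(n_units) if j not in constant_units]
    correlated_pairs = []
    if len(active) >= 2:
        C = np.corrcoef(H[:, active], rowvar=False)
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                if abs(C[a, b]) > corr_tol:
                    correlated_pairs.append(
                        (active[a], active[b], int(np.sign(C[a, b]))))
    return constant_units, correlated_pairs
```

The removal itself follows the heuristics above: a constant unit's output, weighted by its outgoing weights, is folded into the biases of the following layer, and a redundant unit's outgoing weights are added (with the appropriate sign) to those of the unit it duplicates.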

A second stage of pruning removes nodes that are linearly independent of other nodes in the same layer, but which are not strictly necessary. The paper describes an example with four training patterns and a layer of three binary units. Two units are sufficient to encode the four binary patterns, so one of the three can be eliminated. It is possible for this to introduce linear inseparability (by requiring the following layer to do an XOR of the two units, for example), so a provision for adding hidden layers is included. This tends to convert short, wide networks to longer, narrower ones.

In a demonstration problem, the procedure was able to find relatively small networks that solved the test problem and generalized well. In comparison, randomly initialized networks of the same size (after pruning) were unable to learn the problem reliably. It was also observed that training with noise tends to produce networks which are harder to prune in the sense that fewer elements can be removed.

13.4.2 Local Bottlenecks

Kruschke [228], [230] describes a method in which hidden units "compete" to survive. The degree to which a unit participates in the function computed by the network is measured by the magnitude of its weight vector. This is treated as a separate parameter, the gain, and the weight vector is normalized to unit length. A unit with zero gain has a constant output; it contributes only a bias term to following units and doesn't back-propagate any error to preceding layers.

Units are redundant when their weight vectors are nearly parallel or antiparallel and they compete with others that have similar directions. The gains g are adjusted according to

$$\Delta g_i^s \;=\; -\,\gamma \sum_{j \neq i} g_j^s \,\bigl\langle \hat{w}_i^s, \hat{w}_j^s \bigr\rangle^2 \qquad (13.29)$$

$$g_i^{s+1} \;=\; g_i^s + \Delta g_i^s \qquad (13.30)$$

where γ is a small positive constant, $\hat{w}_i^s$ is the unit vector in the direction of $w_i^s$, ⟨·,·⟩ denotes the inner product, and the superscript s indexes the pattern presentations. If node i has weights parallel to those of node j, then the gain of each will decrease in proportion to the gain of the other, and the one with the smaller gain will be driven to zero faster. The gains are always positive, so this rule can only decrease them. (If equation 13.30 yields a negative gain, it is set to 0.) Once a gain becomes zero it remains zero, so the unit can be removed.

Gain competition is interleaved with back-propagation. Because back-propagation modifies the weights, the gains are updated and the weights renormalized after each training cycle.

This method effectively prunes nodes by driving their gains to zero. The parameter γ sets the relative importance of the gain competition and back-propagation. As usual, some tuning may be needed to balance error reduction and node removal. If γ is large, competition will dominate error reduction and too many nodes may be removed.
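
A minimal sketch of one gain-competition step for a single layer, following the form of equations 13.29 and 13.30 as given above, is shown below. The back-propagation pass and the weight renormalization interleaved with it are assumed to happen elsewhere; the names and the value of γ are illustrative.

```python
import numpy as np

def gain_competition_step(W, g, gamma=0.01):
    """One gain-competition update for a layer.

    W : (n_units, n_inputs) weight matrix; each row is one unit's weight
        vector (normalized here so that the gains carry the magnitudes).
    g : (n_units,) vector of gains.

    Returns the updated gains; units whose gain reaches zero stay at zero
    and can be removed.
    """
    # Unit-length weight vectors.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_hat = W / np.maximum(norms, 1e-12)

    # Squared cosines between every pair of weight vectors; parallel and
    # antiparallel vectors both give values near 1, so both compete.
    cos2 = (W_hat @ W_hat.T) ** 2
    np.fill_diagonal(cos2, 0.0)      # a unit does not compete with itself

    # Each gain decreases in proportion to the gains of units pointing in
    # similar (or opposite) directions, cf. equations 13.29 and 13.30.
    g_new = g - gamma * (cos2 @ g)
    return np.maximum(g_new, 0.0)    # negative gains are clipped to zero
```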

Sperduti and Starita [354] describe a similar pruning method in conjunction with the use of gain scaling for faster training. As with weight decay, gain decay allows the gains of unneeded nodes to decrease to zero; the node outputs become effectively constant, and the nodes can be removed after adjusting the biases of the nodes to which they send output weights.

13.4.3 Distributed Bottlenecks

Kruschke proposes another solution that puts constraints on the weights rather than pruning them [228], [229]. The network starts with a large hidden layer of random weights and the dimensionality of the weight matrix is reduced during training. The number of nodes and weights remains the same, but the dimension of the space spanned by the weight vectors is reduced so the network behaves somewhat like a smaller network. The dimensionality reduction has an effect similar to pruning, but preserves redundancy and fault tolerance.

The method operates by spreading apart vectors that are farther apart than average and bringing together vectors that are closer together than average. Let $d_{ij} = \|w_i - w_j\|$ be the distance between vectors $w_i$ and $w_j$. The process starts with H vectors with a mean of zero and an initial mean separation of D. At each step, the mean distance is

$$\bar{D} \;=\; \frac{2}{H(H-1)} \sum_{i<j} d_{ij} \qquad (13.31)$$

This calculation is nonlocal. The same paper describes a local method that works for the encoder problem but may not work for other problems.

After each back-propagation cycle, the weights are modified by

$$\Delta w_i \;=\; \beta \sum_{j \neq i} \bigl(d_{ij} - \bar{D}\bigr)\,\frac{w_i - w_j}{d_{ij}} \qquad (13.32)$$

If $d_{ij} > \bar{D}$, then $w_i$ is shifted away from $w_j$; if $d_{ij} < \bar{D}$, then $w_i$ is shifted toward $w_j$. The vectors are then recentered and renormalized so that their mean is again zero and their mean separation is D, the initial mean distance. This is equivalent to doing gradient descent of the error function on the constraint surface.

In equation 13.32, β is a small positive constant that controls the relative importance of back-propagation and dimensional compression. If β is too large, all the vectors collapse into two antiparallel bundles (a single dimension) and effectively act like one node.
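
A rough sketch of one compression step, applied after a back-propagation cycle as described, might look like the following; the array layout and the handling of coincident vectors are assumptions.

```python
import numpy as np

def bottleneck_step(W, D_init, beta=0.01):
    """One dimensional-compression step for a layer's weight vectors.

    W      : (H, n) matrix whose rows are the H weight vectors.
    D_init : D, the initial mean pairwise separation of the vectors.
    """
    H = W.shape[0]
    diff = W[:, None, :] - W[None, :, :]        # w_i - w_j for every pair
    d = np.linalg.norm(diff, axis=2)            # pairwise distances d_ij
    D_bar = d[np.triu_indices(H, k=1)].mean()   # current mean distance (13.31)

    # Shift w_i away from w_j when d_ij > D_bar and toward w_j when
    # d_ij < D_bar, cf. equation 13.32.
    safe_d = np.where(d > 0.0, d, 1.0)          # avoid division by zero
    coef = (d - D_bar) / safe_d
    np.fill_diagonal(coef, 0.0)
    W = W + beta * np.einsum('ij,ijk->ik', coef, diff)

    # Recenter and rescale so the mean is zero and the mean separation
    # is again D_init.
    W = W - W.mean(axis=0)
    d_new = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    return W * (D_init / d_new[np.triu_indices(H, k=1)].mean())
```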

13.4.4 Principal Components Pruning

A similar idea is considered by Levin, Leen, and Moody [242]. Rather than actually eliminating links, the effective number of parameters is reduced by reducing the rank of the weight matrix for each layer.

The procedure starts with a trained network. For each layer starting with the first and proceeding to the output:

  1. Calculate the correlation matrix Σ for the input vector to the layer

    $$\Sigma \;=\; \frac{1}{K} \sum_{k=1}^{K} u(k)\,u(k)^T \qquad (13.33)$$

    where u(k) is the column vector of outputs of the previous layer for pattern k and K is the number of training patterns.

  2. Diagonalize $\Sigma = C^T D C$ to obtain C and D, the matrices of eigenvectors and eigenvalues of Σ. Rank the principal components in decreasing order of eigenvalue.

  3. Use a validation set to determine the effect of removing an eigennode. That is, set the least nonzero eigenvalue to zero and reevaluate the error. Keep the deletions that do not increase the validation error.

  4. Project the weights onto the l-dimensional subspace spanned by the remaining eigenvectors,

    $$W \;\leftarrow\; W\,C_l C_l^T$$

    where $C_l$ contains the first l columns of C and the columns of C are the eigenvectors of the correlation matrix.

  5. Continue with the next layer.

Advantages of the algorithm are that it is relatively easy to implement and reasonably fast. The dimensions of Σ are determined by the number of nodes in a single layer rather than by the number of nodes or weights in the entire network, so the matrix sizes may be more manageable. Retraining after pruning is not necessary.
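
A sketch of steps 1 through 4 for a single layer is given below. The weight-matrix convention (the layer computes y = Wu) and the validation-error callable are assumptions made for illustration, not part of the published algorithm.

```python
import numpy as np

def pcp_layer(U, W, val_error):
    """Principal components pruning of one layer.

    U         : (K, n) matrix whose rows are the layer's input vectors u(k)
                over the K training patterns.
    W         : (m, n) weight matrix of the layer, assuming y = W u.
    val_error : callable mapping a candidate weight matrix for this layer to
                the network's error on a validation set.
    """
    # Step 1: correlation matrix of the layer's inputs (equation 13.33).
    Sigma = U.T @ U / U.shape[0]

    # Step 2: eigenvectors (columns of C), ranked by decreasing eigenvalue.
    eigvals, C = np.linalg.eigh(Sigma)
    C = C[:, np.argsort(eigvals)[::-1]]

    # Step 3: drop trailing eigennodes while validation error does not rise.
    best_err = val_error(W)
    l = C.shape[1]
    while l > 1:
        C_l = C[:, :l - 1]
        W_proj = W @ C_l @ C_l.T        # Step 4: project onto the subspace
        err = val_error(W_proj)
        if err > best_err:
            break
        W, best_err, l = W_proj, err, l - 1
    return W
```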

This procedure falls between OBD (optimal brain damage) and OBS (optimal brain surgeon) in its use of Hessian information. OBS uses a linearized approximation of the full Hessian matrix, which improves accuracy but makes it impractical for large networks. OBD uses a diagonal approximation of the Hessian that is fast but inaccurate; errors may be large and subsequent retraining may be necessary. This method effectively uses a linear block-diagonal approximation of the Hessian and has a computational cost intermediate between OBD and OBS.

13.4.5 Pruning by the Genetic Algorithm

In a different approach, Whitley and Bogart [397] describe the use of the genetic algorithm to prune a trained network. Each population member represents a pruned version of the original network. A binary representation can be used, with each bit set to 0 or 1 to indicate whether the corresponding weight of the reference network is pruned or retained. After mating, the offspring (probably) represent differently pruned networks. They are retrained for a small number of cycles to allow them to repair any damage that may have occurred. As a reward for using fewer weights, heavily pruned networks are given more training cycles than lightly pruned networks. The networks are then evaluated on the error achieved after retraining. This favors small networks, but not those that reduce size at the cost of increased error.
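
A sketch of the fitness evaluation for one population member might look like the following. The genome layout (one bit per weight of the reference network), the retraining budget, and the helper callables are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def evaluate_individual(mask, W0, retrain, error, base_epochs=20):
    """Evaluate one pruned variant of the reference network.

    mask    : boolean vector, one bit per weight; False means "pruned".
    W0      : flattened weights of the trained reference network.
    retrain : callable(weights, mask, epochs) -> retrained weights, updating
              only the unpruned positions.
    error   : callable(weights) -> error of the network on the training data.
    """
    # Start retraining from the original (unpruned) weights, zeroing the
    # pruned connections.
    W = np.where(mask, W0, 0.0)

    # Reward heavy pruning with a larger retraining budget.
    frac_pruned = 1.0 - mask.mean()
    epochs = int(base_epochs * (1.0 + frac_pruned))

    W = retrain(W, mask, epochs)   # repair any damage caused by pruning
    return error(W)                # fitness: error achieved after retraining
```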

As described, each pruned net begins retraining with weights from the original unpruned network. They suggest that it might be better to inherit the weights from the parents so that more drastically pruned networks don't have to adapt to such a large step in a single generation.

They also allow direct short-cut connections from input to output (something perfectly valid for back-propagation, but sometimes not considered by experimenters) and suggest that this speeds up learning and makes the network less likely to become trapped in local minima. This also allows removal of unnecessary hidden layers and could be applied to most of the other methods described. The simple example in figure 13.1, for instance, would benefit from this.