### Chapter 13 - Pruning Algorithms

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II

## 13.2 Sensitivity Calculation Methods

### 13.2.1 Sensitivity Calculations I (Skeletonization)

Mozer and Smolensky [275] describe a method that estimates which units are least important and deletes them during training. A measure of the relevance, ρ, of a unit is the error when the unit is removed minus the error when it is left in place. Instead of calculating this directly for each and every unit, ρ is approximated by introducing a gating term α in the node function

$$o_i = f\Bigl(\sum_j w_{ij}\,\alpha_j\,o_j\Bigr) \tag{13.1}$$

where oj is the activity of unit j, wij is the weight to unit i from unit j, and f is the node nonlinearity. If αj = 0, unit j has no influence on the network and can be removed; if αj = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

$$\rho_i = -\left.\frac{\partial E}{\partial \alpha_i}\right|_{\alpha_i = 1} \tag{13.2}$$

which can be computed by back-propagation. This is evaluated at α = 1 so α is merely a notational convenience rather than a parameter that must be implemented in the net. When ρi falls below a certain threshold, the unit can be deleted.
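As a concrete sketch, the relevance estimate can be checked numerically on a toy network. Everything below (architecture, weights, data) is invented for illustration, and a finite difference on the gate α stands in for the back-propagated derivative of equation 13.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-3-1 network with random weights; all sizes and data here are
# invented for illustration only.
W1 = rng.normal(size=(3, 2))              # input-to-hidden weights
W2 = rng.normal(size=(1, 3))              # hidden-to-output weights
X = rng.normal(size=(8, 2))               # 8 input patterns
T = np.sin(X.sum(axis=1, keepdims=True))  # arbitrary targets

def forward(alpha):
    """Forward pass with a gate alpha[j] multiplying hidden unit j's output."""
    h = np.tanh(X @ W1.T)                 # hidden activities o_j
    return np.tanh((alpha * h) @ W2.T)    # output of the gated network

def E(alpha):
    """Sum of squared errors as a function of the gates."""
    return 0.5 * np.sum((T - forward(alpha)) ** 2)

ones = np.ones(3)
eps = 1e-5
for j in range(3):
    # Direct relevance: error with unit j removed minus error with it present.
    a0 = ones.copy(); a0[j] = 0.0
    rho_direct = E(a0) - E(ones)
    # Approximation: rho_j ~ -dE/dalpha_j evaluated at alpha = 1; a finite
    # difference stands in for the back-propagated derivative.
    ap = ones.copy(); ap[j] += eps
    rho_approx = -(E(ap) - E(ones)) / eps
    print(f"unit {j}: direct = {rho_direct:+.4f}, derivative estimate = {rho_approx:+.4f}")
```

The two columns agree only to the extent that the error is locally linear in α, which is exactly the limitation of the derivative approximation.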

The usual sum of squared errors is used for training. To measure relevance, however, the linear error function

$$\hat{E} = \sum_p \sum_k \left| t_{pk} - o_{pk} \right| \tag{13.3}$$

is used rather than the sum of squared errors because it provides a better estimate of relevance when the error is small. Exponential averaging is used to suppress fluctuations:

$$\hat{\rho}_i(t+1) = 0.8\,\hat{\rho}_i(t) + 0.2\,\frac{\partial \hat{E}(t)}{\partial \alpha_i} \tag{13.4}$$

Segee and Carter [338] study the effect of this pruning method on the fault tolerance of the system. Interestingly, they found that the pruned system is not significantly more sensitive to damage even though it has fewer parameters. When the increase in error is plotted as a function of the magnitude of a weight deleted by a fault, the plots for the pruned and unpruned networks are essentially the same. They report that the variance of the weights into a node is a good predictor of the node's relevance and that the relevance of a node is a good predictor of the increase in RMS error expected when the node's largest weight is deleted.

### 13.2.2 Sensitivity Calculations II

Karnin [207] measures the sensitivity of the error function with respect to the removal of each connection and prunes the weights with low sensitivity. The sensitivity of weight wij is given as

$$S_{ij} = -\frac{E(w^f_{ij}) - E(0)}{w^f_{ij} - 0}\; w^f_{ij} \tag{13.5}$$

where $w^f_{ij}$ is the final value of the weight after training, 0 is its value upon removal, and $E(0)$ is the error when it is removed.

Rather than actually removing each weight and calculating E(0) directly, Karnin approximates S by monitoring the sum of all the changes experienced by the weight during training.

The estimated sensitivity is

$$\hat{S}_{ij} = -\sum_{n=0}^{N-1} \frac{\partial E}{\partial w_{ij}}\,\Delta w_{ij}(n)\,\frac{w^f_{ij}}{w^f_{ij} - w^i_{ij}} \tag{13.6}$$

where N is the number of training epochs and $w^i_{ij}$ is the initial weight. All of these terms are available during training, so the expression is easy to calculate and does away with the need for a separate sensitivity-calculation phase. When $\Delta w_{ij}$ is computed by simple gradient-descent back-propagation, this becomes

$$\hat{S}_{ij} = \sum_{n=0}^{N-1} \left[\Delta w_{ij}(n)\right]^2 \frac{w^f_{ij}}{\eta\,\bigl(w^f_{ij} - w^i_{ij}\bigr)} \tag{13.7}$$

where $\eta$ is the learning rate.

If momentum is used, the general expression in equation 13.6 should be used.

After training, each weight has an estimated sensitivity and the lowest sensitivity weights can be deleted. Of course, if all output connections from a node are deleted, the node itself can be removed. If all input weights to a node are deleted, the node output will be constant so the node can be deleted after adjusting for its effect on the bias of following nodes.
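The bookkeeping is simple enough to sketch on a hypothetical linear model (the data, sizes, and learning rate below are invented for illustration). The squared weight updates are accumulated during plain batch gradient descent and scaled afterward as in equation 13.7:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear model trained by batch gradient descent; the data,
# learning rate, and sizes are invented for illustration.
X = rng.normal(size=(32, 4))
w_true = np.array([2.0, -1.0, 0.0, 0.5])      # third input is irrelevant
y = X @ w_true + 0.01 * rng.normal(size=32)

w = 0.1 * rng.normal(size=4)
w_init = w.copy()                             # initial weights w_i
eta = 0.05                                    # learning rate
S = np.zeros(4)                               # accumulated squared updates

for epoch in range(500):
    grad = X.T @ (X @ w - y) / len(X)         # dE/dw for squared error
    dw = -eta * grad                          # plain gradient-descent step
    S += dw ** 2                              # running sum of [dw(n)]^2
    w += dw

# Scale the accumulated terms by w_f / (eta * (w_f - w_i)) to get the
# estimated sensitivities of equation 13.7.
S_hat = S * w / (eta * (w - w_init))
print("estimated sensitivities:", S_hat)
```

On this model, the weight attached to the irrelevant input ends up near zero and receives a far smaller estimated sensitivity than the others, so it would be the first candidate for deletion.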

### 13.2.3 Sensitivity Calculations III (Optimal Brain Damage)

Le Cun, Denker, and Solla [91] describe the optimal brain damage (OBD) method in which the saliency of a weight is measured by estimating the second derivative of the error with respect to the weight. They also reduce network complexity significantly by constraining groups of weights to be equal.

When the weight vector w is perturbed by an amount δw, the change in error is approximately

$$\delta E = \sum_i g_i\,\delta w_i + \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta w_i\,\delta w_j + O\bigl(\|\delta \mathbf{w}\|^3\bigr) \tag{13.8}$$

where δwi is the ith component of δw, gi is component i of the gradient of E with respect to w, and the $h_{ij} = \partial^2 E / \partial w_i \partial w_j$ are elements of the Hessian matrix H.

Because pruning is done on a well-trained network, the gradient is nearly zero and the first term in equation 13.8 can be ignored. When the perturbations are small, the last term will also be negligible. Because H may be a very large matrix, they make the simplifying assumption that the off-diagonal terms are zero. This leaves

$$\delta E \approx \frac{1}{2}\sum_k h_{kk}\,\delta w_k^2 \tag{13.9}$$

The second derivatives hkk can be calculated by a modified back-propagation rule in about the same amount of time as one back-propagation epoch. The saliency of weight wk is then

$$s_k = \frac{h_{kk}\,w_k^2}{2} \tag{13.10}$$

Pruning is done iteratively, that is, train to a reasonable error level, compute saliencies, delete low saliency weights, and resume training. Application to pruning of tapped-delay networks is described in [365], in which the decision to stop pruning is based on the estimated generalization ability as determined by a modified AIC.
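The diagonal approximation can be exercised on a small quadratic problem where it happens to be exact. The sketch below uses an invented linear least-squares model (not from the text): it computes the saliencies of equation 13.10 and compares each to the actual increase in error when that weight alone is set to zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear least-squares model standing in for a trained network; on this
# quadratic error surface the diagonal-Hessian saliency is exact when a
# single weight is deleted, so the estimate can be checked directly.
# All sizes and values are invented for illustration.
X = rng.normal(size=(64, 5))
w_true = np.array([1.5, -2.0, 0.05, 0.8, 0.0])
y = X @ w_true

w = np.linalg.lstsq(X, y, rcond=None)[0]       # "well-trained" weights

def error(w):
    r = X @ w - y
    return 0.5 * r @ r

H = X.T @ X                                    # Hessian of the squared error
saliency = np.diag(H) * w ** 2 / 2.0           # s_k = h_kk w_k^2 / 2

# Compare predicted saliency with the actual increase in error on deletion.
for k in np.argsort(saliency):
    w_cut = w.copy(); w_cut[k] = 0.0
    actual = error(w_cut) - error(w)
    print(f"w[{k}]: saliency = {saliency[k]:10.4f}, actual dE = {actual:10.4f}")
```

On a general network the off-diagonal Hessian terms and higher-order curvature make the saliency only an estimate; here the agreement is exact because deleting one weight perturbs a single coordinate of a purely quadratic error.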

### 13.2.4 Sensitivity Calculations IV (Optimal Brain Surgeon)

For simplicity, the optimal brain damage method (section 13.2.3) assumes the Hessian is diagonal, but this is rarely true. In the optimal brain surgeon (OBS) method, Hassibi and Stork [157], [159], [158] use a linear approximation of the full Hessian to obtain better estimates of the saliencies.

When the weight vector w is perturbed by an amount δw, the change in the error is approximately

$$\delta E = \left(\frac{\partial E}{\partial \mathbf{w}}\right)^{\!T}\delta\mathbf{w} + \frac{1}{2}\,\delta\mathbf{w}^T \mathbf{H}\,\delta\mathbf{w} + O\bigl(\|\delta\mathbf{w}\|^3\bigr) \tag{13.11}$$

where $\mathbf{H} = \partial^2 E / \partial \mathbf{w}^2$ is the Hessian matrix of second derivatives. Again, the first term is assumed to be near zero since E is at a minimum.

For each weight wq that could be set to zero, there is an associated adjustment of the remaining weights needed to re-minimize the error. The algorithm combines these steps and seeks the weight adjustment vector δw that sets one weight wq to zero and causes the least increase in the error. The constraint that δw sets element wq to zero is expressed as $e_q^T\,\delta\mathbf{w} + w_q = 0$, where $e_q$ is the unit vector corresponding to weight wq. The goal is to solve

$$\min_q\;\left\{\,\min_{\delta\mathbf{w}}\;\tfrac{1}{2}\,\delta\mathbf{w}^T \mathbf{H}\,\delta\mathbf{w} \;\;\middle|\;\; e_q^T\,\delta\mathbf{w} + w_q = 0 \,\right\} \tag{13.12}$$

Using the Lagrangian

$$L = \tfrac{1}{2}\,\delta\mathbf{w}^T \mathbf{H}\,\delta\mathbf{w} + \lambda\,\bigl(e_q^T\,\delta\mathbf{w} + w_q\bigr) \tag{13.13}$$

the optimal adjustment vector and resulting change in error are

$$\delta\mathbf{w} = -\frac{w_q}{\bigl[\mathbf{H}^{-1}\bigr]_{qq}}\,\mathbf{H}^{-1} e_q \tag{13.14}$$

and

$$L_q = \frac{1}{2}\,\frac{w_q^2}{\bigl[\mathbf{H}^{-1}\bigr]_{qq}} \tag{13.15}$$

Starting with a trained network, the procedure is: calculate $\mathbf{H}^{-1}$ and find the index q with the smallest saliency $L_q = w_q^2/(2[\mathbf{H}^{-1}]_{qq})$; if the change in error, $L_q$, is acceptable, then delete weight wq and adjust the remaining weights with δw (equation 13.14). Recalculate $\mathbf{H}^{-1}$ and repeat. Stop when $L_q$ becomes too large and no more weights can be deleted without causing an unacceptable increase in error. Additional training may allow more weights to be deleted later.
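One OBS pruning step can be verified on a synthetic quadratic error surface, where the second-order analysis is exact. The Hessian and minimizer below are invented for illustration: the sketch picks the weight with the smallest saliency, applies the adjustment of equation 13.14, and checks the resulting error increase against the prediction of equation 13.15.

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadratic error surface E(w) = 1/2 (w - w*)^T H (w - w*), a stand-in for
# a trained network near its minimum; H and w* are invented for illustration.
A = rng.normal(size=(6, 6))
H = A @ A.T + 0.1 * np.eye(6)            # positive-definite Hessian
w_star = rng.normal(size=6)              # location of the error minimum
w = w_star.copy()                        # weights of the "trained" network

def error(w):
    d = w - w_star
    return 0.5 * d @ H @ d

H_inv = np.linalg.inv(H)

# Saliency L_q = w_q^2 / (2 [H^-1]_qq) for every candidate weight.
L = w ** 2 / (2.0 * np.diag(H_inv))
q = int(np.argmin(L))

# Optimal adjustment: zero out w_q and re-minimize over the other weights.
e_q = np.zeros(6); e_q[q] = 1.0
dw = -(w[q] / H_inv[q, q]) * (H_inv @ e_q)
w_new = w + dw

print(f"pruned weight {q}: new w_q = {w_new[q]:.2e}, "
      f"actual dE = {error(w_new):.6f}, predicted L_q = {L[q]:.6f}")
```

Because the surface is exactly quadratic, the observed error increase matches the saliency prediction; on a real network the match is only approximate, for the reasons discussed below.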

The advantage of this method is that the remaining weights are readjusted automatically, without the need for incremental retraining, so the error introduced by eliminating the weight is generally lower. Unlike most other sensitivity methods, this method can account for correlated elements. Possible drawbacks are the time and memory needed to calculate and store $\mathbf{H}^{-1}$. Although the matrix inversion can be avoided by recursive calculations (using the assumption that the residual network error is near zero), the matrix may be large; $O(W^2)$ elements are needed for a network with W weights. The network of figure 14.5, for example, with 671 weights, would require about 450,000 elements. In practice, some approximation of the Hessian is needed.

Some of the storage and computational objections are addressed in [158]. The recursive procedure for generating $\mathbf{H}^{-1}$ is generalized to any twice-differentiable error function, and a dominant-eigenspace decomposition of the inverse Hessian is described.

One possible objection to the optimal brain damage and optimal brain surgeon methods is that they are based on a Taylor series approximation, which is reasonable only for small perturbations. Complete removal of a weight is not necessarily a small perturbation, so the calculated saliency may be a poor predictor of the actual change in error. It is easy to generate plots which show that the error surface is usually not well-approximated by a quadratic at large scales.

A minor related problem is that the gradient is assumed to be zero because the network is at a minimum, but this is true only if the network is trained to complete convergence. With algorithmic variations such as early stopping and weight clipping, this may not be true. (Weight clipping puts limits on the maximum allowed weight magnitudes.) Many of the networks described in this text were trained until certain error levels were reached (e.g., all outputs correct to within a tolerance band) and exact disappearance of the gradient was not achieved. This is a minor point, however, because it is easy to include the gradient term of equation 13.8 in the saliency calculation. Another practical consequence of lack of convergence to a minimum is that the Hessian may not be positive definite.

These problems do not appear to be too serious for initial pruning of an underconstrained network. Figure 13.2 shows a plot of the actual change in error observed when a weight was deleted vs. the 'optimal brain damage' saliencies for a 10/3/10 network trained on the simple autoencoder problem (10 1-of-10 patterns). For this network, the small saliencies seem to be reasonably good predictors of the change in error. A similar pattern is observed on the 2/50/10/1 networks used elsewhere in these notes. After the network is partially pruned, however, the correspondence between saliency and change in error is not as good (even after retraining and recalculation of the saliencies), and it is possible to find small-saliency weights that produce much larger increases in error than larger-saliency weights (e.g., saliencies greater than about 0.1 in the figure).