
Chapter 16 - Heuristics for Improving Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

16.5 Weight Decay

One way to implement a bias for simple or smooth functions is to favor networks with small weights over those with large weights. Large weights tend to cause sharp transitions in the node functions and thus large changes in output for small changes in the inputs. A simple way to obtain some of the benefits of pruning without complicating the learning algorithm much is to add a decay term like -βw to the weight update rule. Weights that are not essential to the solution decay to zero and can be removed. Even if they aren't removed, they have no effect on the output so the network acts like a smaller system. Weight decay rules have been used in many studies, for example, [299], [388], [387], [227]. Several methods are compared by Hergert, Finnoff, and Zimmermann [165].
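
As a concrete sketch of this update rule (the function and parameter names, such as sgd_step_with_decay, lr, and beta, are illustrative and not from the text), the decay term simply shrinks every weight toward zero by a small fraction on each step:

```python
import numpy as np

def sgd_step_with_decay(w, grad, lr=0.01, beta=1e-4):
    """One gradient-descent step with a -beta*w decay term added to the update.

    w    : current weight vector
    grad : gradient of the error with respect to w
    lr   : learning rate
    beta : decay coefficient; weights that receive no error gradient
           shrink geometrically toward zero
    """
    return w - lr * grad - beta * w

# Toy usage: weights that are not needed to reduce the error decay away.
w = np.array([2.0, -1.5, 0.5])
for _ in range(5000):
    w = sgd_step_with_decay(w, grad=np.zeros_like(w), beta=1e-3)
print(w)  # all components are now close to zero
```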

Weight decay can be considered a form of regularization (e.g., [227]). Adding a β Σᵢ wᵢ² regularizing term to the cost function, for example, is equivalent to adding a -βwᵢ decay term to the weight update rule. A drawback of the Σᵢ wᵢ² penalty term is that it tends to favor weight vectors with many small components over those with a few large components, even when a few large weights would be the more effective choice. An alternative [386], [387], [388] is

$$\lambda \sum_i \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2} \qquad (16.4)$$

When w₀ is large relative to the weights, this is similar to simple weight decay. For |wᵢ| ≪ w₀, the cost is small but grows like wᵢ² while, for |wᵢ| ≫ w₀, the cost of a weight saturates and approaches the constant λ. (The developers call this form 'weight elimination' to differentiate it from simple weight decay.)
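
A minimal sketch of the penalty in equation 16.4 and its gradient contribution, assuming the reconstruction above (the function names and the parameters lam and w0 are mine, not from the text):

```python
import numpy as np

def weight_elimination_penalty(w, lam=1e-3, w0=1.0):
    """Penalty of eq. 16.4: lam * sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2).

    Weights with |w| << w0 are charged roughly quadratically (like weight
    decay); weights with |w| >> w0 each cost about lam, so a few large
    weights are tolerated.
    """
    r2 = (w / w0) ** 2
    return lam * np.sum(r2 / (1.0 + r2))

def weight_elimination_grad(w, lam=1e-3, w0=1.0):
    """Derivative of the penalty with respect to each weight; this is the
    term added to the error gradient during training."""
    r2 = (w / w0) ** 2
    return lam * (2.0 * w / w0 ** 2) / (1.0 + r2) ** 2

w = np.array([0.05, 0.5, 5.0])
print(weight_elimination_penalty(w))  # the large weight contributes roughly lam
print(weight_elimination_grad(w))     # the pull toward zero is strongest for mid-sized weights
```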

Soft weight sharing [286], [285] is another method that allows large weights when they are needed by using a penalty term that models the prior likelihood of the weights as a mixture of Gaussians. In practice, a number of Gaussians are used and their centers and widths are adapted to minimize the cost function. This reduces the complexity of the network by increasing the correlation among weight values.
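
A sketch of the kind of penalty soft weight sharing uses: the negative log-likelihood of the weights under a mixture of Gaussians. In the actual method the centers, widths, and mixing proportions are adapted along with the weights; here they are fixed constants purely for illustration, and all names are mine:

```python
import numpy as np

def soft_sharing_penalty(w, mix=(0.5, 0.5), means=(0.0, 1.0), sigmas=(0.2, 0.2)):
    """Negative log-likelihood of the weights under a mixture of Gaussians.

    Weights near one of the mixture centers are cheap; weights far from
    every center are expensive, so weight values tend to cluster and the
    effective number of distinct parameters shrinks.
    """
    w = np.asarray(w, dtype=float)[:, None]                 # (n_weights, 1)
    mix, means, sigmas = (np.asarray(a, dtype=float) for a in (mix, means, sigmas))
    dens = mix / (np.sqrt(2 * np.pi) * sigmas) * np.exp(
        -(w - means) ** 2 / (2 * sigmas ** 2))              # (n_weights, n_components)
    return -np.sum(np.log(dens.sum(axis=1)))

print(soft_sharing_penalty([0.02, 0.98, 1.05]))  # clustered near the centers: low cost
print(soft_sharing_penalty([0.5, -1.3, 2.4]))    # scattered weights: much higher cost
```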

Hard weight sharing is commonly used in image processing networks where the same kernel is applied repeatedly at different positions in the input image. In a neural network, separate hidden nodes may be used to compute the kernel at different locations, and the number of weights could be huge. Constraining nodes that compute the same kernel to have the same weights greatly reduces the network complexity [91].
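
A sketch of what hard weight sharing amounts to in this image-processing setting, written as a plain 'valid' correlation (names and sizes are illustrative): one small kernel is reused at every position, so the number of free parameters is the kernel size rather than a separate weight set per hidden node.

```python
import numpy as np

def shared_kernel_response(image, kernel):
    """Apply one shared kernel at every valid position of the image.

    Without sharing, each of the out_h * out_w hidden nodes would have its
    own kernel.size weights; with sharing, this layer has only kernel.size
    free parameters.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)                        # 9 shared weights
print(shared_kernel_response(image, kernel).shape)   # (14, 14): 196 nodes, still only 9 weights
```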

Example Figure 16.1 illustrates the effects of weight decay. A 2/50/10/1 network was trained on 31 points using normal batch back-propagation (learning rate 0.01, momentum and weight decay 0). The network is very underconstrained. After 200 epochs, the weight decay was set to 1E-4 and training resumed for a total of 5000 epochs. Unlike figure 14.5, the decision surface is simple and smooth and does not show obvious signs of overtraining in spite of the long training time; the response is basically that of a single sigmoid unit.

Figure 16.1: Effect of training with weight decay. A 2/50/10/1 network was trained using normal back-propagation for 200 epochs. Then weight decay was set to 1E-4 and training resumed for a total of 5000 epochs. Unlike figure 14.5, the decision surface is very simple and does not show obvious signs of overtraining.

Figure 16.2 shows another example. Figure 16.2(a) shows the response of a network trained by normal batch back-propagation (learning rate 0.03, momentum and weight decay 0) until all patterns were correctly classified (error less than 0.1) at about 11,000 epochs. The network is underconstrained and the boundary is complex with steep transitions. Another net was trained with the same initial weights and learning rate but with weight decay increasing from 0 to 1E-5 at 1200 epochs, to 1E-4 at 2500 epochs, and to 1E-3 at 4000 epochs after which it was held constant. Figure 16.2(b) shows the response after 20,000 epochs. The surface is smoother and transitions are more gradual, but it could be argued that the data are still somewhat overfitted. Figure 16.2(c) shows the response after the learning rate was reduced to 0.01 and training resumed for another 1000 epochs. Further smoothing occurs because of the shift in balance between error minimization and weight decay.
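
Reading that schedule as stepped increases at the quoted epochs, a minimal sketch of the decay schedule (the helper name is mine) would be:

```python
def weight_decay_at(epoch):
    """Stepped schedule from the second example: 0 until epoch 1200,
    then 1e-5, 1e-4 from epoch 2500, and 1e-3 from epoch 4000 onward."""
    if epoch < 1200:
        return 0.0
    if epoch < 2500:
        return 1e-5
    if epoch < 4000:
        return 1e-4
    return 1e-3

print([weight_decay_at(e) for e in (0, 1500, 3000, 10000)])
# [0.0, 1e-05, 0.0001, 0.001]
```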

In addition to showing the smoothing effects of weight decay, these examples show that the results may be hard to predict a priori. As in other regularization or penalty-term methods, there is a complex interaction between error minimization and constraint satisfaction. The particular value of the weight decay parameter (or regularization parameter in general) determines where equilibria occur, but it is difficult to predict ahead of time what value is needed to achieve desired results. The value 0.001 was chosen rather arbitrarily because it is a typically cited round number, but figure 16.2(b) is still perhaps somewhat overfitted.

Figure 16.2: Effects of weight decay: (a) response of a 2/50/10/1 network trained by batch back-propagation until all patterns were correctly classified at about 11,000 epochs; (b) response after 20,000 epochs of a network trained from the same starting point with weight decay increasing to 0.001 at 4000 epochs; and (c) response of the network in (b) after 1000 more epochs with the learning rate decreased to 0.01.