Chapter 17 - Effects of Training with Noisy Inputs


17.5 Remarks

Training with jitter, error regularization, gain scaling, and weight decay are all methods that have been proposed to improve generalization. Training with small amounts of jitter approaches the generalization problem directly by assuming that slightly different inputs give approximately the same output. If the noise distribution is smooth, the network will interpolate among training points in proportion to a smooth function of the distance to each training point.
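To make the jitter idea concrete, the sketch below (not code from the book) adds fresh zero-mean Gaussian noise to the inputs on every training pass of a toy one-hidden-layer sigmoid network trained by gradient descent on squared error. The noise level `sigma`, the network size, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """One-hidden-layer network: y = sigmoid(X W1 + b1) w2 + b2."""
    W1, b1, w2, b2 = params
    H = sigmoid(X @ W1 + b1)
    return H @ w2 + b2, H

def train_with_jitter(X, t, sigma=0.1, hidden=10, lr=0.05, epochs=2000):
    """Gradient descent on squared error, with fresh Gaussian jitter added
    to the inputs on every pass (sigma = 0 recovers ordinary training)."""
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, hidden)
    b2 = 0.0
    for _ in range(epochs):
        Xj = X + rng.normal(0.0, sigma, X.shape)   # jittered inputs
        y, H = forward((W1, b1, w2, b2), Xj)
        err = y - t                                # residuals, shape (n,)
        dH = np.outer(err, w2) * H * (1.0 - H)     # backprop through sigmoid
        W1 -= lr * (Xj.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
        w2 -= lr * (H.T @ err) / n
        b2 -= lr * err.mean()
    return W1, b1, w2, b2
```

Because a new noise sample is drawn at each pass, repeated presentations of the same training point effectively sample a small neighborhood around it, which is what produces the interpolating behavior described above.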

With jitter, the effective target function is a smoothed version of the discrete training set. If the training set describes the target function well, the effective target approximates a smoothed version of the actual target function. The result is similar to training with a regularized objective function that favors smooth functions, with the noise variance playing the role of the regularization parameter. Where regularization works by modifying the objective function, training with jitter achieves the same effect by modifying the training data. In hindsight, it is not surprising that training with noisy data approximates regularization; obtaining smooth solutions from limited, noisy data is exactly the sort of problem regularization was developed to address.
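One way to make the correspondence concrete is the usual second-order expansion of the squared error under small, zero-mean input noise. Here f is the network mapping, x an input, t its target, and n the noise, assumed to have covariance σ²I; a term involving second derivatives of f, usually small near a good fit, is dropped. This is a sketch under those assumptions:

```latex
\mathbb{E}_{\mathbf{n}}\!\left[\bigl(f(\mathbf{x}+\mathbf{n}) - t\bigr)^{2}\right]
\;\approx\;
\bigl(f(\mathbf{x}) - t\bigr)^{2}
\;+\;
\sigma^{2}\,\bigl\lVert \nabla f(\mathbf{x}) \bigr\rVert^{2}
```

The first term is the ordinary squared error; the second is a Tikhonov-style penalty on the slope of the network function, with the noise variance σ² acting as the regularization parameter.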

Although large networks generally learn rapidly, they tend to generalize poorly because of insufficient constraints. Training with jitter helps to prevent overfitting by providing additional constraints. The effective target function is a continuous function defined over the entire input space, whereas the original target function may be defined only at the specific training points. This constrains the network and forces it to use any excess degrees of freedom to approximate the smoothed target function rather than forming an arbitrarily complex boundary that merely happens to fit the original training data (memorization). Even though the network may be large, it models a simpler system.

The expected effect of jitter can be calculated efficiently in some cases by a simple scaling of the node gains. This suggests the possibility of a post-training step that chooses optimal gains by cross-validation on a held-out test set. This might make it possible to improve the generalization of large networks while retaining the advantage of fast learning.
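A rough sketch of such a post-training step is given below, reusing a trained network with the (W1, b1, w2, b2) layout of the earlier sketch. The grid of gain factors and the hold-out evaluation are illustrative assumptions, not a procedure from the chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_with_gain(params, X, beta):
    """Evaluate a trained (W1, b1, w2, b2) network with every hidden-node
    gain scaled by beta, i.e. sigmoid(beta * net) instead of sigmoid(net)."""
    W1, b1, w2, b2 = params
    H = sigmoid(beta * (X @ W1 + b1))
    return H @ w2 + b2

def choose_gain(params, X_val, t_val, betas=np.linspace(0.2, 1.5, 27)):
    """Post-training step: pick the gain factor with the lowest error
    on held-out data."""
    errs = [np.mean((predict_with_gain(params, X_val, b) - t_val) ** 2)
            for b in betas]
    return betas[int(np.argmin(errs))]
```

Because only the gains are varied, the search is cheap: no retraining is needed, which is what preserves the fast-learning advantage of the large network.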

The problem of choosing an appropriate noise variance has not been addressed here. Holmström and Koistinen suggest several methods based on cross-validation. Considerable research has been done on the problem of selecting an appropriate regularization parameter λ, especially for linear models. Because of the relationship between training with jitter and regularization, that research may be helpful in selecting an appropriate noise level.
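As a simple illustration of the idea (a plain hold-out grid search, not the specific procedures of Holmström and Koistinen), the noise level could be selected as in the sketch below, which reuses `train_with_jitter` and `forward` from the earlier sketch; the candidate levels are arbitrary.

```python
import numpy as np

def choose_noise_level(X_tr, t_tr, X_val, t_val,
                       sigmas=(0.0, 0.05, 0.1, 0.2, 0.4)):
    """Train once per candidate noise level and keep the level whose
    network has the lowest squared error on held-out data."""
    best_err, best_sigma, best_params = np.inf, None, None
    for s in sigmas:
        params = train_with_jitter(X_tr, t_tr, sigma=s)
        y_val, _ = forward(params, X_val)
        err = np.mean((y_val - t_val) ** 2)
        if err < best_err:
            best_err, best_sigma, best_params = err, s, params
    return best_sigma, best_params
```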