
Chapter 16 - Heuristics for Improving Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

16.8 Training with Noisy Data

Many studies (e.g., [299], [118], [310], [387], [345], [246], [287], [267]) have noted that adding small amounts of input noise (jitter) to the training data often aids generalization and fault tolerance. Training with small amounts of added input noise embodies a smoothness assumption: we assume that slightly different inputs give approximately the same output. If the noise distribution is smooth, the network will interpolate among training points according to a smooth function of the distance to each training point.
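As a concrete sketch (not taken from the text), jitter is usually implemented by drawing fresh noise for each presentation of a training pattern, so the network never sees exactly the same input twice; the targets are left unchanged. The noise level sigma, the array names, and the train_step call below are hypothetical placeholders.

    import numpy as np

    def jittered_batches(X, T, batch_size, sigma, rng):
        """Yield minibatches whose inputs carry fresh Gaussian jitter; targets are unchanged."""
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            noisy_inputs = X[idx] + sigma * rng.standard_normal(X[idx].shape)
            yield noisy_inputs, T[idx]

    # Hypothetical use inside an ordinary training loop:
    # rng = np.random.default_rng(0)
    # for epoch in range(n_epochs):
    #     for noisy_X, T_batch in jittered_batches(X_train, T_train, 32, sigma=0.05, rng=rng):
    #         train_step(network, noisy_X, T_batch)   # any gradient-based weight update

Drawing new noise at every presentation, rather than jittering the data once before training, is what produces the averaging over the noise density described next.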

With jitter, the effective target function is the convolution of the actual target with the noise density [307], [306]; this is typically a smoothing operation. Averaging the network output over the input noise gives rise to terms related to the magnitude of the gradient of the transfer function and thus approximates regularization [307], [306], [45].
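The regularization connection can be made explicit with a second-order Taylor expansion. The sketch below is a standard argument along the lines of the cited works, under the assumptions that the noise n is zero mean with E[nnᵀ] = σ²I and that σ is small; f denotes the network output and t the target:

    \mathbb{E}_{n}\!\left[(f(x+n)-t)^{2}\right]
      \approx (f(x)-t)^{2}
      + \sigma^{2}\,\lVert\nabla f(x)\rVert^{2}
      + \sigma^{2}\,(f(x)-t)\,\operatorname{tr}\!\left(\nabla^{2} f(x)\right).

The middle term penalizes the squared gradient magnitude of the network function, which is the smoothing (Tikhonov-style) term referred to above; near a good fit the residual f(x) − t is small, so the last term contributes little.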

Training with jitter helps prevent overfitting in large networks by providing additional constraints because the effective target function is a continuous function defined over the entire input space whereas the original target function is defined only at the specific training points. This constrains the network and forces it to use excess degrees of freedom to approximate the smoothed target function rather than forming an arbitrarily complex surface that just happens to fit the sampled training data. Even though the network may be large, it models a simpler system.

Training with noisy inputs also gives rise to effects similar to weight decay and gain scaling. Gain scaling [228], [171] is a heuristic that has been proposed as a way of improving generalization. (Something like gain scaling is also used in [252] to "moderate" the outputs of a classifier.) Effects similar to training with jitter (and thus similar to regularization) can be achieved in single-hidden-layer networks by scaling the sigmoid gains [305], [306]. This is usually much more efficient than tediously averaging over many noisy samples. The scaling operation is equivalent to

where σ² is the variance of the input noise. This has properties similar to weight decay. The development of weight decay terms as a result of training single-layer linear perceptrons with input noise is shown in [167]. The effects of training with input noise and their relation to target smoothing, regularization, gain scaling, and weight decay are considered in more detail in chapter 17.
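The exact scaling formula is not reproduced here, but its flavor can be illustrated with a standard Gaussian-smoothed sigmoid approximation: jitter with variance σ² on the inputs adds variance σ²‖w‖² to a unit's net input, and the noise-averaged output is roughly a sigmoid with reduced gain. The weights, input, noise level, and the 1/√(1 + πs²/8) scaling used below are illustrative assumptions, not the formula from [305], [306].

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    rng = np.random.default_rng(0)
    w = np.array([1.5, -2.0, 0.5])     # hypothetical hidden-unit weights
    x = np.array([0.3, 0.8, -0.1])     # a single input pattern
    sigma = 0.2                        # std. dev. of the input jitter

    # Brute force: average the unit's output over many jittered copies of x.
    noisy_x = x + sigma * rng.standard_normal((100_000, x.size))
    mc_average = sigmoid(noisy_x @ w).mean()

    # Gain scaling: replace the averaging by a single evaluation of a reduced-gain sigmoid.
    s2 = sigma**2 * np.dot(w, w)       # variance induced on the net input by the jitter
    gain_scaled = sigmoid((x @ w) / np.sqrt(1.0 + np.pi * s2 / 8.0))

    print(mc_average, gain_scaled)     # the two values should be close for small sigma

A single scaled evaluation replaces the Monte Carlo average over noisy samples, which is the efficiency advantage mentioned above.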