
Chapter 17 - Effects of Training with Noisy Inputs

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

17.6 Further Examples

17.6.1 Static Noise

The use of dynamic jitter may interfere with some training algorithms because the measured error changes from moment to moment due to the jitter. Algorithms that adapt the learning rate depending on the change in error from one iteration to the next or algorithms that use information from previous iterations to choose the next search point could become unstable. It may also be inconvenient to add dynamic jitter to the data in closed simulation systems. In these cases it may be useful to use static noise, that is, to create a larger fixed training set by adding noisy versions of the original patterns.
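The construction described above — replacing each original pattern with several Gaussian-jittered copies to form a larger, fixed training set — might be sketched as follows. The function name and parameter defaults are illustrative, not from the book; jitter is applied to the inputs only, and the targets are repeated unchanged.

```python
import numpy as np

def make_static_noisy_set(X, Y, copies=30, sigma=0.1, seed=0):
    """Build an enlarged, fixed training set by replacing each original
    pattern with `copies` Gaussian-jittered versions of its input.
    Illustrative sketch; names and defaults are not from the book."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    # Repeat each input `copies` times, then add zero-mean Gaussian noise.
    Xb = np.repeat(X, copies, axis=0)
    Xb = Xb + rng.normal(0.0, sigma, size=Xb.shape)
    # Targets are repeated unchanged: the jitter perturbs inputs, not labels.
    Yb = np.repeat(Y, copies, axis=0)
    return Xb, Yb

# Example: 31 two-dimensional points with 30 jittered copies each
# gives a 930-point static noisy set (labels here are arbitrary).
X = np.random.default_rng(1).uniform(-1, 1, size=(31, 2))
Y = np.sign(X[:, 0] + X[:, 1])
Xb, Yb = make_static_noisy_set(X, Y, copies=30, sigma=0.1)
print(Xb.shape)  # (930, 2)
```

As in the experiment below, the number of copies can be chosen so that the enlarged set has more training patterns than the network has weights.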

Figure 17.7 illustrates the effect of training with a static noisy data set. Figure 17.7(a) shows the surface learned by a 2/50/10/1 network trained on the original 31 data points (724 epochs with RProp). The 31 points are almost linearly separable, but with 671 weights the network is very underconstrained and chooses a complex decision surface with sharp transitions. A static noisy data set of 930 points was generated by perturbing each of the original points with Gaussian noise (σ = 0.1) 30 times. (Thirty was chosen to give more training patterns than weights. Simulations using 5 and 10 noisy patterns per original point yielded complex boundaries.) The original points were not included in the new training set. Figure 17.7(b) shows the surface learned by a network initialized with the same weights

Figure 17.7: Training with static noise: (a) response of an underconstrained 2/50/10/1 net. The boundary is complex and transitions are steep, but the data is almost linearly separable. (b) Response of the same net trained on an enlarged data set obtained by replacing each original training point by 30 noisy points (σ = 0.1). The boundary is simpler and transitions are more gradual, but a few kinks remain. (1s and 0s denote the training points; the training values were 0.9 and -0.9.)
Figure 17.8: Cross-validation with jittered data. An artificial validation data set was created by generating 30 jittered points from each of the original 31 training points: (a) response of an underconstrained 2/50/10/1 net trained to convergence, and (b) response of the net with the best RMS error on the validation set (1s and 0s denote the training points; the training values were 0.9 and -0.9).

after 2500 epochs of RProp training. In most places, the decision surface is less complex and the transitions are more gradual, but a few kinks remain. Evidently the network was still able to exploit idiosyncrasies in the data, so 930 points were perhaps not enough to constrain the network and prevent overtraining. (The second network was trained for a much longer time, however: 2,325,000 pattern presentations versus 22,444. The kinks might not have developed had the net been trained for an equivalent number of pattern presentations or an equivalent number of epochs, but this indicates that the augmented data by itself was not enough to prevent overtraining.)

17.6.2 Cross-Validation with Jittered Data

An artificial validation data set was created by generating 30 jittered points from each of the original 31 training points. Figure 17.8(a) shows the response of the network trained to convergence. Overfitting is obvious. Figure 17.8(b) shows the response of the network with the best validation error. Final convergence of the overtrained net occurred at 1365 epochs; the best validation error was observed at 165 epochs.
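The early-stopping procedure above can be sketched with a deliberately trivial stand-in model (a single weight fitted to 1-D data) so that the mechanics of the jittered validation set stand out; the toy model, data, and learning rate are illustrative assumptions, not the book's 2/50/10/1 network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 31 one-dimensional points from y = 2x
# (a stand-in for the book's 31 two-dimensional patterns).
X = rng.uniform(-1, 1, 31)
Y = 2.0 * X

# Artificial validation set: 30 jittered copies of each training input
# (sigma = 0.1), with the original targets carried along unchanged.
sigma, copies = 0.1, 30
Xv = np.repeat(X, copies) + rng.normal(0.0, sigma, 31 * copies)
Yv = np.repeat(Y, copies)

# Train a one-parameter model y = w*x by gradient descent, keeping the
# weight that gives the best RMS error on the jittered validation set.
w, lr = 0.0, 0.1
best_w, best_rms, best_epoch = w, np.inf, 0
for epoch in range(200):
    grad = np.mean((w * X - Y) * X)      # d/dw of 0.5 * mean((w*x - y)^2)
    w -= lr * grad
    rms = np.sqrt(np.mean((w * Xv - Yv) ** 2))
    if rms < best_rms:
        best_w, best_rms, best_epoch = w, rms, epoch
```

With a real, underconstrained network the validation RMS eventually rises again as the net starts fitting idiosyncrasies of the training points, and `best_epoch` marks the stopping point well before final convergence.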

More sophisticated versions of this approach are described by Musavi et al. [281] and Pados and Papantoni-Kazakos [293]. In both, the joint density f(X, Y) is estimated by fitting Gaussians (not necessarily spherical) around each point. This is done by a radial basis function network in [293]. The resulting density estimate can then be used to estimate the generalization error of another network or, as in section 17.6.1, to generate a larger set of artificial training data.
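A greatly simplified stand-in for these schemes is a Parzen-window estimate of the joint density with a spherical Gaussian kernel on each training pair; sampling from it amounts to picking a training pair at random and jittering both its input and target. The cited methods fit more general (non-spherical) Gaussians, so the sketch below should be read only as the basic idea, with hypothetical names and parameters.

```python
import numpy as np

def sample_parzen(X, Y, n, sigma=0.1, seed=0):
    """Draw n artificial (x, y) pairs from a spherical-Gaussian Parzen
    estimate of the joint density f(X, Y) built on the training pairs.
    Simplified stand-in for the schemes in [281] and [293]."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Sampling from the mixture = pick a kernel center uniformly at random,
    # then add Gaussian noise to both the input and the target.
    idx = rng.integers(0, len(X), size=n)
    Xs = X[idx] + rng.normal(0.0, sigma, size=(n,) + X.shape[1:])
    Ys = Y[idx] + rng.normal(0.0, sigma, size=(n,) + Y.shape[1:])
    return Xs, Ys

# Example: expand 31 two-dimensional pairs into 500 artificial pairs.
Xt = np.random.default_rng(2).uniform(-1, 1, size=(31, 2))
Yt = np.sign(Xt[:, 0] + Xt[:, 1])
Xa, Ya = sample_parzen(Xt, Yt, 500)
```

Note that, unlike the static-noise construction of section 17.6.1, this perturbs the targets as well as the inputs, since it models the joint density f(X, Y).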

Figure 17.9: Smoothing an overtrained response. Given an overtrained net, a better estimate of the true function at a point x might be obtained by averaging a number of probes around x using a noisy input. The figure shows the expected response of the network in figure 17.7(a) to a noisy input (σ = 0.1) (1s and 0s denote the training points; the training values were 0.9 and -0.9).

17.6.3 Jitter Used to Discount an Overtrained Response

When system response time is not a critical consideration, averaging with jitter might be used to smooth the output of an overtrained network to obtain a more reliable response. That is, given an overtrained net, a better estimate of the true function at a point x might be obtained by averaging a number of probes around x using a noisy input.

Figure 17.9 shows the expected response of the overtrained network in figure 17.7(a) to a noisy input (σ = 0.1). Unlike training with dynamic jitter, which slows training by requiring a small learning rate, or training with static jitter, which slows training by increasing the size of the training set, this allows fast training but mitigates the worst effects of overtraining at the expense of a slightly slower response during recall.
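The recall-time averaging might be sketched as a Monte Carlo estimate of the expected response to a noisy input; the function name, probe count, and the hard-threshold stand-in for an overtrained net are all illustrative assumptions.

```python
import numpy as np

def smoothed_output(net, x, sigma=0.1, probes=100, seed=0):
    """Estimate E[net(x + n)], n ~ N(0, sigma^2 I), by averaging the
    network's response over `probes` jittered copies of x.
    Illustrative sketch; names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Xj = x + rng.normal(0.0, sigma, size=(probes,) + x.shape)
    return np.mean([net(xi) for xi in Xj], axis=0)

# A hard threshold stands in for an overtrained net with steep transitions.
net = lambda x: 0.9 if x.sum() > 0 else -0.9

# Near the boundary, the averaged response varies gradually between the
# hard outputs instead of jumping from -0.9 to 0.9.
y = smoothed_output(net, np.array([0.02, 0.0]), sigma=0.1)
```

This trades `probes` forward passes per query for the smoothing effect, which is the "slightly slower response during recall" noted above.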