
Chapter 17 - Effects of Training with Noisy Inputs

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

17.3 Training with Jitter and Sigmoid Scaling

A drawback of training with jitter is that it requires the use of a small learning rate and many sample presentations in order to average over the noise. In certain special cases, the expected response of a network driven by a jittered input can be approximated by simply adjusting the sigmoid slopes. This is, of course, much faster than averaging over the noise. This result provides justification for gain scaling as a heuristic for improving generalization.

17.3.1 Linear Output Networks

Consider the function

f(x) = Σ_k v_k g(u_k)    (17.16)

where

u_k = w_k^T x + θ_k    (17.17)

and g(·) is the node nonlinearity. This describes a single-hidden-layer network with a linear output: w_k and θ_k are the weight vector and bias of hidden unit k, and v_k is its output weight.
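To make the notation concrete, here is a minimal sketch (ours, not from the book) of the network in equations 17.16 and 17.17, using the standard normal CDF (scipy.stats.norm.cdf) as the GCDF nonlinearity; the names gcdf_net, W, theta, and V are illustrative.

import numpy as np
from scipy.stats import norm   # norm.cdf is the Gaussian cumulative distribution function

def gcdf_net(X, W, theta, V):
    """Single-hidden-layer network with GCDF hidden units and a linear output.
    X: (n_points, n_inputs), W: (n_hidden, n_inputs), theta: (n_hidden,), V: (n_hidden,)."""
    U = X @ W.T + theta        # u_k = w_k^T x + theta_k   (eq. 17.17)
    return norm.cdf(U) @ V     # f(x) = sum_k v_k g(u_k)   (eq. 17.16)

# Toy example: two inputs and three hidden units, as in figure 17.5.
rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
print(gcdf_net(np.array([[0.2, -0.5]]), W, theta, V))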

With jitter (and the approximations stated for equation 17.2), the expected output for a fixed input x, with n denoting the added noise, is

E[f(x + n)] ≈ Σ_k v_k [g(u_k(x)) * p_n(x)],    (17.18)
Figure 17.4: The conventional sigmoid 1/(1 + e^-x) and the Gaussian cumulative distribution function (GCDF) (with σ = 4/√(2π)) have very similar shapes and give similar results when used as the node nonlinearities. The GCDF is useful in this analysis because its shape is invariant when convolved with a spherical Gaussian noise density.

that is, a linear sum of convolutions of the hidden unit responses with the noise density. The symbol * denotes correlation,

g(u_k(x)) * p_n(x) = ∫ g(u_k(v)) p_n(v − x) dv,

which is different from convolution but the operations can be interchanged here if p_n(x) is symmetric.
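In practice, the expectation in equation 17.18 is what jitter training approximates by brute force: present many noisy copies of x and average the outputs. A minimal, self-contained sketch of that Monte Carlo average (our own illustration, with σ chosen arbitrarily):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
f = lambda X: norm.cdf(X @ W.T + theta) @ V      # the network of eqs. 17.16-17.17

x, sigma = np.array([0.2, -0.5]), 0.1            # fixed input and noise standard deviation
noise = rng.normal(scale=sigma, size=(2000, 2))  # n ~ N(0, sigma^2 I), 2000 jittered samples
print("E[f(x + n)] ~", f(x + noise).mean())      # Monte Carlo estimate of eq. 17.18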

In most neural network applications, the nonlinearity is the sigmoid g(z) = 1/(1 + e^-z). If, instead, we use the Gaussian cumulative distribution function (GCDF), which has a very similar shape (see figure 17.4), then the shape of the nonlinearity will be invariant to convolution with a Gaussian input noise density. That is, if we assume that the noise is zero-mean Gaussian and spherically distributed in N dimensions,

p_n(x) = (2πσ²)^(−N/2) exp(−‖x‖²/(2σ²))    (17.19)

(where ‖x‖² = x^T x), and the g nonlinearity is the Gaussian cumulative distribution function

g(u) = ∫_{−∞}^{u} (1/√(2π)) e^(−t²/2) dt,    (17.20)

then the convolution in equation 17.18 can be replaced by a simple scaling operation

g(u_k(x)) * p_n(x) = g(a_k u_k(x)),    (17.21)

where a_k is a scaling constant defined below. A derivation is given in appendix C.2.

The significance of this is that when the equivalence (17.21) holds, the expected response of the network to input noise, as approximated by (17.18), can be computed exactly by simply scaling the hidden unit nonlinearities appropriately; we do not have to go through the time-consuming process of estimating the response by averaging over many noisy samples. That is,

E[f(x + n)] ≈ Σ_k v_k g(a_k u_k(x)),    (17.22)

where the scaling constant a_k depends on the magnitude of the weight vector w_k and the noise variance,

a_k = 1/√(1 + σ²‖w_k‖²).    (17.23)

Note that the bias θ_k is not included in the weight vector and has no role in the computation of a_k. It is, however, scaled by a_k.
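As a numerical check (ours, assuming the unit-variance GCDF of equation 17.20, so that a_k takes the form given in equation 17.23), the scaled response of equation 17.22 can be compared against a brute-force average over jittered inputs:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
x, sigma = np.array([0.2, -0.5]), 0.1

# Scaled expected response (eqs. 17.22-17.23): scale both w_k and theta_k by a_k.
a = 1.0 / np.sqrt(1.0 + sigma**2 * np.sum(W**2, axis=1))
scaled = norm.cdf(a * (W @ x + theta)) @ V

# Brute-force estimate of eq. 17.18 by averaging over many jittered copies of x.
noise = rng.normal(scale=sigma, size=(200_000, 2))
averaged = (norm.cdf((x + noise) @ W.T + theta) @ V).mean()

print(scaled, averaged)   # the two agree to within Monte Carlo error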

This does not mean that we can train an arbitrary network without jitter and then simply scale the sigmoids to obtain exactly the network that would result from training with jitter, because the equivalence does not account for the dynamics of training with random noise; it does, however, suggest similarities between the two.

Example Figures 17.5(a) through (f) verify this scaling property. Figures 17.5(a) and (b) show the response of a network with two inputs, three GCDF hidden units, and a linear output unit. Figures 17.5(c) and (d) show the average response using spherically distributed Gaussian noise with σ = 0.1, averaged over 2000 noisy samples per grid point. Figures 17.5(e) and (f) show the expected response computed by scaling the hidden units. The RMS error (on a 64 × 64 grid) between the averaged noisy response and the scaled expected response is 0.0145. The scaled expected response was computed in a few seconds; the averaged noisy response required hours on the same computer.

17.3.2 Relation to Weight Decay

The scaling operation is equivalent to replacing each hidden unit's weights and bias by

w̃_k = w_k/√(1 + σ²‖w_k‖²),    θ̃_k = θ_k/√(1 + σ²‖w_k‖²).

Because the denominator is not less than 1, this always reduces the magnitude of w_k or leaves it unchanged. When σ = 0 (no input noise), the weights are unchanged.

Figure 17.5: Equivalence of weight scaling and jitter averaging: (a) the transfer function of the original network and (b) its contour plot; (c) the average response with additive Gaussian input noise, σ = 0.1, averaged over 2000 noisy samples per grid point and (d) its contour plot; and (e) the expected response computed by scaling and (f) its contour plot.
Figure 17.6: Weight-decay effects of training with jitter: (a) weights for the overtrained network of figure 17.3(a), σ = 0.7262; and (b) weights for the jitter-trained network of figure 17.3(b), σ = 0.3841.
As σ → ∞, the weights approach zero. When ‖w‖σ is small, the scaling has little effect. When ‖w‖σ is large, the scaling is approximately

w̃ ≈ w/(σ‖w‖)

and the magnitude of w is reduced to approximately 1/σ. This has some properties similar to weight decay [299], [388], [387], [308], another commonly used heuristic for improving generalization. The development of weight-decay terms as a result of training single-layer linear perceptrons with input noise is shown in [167].
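The limiting behavior is easy to see numerically. A small sketch (ours, using the same rescaling w → w/√(1 + σ²‖w‖²) assumed above):

import numpy as np

def shrink(w, sigma):
    """Weight-vector rescaling implied by sigmoid scaling: w -> w / sqrt(1 + sigma^2 ||w||^2)."""
    return w / np.sqrt(1.0 + sigma**2 * np.dot(w, w))

sigma = 0.5
for norm_w in (0.1, 1.0, 10.0, 1000.0):
    w = np.array([norm_w, 0.0])                     # weight vector with ||w|| = norm_w
    print(norm_w, "->", np.linalg.norm(shrink(w, sigma)))
# Small ||w||*sigma: nearly unchanged.  Large ||w||*sigma: the norm approaches 1/sigma = 2.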

This is supported by figure 17.6, which shows histograms of the weights for the overtrained and jitter-trained networks of figure 17.3. Table 17.1 lists the standard deviations of the weights by layer. The jitter-trained network has smaller weight variance at every level.

Table 17.1: Weight-decay Effects of Training with Jitter. Training with Jitter Tends to Produce Smaller Weights.
 

                        Standard Deviations of Weights
                      overtrained network   jittered network   number of weights
In to H1 weights            1.1153                .5904                150
H1 to H2 weights             .5197                .2204                510
H2 to Out weights           1.6828                .4008                 11
All weights                  .7262                .3481                671