
Chapter 17 - Effects of Training with Noisy Inputs

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

17.3 Training with Jitter and Sigmoid Scaling

A drawback of training with jitter is that it requires the use of a small learning rate and many sample presentations in order to average over the noise. In certain special cases, the expected response of a network driven by a jittered input can be approximated by simply adjusting the sigmoid slopes. This is, of course, much faster than averaging over the noise. This result provides justification for gain scaling as a heuristic for improving generalization.

17.3.1 Linear Output Networks

Consider the function

f(x) = Σ_k v_k g(u_k)    (17.16)

where

u_k = w_k^T x + θ_k    (17.17)

and g(·) is the node nonlinearity. This describes a single-hidden-layer network with a linear output: w_k and θ_k are the weight vector and bias of hidden unit k, and v_k is its output weight.
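To make the notation concrete, here is a minimal sketch (ours, not from the book) of the network in equations 17.16 and 17.17, using the standard normal CDF (scipy.stats.norm.cdf) as the GCDF nonlinearity; the names gcdf_net, W, theta, and V are illustrative.

import numpy as np
from scipy.stats import norm   # norm.cdf is the Gaussian cumulative distribution function

def gcdf_net(X, W, theta, V):
    """Single-hidden-layer network with GCDF hidden units and a linear output.
    X: (n_points, n_inputs), W: (n_hidden, n_inputs), theta: (n_hidden,), V: (n_hidden,)."""
    U = X @ W.T + theta        # u_k = w_k^T x + theta_k   (eq. 17.17)
    return norm.cdf(U) @ V     # f(x) = sum_k v_k g(u_k)   (eq. 17.16)

# Toy example: two inputs and three hidden units, as in figure 17.5.
rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
print(gcdf_net(np.array([[0.2, -0.5]]), W, theta, V))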

With jitter (and the approximations stated for equation 17.2), the expected output for a fixed input x, with n denoting the added noise, is

E[f(x + n)] ≈ Σ_k v_k [g(u_k(x)) * p_n(x)],    (17.18)
Figure 17.4: The conventional sigmoid 1/(1 + e^-x) and the Gaussian cumulative distribution function (GCDF) (with σ = 4/√(2π)) have very similar shapes and give similar results when used as the node nonlinearities. The GCDF is useful in this analysis because its shape is invariant when convolved with a spherical Gaussian noise density.

that is, a linear sum of convolutions of the hidden unit responses with the noise density. The symbol * denotes correlation,

g(u_k(x)) * p_n(x) = ∫ g(u_k(v)) p_n(v − x) dv,

which is different from convolution but the operations can be interchanged here if p_n(x) is symmetric.
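In practice, the expectation in equation 17.18 is what jitter training approximates by brute force: present many noisy copies of x and average the outputs. A minimal, self-contained sketch of that Monte Carlo average (our own illustration, with σ chosen arbitrarily):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
f = lambda X: norm.cdf(X @ W.T + theta) @ V      # the network of eqs. 17.16-17.17

x, sigma = np.array([0.2, -0.5]), 0.1            # fixed input and noise standard deviation
noise = rng.normal(scale=sigma, size=(2000, 2))  # n ~ N(0, sigma^2 I), 2000 jittered samples
print("E[f(x + n)] ~", f(x + noise).mean())      # Monte Carlo estimate of eq. 17.18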

In most neural network applications, the nonlinearity is the sigmoid g(z) = 1/(1 + e^-z). If, instead, we use the Gaussian cumulative distribution function (GCDF), which has a very similar shape (see figure 17.4), then the shape of the nonlinearity will be invariant to convolution with a Gaussian input noise density. That is, if we assume that the noise is zero-mean Gaussian and spherically distributed in N dimensions,

p_n(x) = (2πσ²)^(−N/2) exp(−‖x‖²/(2σ²))    (17.19)

(where ‖x‖² = x^T x), and the g nonlinearity is the Gaussian cumulative distribution function

g(u) = ∫_{−∞}^{u} (1/√(2π)) e^(−t²/2) dt,    (17.20)

then the convolution in equation 17.18 can be replaced by a simple scaling operation

g(u_k(x)) * p_n(x) = g(a_k u_k(x)),    (17.21)

where a_k is a scaling constant defined below. A derivation is given in appendix C.2.

The significance of this is that when the equivalence (17.21) holds, the expected response of the network to input noise, as approximated by (17.18), can be computed exactly by simply scaling the hidden unit nonlinearities appropriately; we do not have to go through the time-consuming process of estimating the response by averaging over many noisy samples. That is,

E[f(x + n)] ≈ Σ_k v_k g(a_k u_k(x)),    (17.22)

where the scaling constant a_k depends on the magnitude of the weight vector w_k and the noise variance,

a_k = 1/√(1 + σ²‖w_k‖²).    (17.23)

Note that the bias θ_k is not included in the weight vector and has no role in the computation of a_k. It is, however, scaled by a_k.
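As a numerical check (ours, assuming the unit-variance GCDF of equation 17.20, so that a_k takes the form given in equation 17.23), the scaled response of equation 17.22 can be compared against a brute-force average over jittered inputs:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
W, theta, V = rng.normal(size=(3, 2)) * 4.0, rng.normal(size=3), rng.normal(size=3)
x, sigma = np.array([0.2, -0.5]), 0.1

# Scaled expected response (eqs. 17.22-17.23): scale both w_k and theta_k by a_k.
a = 1.0 / np.sqrt(1.0 + sigma**2 * np.sum(W**2, axis=1))
scaled = norm.cdf(a * (W @ x + theta)) @ V

# Brute-force estimate of eq. 17.18 by averaging over many jittered copies of x.
noise = rng.normal(scale=sigma, size=(200_000, 2))
averaged = (norm.cdf((x + noise) @ W.T + theta) @ V).mean()

print(scaled, averaged)   # the two agree to within Monte Carlo error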

This does not mean that we can train an arbitrary network without jitter and then simply scale the sigmoids to obtain exactly the network that would result from training with jitter, because the equivalence does not account for the dynamics of training with random noise; it does, however, suggest similarities between the two.

Example Figures 17.5(a) through (f) verify this scaling property. Figures 17.5(a) and (b) show the response of a network with two inputs, three GCDF hidden units, and a linear output unit. Figures 17.5(c) and (d) show the average response using spherically distributed Gaussian noise with σ = 0.1, averaged over 2000 noisy samples per grid point. Figures 17.5(e) and (f) show the expected response computed by scaling the hidden units. The RMS error (on a 64 × 64 grid) between the averaged noisy response and the scaled expected response is 0.0145. The scaled expected response was computed in a few seconds; the averaged noisy response required hours on the same computer.

17.3.2 Relation to Weight Decay

The scaling operation is equivalent to replacing each hidden unit's weights and bias by

w̃_k = w_k/√(1 + σ²‖w_k‖²),    θ̃_k = θ_k/√(1 + σ²‖w_k‖²).

Because the denominator is not less than 1, this always reduces the magnitude of w_k or leaves it unchanged. When σ = 0 (no input noise), the weights are unchanged.

Figure 17.5: Equivalence of weight scaling and jitter averaging: (a) the transfer function of the original network and (b) its contour plot; (c) the average response with additive Gaussian input noise, σ = 0.1, averaged over 2000 noisy samples per grid point and (d) its contour plot; and (e) the expected response computed by scaling and (f) its contour plot.
Figure 17.6: Weight-decay effects of training with jitter: (a) weights for the overtrained network of figure 17.3(a), σ = 0.7262; and (b) weights for the jitter-trained network of figure 17.3(b), σ = 0.3841.
As σ → ∞, the weights approach zero. When ‖w‖σ is small, the scaling has little effect. When ‖w‖σ is large, the scaling is approximately

w̃ ≈ w/(σ‖w‖)

and the magnitude of w is reduced to approximately 1/σ. This has some properties similar to weight decay [299], [388], [387], [308], another commonly used heuristic for improving generalization. The development of weight-decay terms as a result of training single-layer linear perceptrons with input noise is shown in [167].
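The limiting behavior is easy to see numerically. A small sketch (ours, using the same rescaling w → w/√(1 + σ²‖w‖²) assumed above):

import numpy as np

def shrink(w, sigma):
    """Weight-vector rescaling implied by sigmoid scaling: w -> w / sqrt(1 + sigma^2 ||w||^2)."""
    return w / np.sqrt(1.0 + sigma**2 * np.dot(w, w))

sigma = 0.5
for norm_w in (0.1, 1.0, 10.0, 1000.0):
    w = np.array([norm_w, 0.0])                     # weight vector with ||w|| = norm_w
    print(norm_w, "->", np.linalg.norm(shrink(w, sigma)))
# Small ||w||*sigma: nearly unchanged.  Large ||w||*sigma: the norm approaches 1/sigma = 2.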

This is supported by figure 17.6, which shows histograms of the weights for the overtrained and jitter-trained networks of figure 17.3. Table 17.1 lists the standard deviations of the weights by layer. The jitter-trained network has smaller weight variance at every level.

Table 17.1: Weight-decay Effects of Training with Jitter. Training with Jitter Tends to Produce Smaller Weights.
 

                        Standard Deviations of Weights
                      overtrained network   jittered network   number of weights
In to H1 weights            1.1153                .5904                150
H1 to H2 weights             .5197                .2204                510
H2 to Out weights           1.6828                .4008                 11
All weights                  .7262                .3481                671