A drawback of training with jitter is that it requires the use of a small learning rate and many sample presentations in order to average over the noise. In certain special cases, the expected response of a network driven by a jittered input can be approximated by simply adjusting the sigmoid slopes. This is, of course, much faster than averaging over the noise. This result provides justification for gain scaling as a heuristic for improving generalization.
Consider the function

$$ y(x) = \sum_{k} v_k\, h_k(x) , $$

where

$$ h_k(x) = g\!\left(w_k^{T} x + \theta_k\right) $$

and g(·) is the node nonlinearity. This describes a single-hidden-layer network with a linear output.
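To make the setup concrete, here is a minimal NumPy sketch of such a network, using the standard Gaussian CDF as g(·); the function and variable names (gcdf, forward, W, theta, v) are my own and not from the text.

```python
import numpy as np
from scipy.special import erf

def gcdf(z):
    """Standard Gaussian cumulative distribution function, used as the nonlinearity g(.)."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def forward(x, W, theta, v):
    """Single-hidden-layer network with a linear output:
    y(x) = sum_k v[k] * g(W[k] . x + theta[k])."""
    h = gcdf(W @ x + theta)   # hidden unit responses h_k(x)
    return v @ h              # linear combination at the output
```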
With jitter (and the approximations stated for equation 17.2), the expected output for a fixed input x is

$$ E\big[y(x + n)\big] \approx \sum_k v_k\, (h_k * p_n)(x) , \qquad (17.18) $$

that is, a linear sum of convolutions of the hidden unit responses with the noise density. The symbol * denotes correlation,

$$ (h_k * p_n)(x) = \int h_k(x + u)\, p_n(u)\, du . $$
In most neural network applications, the nonlinearity is the sigmoid g(z) = 1/(1 + e^{-z}). If, instead, we use the Gaussian cumulative distribution function (GCDF), which has a very similar shape (see figure 17.4), then the shape of the nonlinearity will be invariant to convolution with a Gaussian input noise density. That is, if we assume that the noise is zero-mean Gaussian and spherically distributed in N dimensions,

$$ p_n(n) = \frac{1}{(2\pi\sigma_1^2)^{N/2}} \exp\!\left(-\frac{\|n\|^2}{2\sigma_1^2}\right) $$

(where ||x||^2 = x^T x) and the g nonlinearity is the Gaussian cumulative distribution function

$$ g(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt , $$

then the convolution in equation 17.18 can be replaced by a simple scaling operation

$$ (h_k * p_n)(x) = g\big(a_k\,(w_k^T x + \theta_k)\big) , \qquad (17.21) $$
where ak is a scaling constant defined below. A derivation is given in appendix C.2.
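As a quick illustration (not from the text), the invariance can be checked numerically in one dimension: correlating a GCDF unit with a Gaussian noise density returns a GCDF with a rescaled argument. The weight, bias, noise level, and integration grid below are arbitrary choices of mine.

```python
import numpy as np
from scipy.special import erf

gcdf = lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0)))
w, theta, sigma1 = 2.0, -0.5, 0.4            # illustrative weight, bias, noise std. dev.
x = 0.7                                      # fixed input

# E[g(w(x + n) + theta)] by numerical integration over the noise density p_n.
n = np.linspace(-6 * sigma1, 6 * sigma1, 20001)
dn = n[1] - n[0]
p = np.exp(-n**2 / (2 * sigma1**2)) / np.sqrt(2 * np.pi * sigma1**2)
expected = np.sum(gcdf(w * (x + n) + theta) * p) * dn

# The same quantity by scaling the argument of the GCDF.
a = 1.0 / np.sqrt(1.0 + (sigma1 * w)**2)
scaled = gcdf(a * (w * x + theta))

print(expected, scaled)                      # agreement to several decimal places
```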
The significance of this is that when the equivalence (17.21) holds, the expected response of the network to input noise approximated by (17.18) can be computed exactly by simply scaling the hidden unit nonlinearities appropriately; we do not have to go through the time-consuming process of estimating the response by averaging over many noisy samples. That is,

$$ E\big[y(x + n)\big] \approx \sum_k v_k\, g\big(a_k\,(w_k^T x + \theta_k)\big) , $$

where the scaling constant ak depends on the magnitude of the weight vector wk and the noise variance,

$$ a_k = \frac{1}{\sqrt{1 + \sigma_1^2\, \|w_k\|^2}} . $$

Note that the bias θk is not included in the weight vector and has no role in the computation of ak. It is, however, scaled by ak.
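In code, the shortcut is a one-line change to the forward pass. The sketch below assumes standard-GCDF hidden units and spherical Gaussian noise with standard deviation sigma1, as above; the names are again my own.

```python
import numpy as np
from scipy.special import erf

def gcdf(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def expected_output(x, W, theta, v, sigma1):
    """Expected network output under zero-mean spherical Gaussian jitter of
    variance sigma1**2, computed by gain scaling rather than by averaging."""
    a = 1.0 / np.sqrt(1.0 + sigma1**2 * np.sum(W**2, axis=1))  # a_k per hidden unit
    u = W @ x + theta                                          # w_k . x + theta_k
    return v @ gcdf(a * u)                                     # weights and bias scaled alike
```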
This does not say that we can train an arbitrary network without jitter and then simply scale the sigmoids to obtain exactly the network that would result from training with jitter; the result does not account for the dynamics of training with random noise. It does, however, suggest similarities.
Example Figures 17.5(a) through (f) verify this scaling property. Figures 17.5(a) and (b) show the response of a network with two inputs, three GCDF hidden units, and a linear output unit. Figures 17.5(c) and (d) show the average response under spherically distributed Gaussian noise, averaged over 2000 noisy samples per grid point. Figures 17.5(e) and (f) show the expected response computed by scaling the hidden units. The RMS error (on a 64 × 64 grid) between the averaged noisy response and the scaled expected response is 0.0145. The scaled expected response was computed in a few seconds; the average noisy response required hours on the same computer.
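A small-scale version of this check is easy to reproduce. The sketch below is not the network of figure 17.5; it uses random weights and a single test point, so only the agreement between the averaged and scaled responses carries over, not the figures themselves.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
N, H = 2, 3                                   # two inputs, three GCDF hidden units
W = rng.normal(size=(H, N))
theta = rng.normal(size=H)
v = rng.normal(size=H)
sigma1 = 0.3                                  # illustrative noise standard deviation
x = np.array([0.5, -0.25])                    # one test point

gcdf = lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

# Average response over many jittered copies of x (the slow way).
noise = rng.normal(scale=sigma1, size=(200_000, N))
avg = np.mean(gcdf((x + noise) @ W.T + theta) @ v)

# Expected response by scaling the hidden units (the fast way).
a = 1.0 / np.sqrt(1.0 + sigma1**2 * np.sum(W**2, axis=1))
scaled = v @ gcdf(a * (W @ x + theta))

print(avg, scaled)                            # should agree to within sampling error
```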
The scaling operation is equivalent to replacing the weights and bias by

$$ w_k' = \frac{w_k}{\sqrt{1 + \sigma_1^2\, \|w_k\|^2}}, \qquad \theta_k' = \frac{\theta_k}{\sqrt{1 + \sigma_1^2\, \|w_k\|^2}} . $$

Because the denominator is not less than 1, this always reduces the magnitude of w or leaves it unchanged. When σ1 = 0 (no input noise), the weights are unchanged. When σ1||w|| >> 1, the denominator is approximately σ1||w||, and the magnitude of w is reduced to approximately 1/σ1. This has some properties similar to weight decay [299], [388], [387], [308], another commonly used heuristic for improving generalization. The development of weight decay terms as a result of training single-layer linear perceptrons with input noise is shown in [167].
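As an illustration of this shrinkage, a brief sketch (my own code, with an arbitrary weight vector and an assumed noise level) showing that the rescaled weight magnitude approaches 1/σ1:

```python
import numpy as np

def rescale(w, theta, sigma1):
    """Equivalent weight/bias rescaling: divide by sqrt(1 + sigma1^2 ||w||^2), which is >= 1."""
    d = np.sqrt(1.0 + sigma1**2 * np.dot(w, w))
    return w / d, theta / d

w = np.array([30.0, -40.0])                   # deliberately large weight vector, ||w|| = 50
w_new, _ = rescale(w, theta=0.0, sigma1=0.5)
print(np.linalg.norm(w_new))                  # about 2.0, i.e., roughly 1/sigma1
```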
This weight-decay-like behavior is supported by figure 17.6, which shows histograms of the weights for the overtrained and jitter-trained networks of figure 17.3. Table 17.1 lists the standard deviations of the weights by layer. The jitter-trained network has smaller weight variance in every layer.
Table 17.1 Standard Deviations of Weights

| | Overtrained network | Jittered network | Number of weights |
|---|---|---|---|
| In to H1 weights | 1.1153 | .5904 | 150 |
| H1 to H2 weights | .5197 | .2204 | 510 |
| H2 to Out weights | 1.6828 | .4008 | 11 |
| All weights | .7262 | .3481 | 671 |