Chapter 17 - Effects of Training with Noisy Inputs

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

17.4 Extension to General Layered Neural Networks

The results discussed previously relating training with jittered data to regularization hold for any network. The analysis for gain scaling, however, is valid only for networks with a single hidden layer and a linear output node. More general feedforward networks have multiple layers and nonlinear output nodes. Even though the invariance property does not hold for these networks, the results lend justification to the idea of gain scaling [228], [171] and weight decay as heuristics for improving generalization.

The gain scaling analysis uses a GCDF nonlinearity in place of the usual sigmoid nonlinearity. Because these functions have similar shapes, the substitution makes little difference in terms of representation capability. (Differences might be observed in training dynamics, however, because the GCDF has flatter tails.) The precise form of the sigmoid is usually not important as long as it is monotonically nondecreasing; the usual sigmoid is widely used because its derivative is easily calculated.
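
As a rough numerical check on this similarity, the following Python sketch compares the GCDF with a logistic sigmoid whose gain is chosen so that the two curves have equal slopes at the origin; the scale factor 4/√(2π) is simply this slope-matching choice made for the sketch, not a quantity used elsewhere in the analysis.

    # Numerical comparison of the GCDF nonlinearity with a gain-matched
    # logistic sigmoid (illustrative values only).
    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-6.0, 6.0, 1201)

    gcdf = norm.cdf(x)                       # Gaussian CDF nonlinearity
    c = 4.0 / np.sqrt(2.0 * np.pi)           # equal slopes at the origin
    logistic = 1.0 / (1.0 + np.exp(-c * x))  # gain-matched logistic sigmoid

    print("max |GCDF - logistic| =", np.max(np.abs(gcdf - logistic)))
    # The maximum difference is on the order of 1e-2, and the GCDF flattens
    # toward its asymptotes more quickly than the logistic (flatter tails).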

The GCDF nonlinearity is used here because it has a convenient shape invariance property under convolution with a Gaussian input noise density. There may be other nonlinearities that, although lacking this shape invariance property, are such that their expected response can still be calculated efficiently by a similar approach. If, for example, the convolution g(x) ∗ pₙ(x) = h(x), the function h(x) may be different in form from g(x), but still reasonably easy to calculate. As a specific example, if g(x) is a unit step function and pₙ(x) is uniform on [−α, α] (both in one dimension), then h(x) is a semilinear ramp function: 0 for x < −α, (x + α)/(2α) for −α ≤ x ≤ α, and 1 for x > α. The expected network response can then be computed as a linear sum of h(x) nonlinearities rather than a linear sum of g(x) nonlinearities. Different nonlinearities must be used to calculate the normal and expected responses, but this is still much faster than averaging over many presentations of noisy samples.
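
The following Python sketch illustrates this in one dimension: it checks that averaging a unit step over uniform noise on [−α, α] reproduces the ramp h(x), and that the expected output of a toy single-hidden-layer network of step units with a linear output can be obtained by substituting an appropriately scaled ramp for each step unit rather than averaging noisy presentations. The weights, biases, and α below are arbitrary values chosen only for illustration.

    # One-dimensional check of the step/uniform example and of computing the
    # expected network response with ramp units (illustrative values only).
    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.5                        # half-width of the uniform input noise

    def g(x):                          # unit step nonlinearity
        return (x >= 0.0).astype(float)

    def h(u, half_width):              # step averaged over uniform noise: ramp
        return np.clip((u + half_width) / (2.0 * half_width), 0.0, 1.0)

    # Check that E_n[g(x + n)] matches h(x, alpha) at a few points.
    xs = np.array([-1.0, -0.25, 0.0, 0.4, 1.0])
    noise = rng.uniform(-alpha, alpha, size=(200_000, 1))
    print(np.mean(g(xs + noise), axis=0))      # Monte Carlo average
    print(h(xs, alpha))                        # ramp function

    # Toy single-hidden-layer network y(x) = sum_i a_i g(w_i x + b_i).
    w = np.array([ 2.0, -1.0,  0.5])
    b = np.array([-0.5,  0.3,  1.0])
    a = np.array([ 1.0,  2.0, -1.5])
    x0 = 0.7                                   # clean input

    # Expected output by averaging over noisy copies of the input ...
    n = rng.uniform(-alpha, alpha, size=100_000)
    pre = w[None, :] * (x0 + n)[:, None] + b[None, :]
    mc = np.mean(g(pre) @ a)

    # ... and by replacing each step with a ramp.  Unit i sees uniform noise
    # of half-width |w_i| * alpha on its net input, so alpha must be scaled.
    closed_form = a @ h(w * x0 + b, np.abs(w) * alpha)

    print(mc, closed_form)                     # the two should agree closely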

The scaling results can also be applied to radial basis functions [271], [272], [300], which generally use Gaussian PDF hidden units and a linear output summation. The convolution of two spherical Gaussian PDFs with variances σ₁² and σ₂² produces a third Gaussian PDF with variance σ₃² = σ₁² + σ₂², so the expected response of these networks to noise is easily calculated using similar shape-invariant scaling.
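
A one-dimensional Python sketch of this property follows: the response of a single Gaussian PDF unit of variance σ₁², averaged over Gaussian input noise of variance σ₂², is compared with the Gaussian PDF of variance σ₁² + σ₂² evaluated at the clean input. The center and variances are arbitrary illustrative values.

    # Monte Carlo check that a Gaussian PDF hidden unit driven by Gaussian
    # input noise responds, on average, like a Gaussian PDF whose variance is
    # the sum of the two variances (illustrative values only).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    c, sigma1, sigma2 = 0.3, 0.8, 0.5      # unit center, unit width, noise std

    def rbf_unit(x):                       # Gaussian PDF hidden unit
        return norm.pdf(x, loc=c, scale=sigma1)

    x0 = 1.2                               # clean input
    noise = rng.normal(0.0, sigma2, size=500_000)

    mc = np.mean(rbf_unit(x0 + noise))                 # E_n[g(x0 + n)]
    sigma3 = np.sqrt(sigma1**2 + sigma2**2)
    closed_form = norm.pdf(x0, loc=c, scale=sigma3)    # widened Gaussian PDF

    print(mc, closed_form)                 # the two should agree closely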