Regularization is another method often used to improve generalization. In regularization, one often assumes that the target function is smooth and that small changes in the input do not cause large changes in the output. Poggio and Girosi [300], for example, suggest minimizing the cost function

{(t − y(x))²} + λ ||Py||²     (17.9)

where P is usually a differential operator and λ balances the trade-off between smoothing and minimizing the error.
Jittering the inputs while keeping the target fixed embodies this smoothness assumption and results in a similar cost function. That is, we add small amounts of noise to the input data, assume that the target function does not change much, and minimize

ε = {⟨(t − y(x + n))²⟩}     (17.11)

where {u} indicates the expected value of u over the training patterns and ⟨u⟩ indicates the expected value of u over the noise n. For small-magnitude noise, ||n|| ≈ 0, the network output can be approximated by the linear terms of a truncated Taylor series expansion

y(x + n) ≈ y(x) + nᵀg

where g = ∂y/∂x is the gradient of the output with respect to the input. (A second-order approximation is given in appendix C.1.)
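The first-order approximation is easy to check numerically. Below is a minimal numpy sketch (the toy function, weight values, and noise scale are illustrative, not from the text): for a scalar output y(x) = tanh(wᵀx), the jittered output agrees with y(x) + nᵀg up to terms of order ||n||².

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "network output": y(x) = tanh(w . x) for a fixed weight vector w.
w = np.array([0.8, -0.5, 0.3])
y = lambda x: np.tanh(w @ x)

x = np.array([0.2, 1.0, -0.4])
g = (1.0 - y(x) ** 2) * w          # gradient dy/dx of tanh(w . x)

# Compare the exact jittered output with the first-order Taylor estimate
# y(x + n) ~= y(x) + n . g for small-magnitude noise n.
n = 0.01 * rng.standard_normal(3)
exact = y(x + n)
taylor = y(x) + n @ g
print(abs(exact - taylor))         # tiny, O(||n||^2)
```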
Substitution into equation 17.11 and dropping the independent variable for brevity gives

ε = {⟨(t − y − nᵀg)²⟩} = {(t − y)²} − 2{(t − y)⟨nᵀ⟩g} + {gᵀ⟨n nᵀ⟩g}.

Assume zero-mean, uncorrelated noise with equal variances, ⟨n⟩ = 0 and ⟨n nᵀ⟩ = σ²I. Then

ε = {(t − y)²} + σ²{||g||²}.     (17.15)
The term {(t − y)²} = E is the conventional unregularized error function and the term {||g||²} is the squared magnitude of the gradient of y(x) averaged over the training points.
The quantity ε is an approximation to the regularized error function in equation 17.9. Like equation 17.9, it introduces a term which encourages smooth solutions [384], [42]. Comparison of equations 17.15 and 17.9 shows that σ² plays a role similar to λ in the regularization equation, balancing smoothness and error minimization. They differ in that training with jitter minimizes the gradient term only at the training points, whereas regularization usually seeks to minimize it for all x.
Equation 17.15 shows that, when it can do so without increasing the conventional error, the system minimizes sensitivity to input noise by reducing the magnitude of the gradient of the transfer function at the training points. A similar result is derived in [260] and, by analogy with the ridge estimate method of linear regression, in [259]. A system that explicitly calculates and back-propagates similar terms in a multilayer perceptron is described by Drucker and Le Cun [112]. A more general approach using the Hessian information is described by Bishop [40], [42], [43]. A stronger result equating training with jitter and Tikhonov regularization is reported in [45].
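The ridge-regression analogy can be made concrete: for a linear model y = xᵀw the input gradient g is simply w, so the penalty σ²{||g||²} reduces to the ridge term σ²||w||². The numpy sketch below (data, dimensions, and noise level chosen for illustration) shows that ordinary least squares fit to many jittered copies of the inputs approaches the closed-form ridge solution with λ = σ².

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear model y = x . w: the input gradient g equals w, so the jitter
# penalty sigma^2 {||g||^2} is exactly a ridge term sigma^2 ||w||^2.
X = rng.standard_normal((100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

sigma = 0.3
npat = len(X)

# Closed-form minimizer of {(t - y)^2} + sigma^2 ||w||^2 (ridge, lambda = sigma^2).
w_ridge = np.linalg.solve(X.T @ X / npat + sigma ** 2 * np.eye(3),
                          X.T @ t / npat)

# Ordinary least squares on many jittered copies of the inputs, targets fixed.
Xj = np.concatenate([X + sigma * rng.standard_normal(X.shape)
                     for _ in range(500)])
tj = np.tile(t, 500)
w_jitter = np.linalg.lstsq(Xj, tj, rcond=None)[0]

print(w_ridge)
print(w_jitter)   # close to w_ridge
```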
Figure 17.3 illustrates the smoothing effect of training with input jitter. Figure 17.3(a) shows the decision boundary formed by an intentionally overtrained 2/50/10/1 feedforward