Regularization is another method often used to improve generalization. In regularization, one often assumes that the target function is smooth and that small changes in the input do not cause large changes in the output. Poggio and Girosi [300], for example, suggest minimizing the cost function

{(t − y(x))²} + λ ||Py||²     (17.9)

where P is usually a differential operator and λ balances the trade-off between smoothing and minimizing the error.
Jittering the inputs while keeping the target fixed embodies this smoothness assumption and results in a similar cost function. That is, we add small amounts of noise to the input data, assume that the target function does not change much, and minimize

ε = {⟨(t − y(x + n))²⟩}     (17.11)

where {u} indicates the expected value of u over the training patterns and ⟨u⟩ indicates the expected value of u over the noise n. For small-magnitude noise, ||n|| ≈ 0, the network output can be approximated by the linear terms of a truncated Taylor series expansion

y(x + n) ≈ y(x) + nᵀg

where g = ∂y/∂x is the gradient of the output with respect to the input. (A second-order approximation is given in appendix C.1.)
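The first-order approximation is easy to check numerically. Below is a minimal numpy sketch (the toy function, weight values, and noise scale are illustrative, not from the text): for a scalar output y(x) = tanh(wᵀx), the jittered output agrees with y(x) + nᵀg up to terms of order ||n||².

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "network output": y(x) = tanh(w . x) for a fixed weight vector w.
w = np.array([0.8, -0.5, 0.3])
y = lambda x: np.tanh(w @ x)

x = np.array([0.2, 1.0, -0.4])
g = (1.0 - y(x) ** 2) * w          # gradient dy/dx of tanh(w . x)

# Compare the exact jittered output with the first-order Taylor estimate
# y(x + n) ~= y(x) + n . g for small-magnitude noise n.
n = 0.01 * rng.standard_normal(3)
exact = y(x + n)
taylor = y(x) + n @ g
print(abs(exact - taylor))         # tiny, O(||n||^2)
```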
Substitution into equation 17.11 and dropping the independent variable for brevity gives

ε = {⟨(t − y − nᵀg)²⟩} = {(t − y)²} − 2{(t − y)⟨nᵀ⟩g} + {gᵀ⟨n nᵀ⟩g}.

Assume zero-mean, uncorrelated noise with equal variances, ⟨n⟩ = 0 and ⟨n nᵀ⟩ = σ²I. Then

ε = {(t − y)²} + σ²{||g||²}.     (17.15)
The term {(t − y)²} = E is the conventional unregularized error function and the term {||g||²} is the squared magnitude of the gradient of y(x) averaged over the training points.
The quantity ε is an approximation to the regularized error function in equation 17.9. Like equation 17.9, it introduces a term which encourages smooth solutions [384], [42]. Comparison of equations 17.15 and 17.9 shows that σ² plays a role similar to λ in the regularization equation, balancing smoothness and error minimization. They differ in that training with jitter minimizes the gradient term only at the training points, whereas regularization usually seeks to minimize it for all x.
Equation 17.15 shows that, when it can do so without increasing the conventional error, the system minimizes sensitivity to input noise by reducing the magnitude of the gradient of the transfer function at the training points. A similar result is derived in [260] and, by analogy with the ridge estimate method of linear regression, in [259]. A system that explicitly calculates and back-propagates similar terms in a multilayer perceptron is described by Drucker and Le Cun [112]. A more general approach using the Hessian information is described by Bishop [40], [42], [43]. A stronger result equating training with jitter and Tikhonov regularization is reported in [45].
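The ridge-regression analogy can be made concrete: for a linear model y = xᵀw the input gradient g is simply w, so the penalty σ²{||g||²} reduces to the ridge term σ²||w||². The numpy sketch below (data, dimensions, and noise level chosen for illustration) shows that ordinary least squares fit to many jittered copies of the inputs approaches the closed-form ridge solution with λ = σ².

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear model y = x . w: the input gradient g equals w, so the jitter
# penalty sigma^2 {||g||^2} is exactly a ridge term sigma^2 ||w||^2.
X = rng.standard_normal((100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

sigma = 0.3
npat = len(X)

# Closed-form minimizer of {(t - y)^2} + sigma^2 ||w||^2 (ridge, lambda = sigma^2).
w_ridge = np.linalg.solve(X.T @ X / npat + sigma ** 2 * np.eye(3),
                          X.T @ t / npat)

# Ordinary least squares on many jittered copies of the inputs, targets fixed.
Xj = np.concatenate([X + sigma * rng.standard_normal(X.shape)
                     for _ in range(500)])
tj = np.tile(t, 500)
w_jitter = np.linalg.lstsq(Xj, tj, rcond=None)[0]

print(w_ridge)
print(w_jitter)   # close to w_ridge
```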
Figure 17.3 illustrates the smoothing effect of training with input jitter. Figure 17.3(a) shows the decision boundary formed by an intentionally overtrained 2/50/10/1 feedforward