17.5 Remarks
Training with jitter, error regularization, gain scaling,
and weight decay are all methods that have been proposed to improve
generalization. Training with small amounts of jitter approaches the
generalization problem directly by assuming that slightly different inputs give
approximately the same output. If the noise distribution is smooth, the network
will interpolate among training points in proportion to a smooth function of the
distance to each training point.
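As a purely illustrative sketch, the fragment below shows one way training with input jitter might be coded: on each pass, Gaussian noise is added to the inputs while the targets are left unchanged. The toy data, network size, learning rate, and noise level sigma are assumptions made for the example, not values from the text.

    # Sketch: training a small tanh network with input jitter (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)

    # Sparse training set sampled from an underlying target function.
    x = np.linspace(-1.0, 1.0, 10).reshape(-1, 1)
    y = np.sin(np.pi * x)

    # One-hidden-layer tanh network; sizes and learning rate are arbitrary choices.
    W1 = rng.normal(0, 0.5, (1, 20)); b1 = np.zeros(20)
    W2 = rng.normal(0, 0.5, (20, 1)); b2 = np.zeros(1)
    lr, sigma = 0.05, 0.1            # sigma is the jitter standard deviation

    for epoch in range(5000):
        x_jit = x + rng.normal(0.0, sigma, x.shape)   # jitter inputs, keep targets
        h = np.tanh(x_jit @ W1 + b1)
        out = h @ W2 + b2
        err = out - y
        # Backpropagate the squared-error gradient and take a gradient step.
        dW2 = h.T @ err / len(x);     db2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h**2)
        dW1 = x_jit.T @ dh / len(x);  db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2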
With jitter, the effective target function is a smoothed version
of the discrete training set. If the training set describes the target function
well, the effective target approximates a smoothed version of the actual target
function. The result is similar to training with a regularized objective
function that favors smooth functions, with the noise variance playing the role
of the regularization parameter. Where regularization works by modifying the
objective function, training with jitter achieves a similar effect by modifying
the training data. In hindsight, it is not surprising that training with noisy
data approximates regularization, since noisy data are exactly the sort of
problem regularization was developed to address.
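For a linear model the correspondence can be checked numerically: with Gaussian input jitter of variance sigma^2, the expected squared error equals the noise-free squared error plus a sigma^2-weighted penalty on the weight norm, so its minimizer coincides with the ridge (Tikhonov) solution. The sketch below illustrates this linear-case relationship only; the data, noise level, and number of trials are arbitrary choices for the example.

    # Linear-case check: accumulating the normal equations over many jittered
    # copies of the inputs converges to the ridge solution with lambda = n * sigma^2
    # (summed-error convention; with mean squared error, lambda = sigma^2).
    import numpy as np

    rng = np.random.default_rng(1)
    n, d, sigma = 200, 5, 0.3
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    lam = n * sigma**2
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    A, b = np.zeros((d, d)), np.zeros(d)
    for _ in range(5000):
        Xj = X + rng.normal(0.0, sigma, X.shape)   # jittered copy of the inputs
        A += Xj.T @ Xj
        b += Xj.T @ y
    w_jitter = np.linalg.solve(A, b)

    print("ridge  :", np.round(w_ridge, 3))
    print("jitter :", np.round(w_jitter, 3))       # close to the ridge solution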
Although large networks generally learn rapidly, they tend to
generalize poorly because of insufficient constraints. Training with jitter
helps to prevent overfitting by providing additional constraints. The effective
target function is a continuous function defined over the entire input space,
whereas the original target function may be defined only at the specific
training points. This constrains the network and forces it to use any excess
degrees of freedom to approximate the smoothed target function rather than
forming an arbitrarily complex boundary that just happens to fit the original
training data (memorization). Even though the network may be large, it models a
simpler system.
The expected effect of jitter can be calculated efficiently in
some cases by a simple scaling of the node gains. This suggests the possibility
of a post-training step to choose optimum gains based on cross-validation with a
test set. This might make it possible to improve the generalization of large
networks while retaining the advantage of fast learning.
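A rough sketch of such a post-training step might look like the following: the gains of the hidden nodes are scaled by a common factor, and the factor giving the lowest error on held-out data is kept. The forward pass, the range of candidate gains, and the assumption of a single tanh hidden layer (with weights W1, b1, W2, b2 from a network such as the one sketched earlier) are all illustrative choices.

    # Sketch: choosing a common hidden-node gain after training, by validation error.
    # W1, b1, W2, b2 are assumed to come from an already-trained tanh network.
    import numpy as np

    def predict(x, W1, b1, W2, b2, gain=1.0):
        """Forward pass with every hidden-node gain scaled by the same factor."""
        h = np.tanh(gain * (x @ W1 + b1))
        return h @ W2 + b2

    def choose_gain(x_val, y_val, W1, b1, W2, b2, gains=np.linspace(0.5, 1.5, 21)):
        """Return the gain factor that minimizes squared error on the held-out set."""
        errors = [np.mean((predict(x_val, W1, b1, W2, b2, g) - y_val) ** 2)
                  for g in gains]
        return gains[int(np.argmin(errors))]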
The problem of choosing an appropriate noise variance has not
been addressed here. Holmstrom and Koistinen suggest several methods based on
cross-validation. Considerable research has been done on the problem of
selecting an appropriate λ for regularization, especially for linear models.
Because of the relationship between training with jitter and regularization, the
regularization research may be helpful in selecting an appropriate noise
level.
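One simple possibility, in the spirit of the cross-validation methods mentioned above, is to train at several candidate noise levels and keep the one with the lowest error on held-out data. In the sketch below, train_with_jitter and predict are hypothetical stand-ins for whatever training and forward-pass routines are actually used, and the candidate noise levels are arbitrary.

    # Sketch: selecting the jitter standard deviation by hold-out validation.
    # train_with_jitter and predict are placeholders for the user's own routines.
    import numpy as np

    def select_sigma(x_train, y_train, x_val, y_val, train_with_jitter, predict,
                     sigmas=(0.01, 0.03, 0.1, 0.3)):
        """Train once per candidate noise level; return the level with the
        lowest validation error."""
        best_sigma, best_err = None, np.inf
        for sigma in sigmas:
            model = train_with_jitter(x_train, y_train, sigma)
            err = np.mean((predict(model, x_val) - y_val) ** 2)
            if err < best_err:
                best_sigma, best_err = sigma, err
        return best_sigma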