
Chapter 17 - Effects of Training with Noisy Inputs

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

17.1 Convolution Property of Training with Jitter

Consider a network trained with noisy input data, {x + n, t(x)}, where n is noise that varies with each presentation. During training, the network sees the clean target t(x) in conjunction with the noisy input x̃ = x + n. The input x̃ seen by the network may be produced by various combinations of inputs x and noises n, while the target depends only on x. Different targets may therefore be associated with the same noisy input x̃. The network, however, can produce only a single output for any given input. For arbitrary noise and input sampling distributions, the effective target for a given input x̃ is the expected value of the target given the noisy input

E[t \mid \tilde{x}] = \frac{\int t(x)\, p_x(x)\, p_n(\tilde{x} - x)\, dx}{\int p_x(x)\, p_n(\tilde{x} - x)\, dx} \qquad (17.1)

In the special case where the distribution p_x of the training inputs is uniform and the standard deviation of the noise is small relative to the extent of the input domain, the interaction between p_x and p_n will have little effect in the interior of the domain. In regions where these boundary effects can be ignored, the p_x terms cancel, the denominator integrates to one, and (17.1) simplifies to the approximation

E[t \mid \tilde{x}] \approx \int t(x)\, p_n(\tilde{x} - x)\, dx = t(\tilde{x}) * p_n(\tilde{x}) \qquad (17.2)
Figure 17.1: Convolution tends to be a smoothing operation. A step function, t(x), convolved with a Gaussian, p_n(x), produces the Gaussian cumulative distribution. This resembles the original step function, but it is a smooth function similar to the sigmoid.

Thus, in this special case, the effective target when training with jittered input data is approximately equal to the convolution of the original target t(x) and the noise density p_n(x).

Convolution tends to be a smoothing operation in general. If, for example, a step function, t(x), is convolved with a Gaussian, p_n(x), the result is the Gaussian cumulative distribution, which is a smoothed step function similar to the sigmoid (figure 17.1). This convolution property of jittered sampling is described by Marks [257].
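As a rough numerical illustration of equation 17.2 (not part of the original text), the short Python sketch below convolves a step-function target with a Gaussian noise density; the grid spacing, the noise width, and the use of NumPy's convolve are illustrative choices.

import numpy as np

# Illustrative sketch: the effective target under input jitter is
# approximately the convolution of the clean target t(x) with the
# noise density p_n (equation 17.2).
x = np.arange(-5.0, 5.0, 0.01)
t = np.where(x < 0.0, 0.0, 1.0)          # step-function target t(x)

sigma = 0.5                              # assumed noise standard deviation
p_n = np.exp(-x**2 / (2 * sigma**2))     # Gaussian noise density p_n(x)
p_n /= p_n.sum()                         # normalize on the discrete grid

# Discrete convolution t * p_n; "same" keeps the original grid length.
t_eff = np.convolve(t, p_n, mode="same")

# t_eff is a smoothed step (the Gaussian cumulative distribution),
# closely resembling a sigmoid centered at x = 0, as in figure 17.1.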

Holmstrom and Koistinen [174], [214], [215] showed that training with jitter is consistent in that, under appropriate conditions, the resulting error function approaches the true error function as the number of training samples increases and the amount of added input noise decreases.

17.1.1 Effective Target for Sampled Data

The convolution property holds when training data are continuously and uniformly distributed over the entire input space and the magnitude of the noise is small. In practice, however, we usually have only a finite number of discrete samples {(x_i, t_i)} of the underlying function, and in general the samples are not uniformly distributed. In this case, the distribution of x̃ = x_i + n is not uniform and the optimum output function is modified.

Let the training set be {(x_i, t_i) | i = 1, ..., M}. During training, we randomly select one of the training pairs with equal probability, add noise to the input vector, and apply it to the network. Given that the selected training point is x_k, the probability density of the noisy input x̃ is

p(\tilde{x} \mid x_k) = p_n(\tilde{x} - x_k).

Training points are selected from the training set with equal probabilities P[x = x_k] = 1/M, so the probability density of the input seen by the network, x̃, is

p_{\tilde{x}}(\tilde{x}) = \sum_{k=1}^{M} P[x = x_k]\, p_n(\tilde{x} - x_k) = \frac{1}{M} \sum_{k=1}^{M} p_n(\tilde{x} - x_k) \qquad (17.3)

Given that a particular noisy input x̃ is observed, the probability that it was generated by training point x_k plus noise follows from Bayes' rule

P[x_k \mid \tilde{x}] = \frac{p_n(\tilde{x} - x_k)\, P[x = x_k]}{p_{\tilde{x}}(\tilde{x})} = \frac{p_n(\tilde{x} - x_k)}{\sum_{j=1}^{M} p_n(\tilde{x} - x_j)} \qquad (17.4)

Let P_k denote this probability.

The expected value of the training target, given the noisy training input x̃, is then

E[t \mid \tilde{x}] = \sum_{k=1}^{M} t_k P_k \qquad (17.5)

This is the expected value of the training target given that the input is a noisy version of one of the training samples. As the number of samples approaches infinity, the distribution of the samples approaches p_x and equation 17.5 becomes a good approximation to equation 17.1.
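As a concrete sketch (ours, not the book's), the Python function below evaluates the posterior weights P_k of equation 17.4 and the effective target of equation 17.5 for a finite training set, assuming isotropic Gaussian input noise; the function name, interface, and example values are illustrative assumptions.

import numpy as np

def effective_target(x_tilde, X, T, sigma):
    """Expected target E[t | x_tilde] of equation 17.5 for a finite
    training set under isotropic Gaussian input noise (illustrative
    sketch; the interface is an assumption, not the book's).

    X : (M, d) array of training inputs x_k
    T : (M,)   array of training targets t_k
    """
    # Unnormalized noise density p_n(x_tilde - x_k) for each k.
    d2 = np.sum((X - x_tilde) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))

    # Equation 17.4: the uniform selection probabilities 1/M cancel,
    # leaving P_k = p_n(x_tilde - x_k) / sum_j p_n(x_tilde - x_j).
    P = w / w.sum()

    # Equation 17.5: expected target given the noisy input.
    return np.dot(P, T)

# Example with two classes labeled +1 / -1, as in figure 17.2.
X = np.array([[0.2, 0.3], [0.7, 0.8], [0.4, 0.9]])
T = np.array([1.0, -1.0, -1.0])
print(effective_target(np.array([0.5, 0.5]), X, T, sigma=0.1))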

Let y(x̃) be the network output for the input x̃. The expected value of the error while training, given this input, is

E\big[(t - y(\tilde{x}))^2 \mid \tilde{x}\big] = \sum_{k=1}^{M} \big(t_k - y(\tilde{x})\big)^2 P_k \qquad (17.6)

Abbreviate y(x̃) by y. After expanding the square,

E\big[(t - y)^2 \mid \tilde{x}\big] = \sum_{k=1}^{M} t_k^2 P_k - 2y \sum_{k=1}^{M} t_k P_k + y^2 \qquad (17.7)

In other words, under gradient descent the system acts as if the target function were \sum_k t_k P_k, the expected value of the target in equation 17.5, with P_k as defined in equation 17.4. This is a well-known result in optimal least squares estimation: the function that minimizes the sum-of-squares error is the expected value of the target given the input. From equation 17.7, the effective error function is

\Big( y(\tilde{x}) - \sum_{k=1}^{M} t_k P_k \Big)^2 \qquad (17.8)

in the sense that the two error functions have the same derivative with respect to the output,

\frac{\partial}{\partial y}\, E\big[(t - y)^2 \mid \tilde{x}\big] = \frac{\partial}{\partial y} \Big( y - \sum_{k=1}^{M} t_k P_k \Big)^2 = 2\Big( y - \sum_{k=1}^{M} t_k P_k \Big).
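This equivalence can also be checked numerically. The brief Python sketch below (with arbitrary illustrative values, not taken from the book) verifies that the derivative with respect to y of the expected error \sum_k (t_k - y)^2 P_k equals the derivative of the squared difference between y and the effective target \sum_k t_k P_k.

import numpy as np

# Numerical check that equations 17.6 and 17.8 have the same derivative
# with respect to the network output y.  T, P, and y are arbitrary
# illustrative values; P must be a probability vector.
rng = np.random.default_rng(0)
T = rng.normal(size=5)                  # targets t_k
P = rng.random(5); P /= P.sum()         # posterior weights P_k (sum to one)
y = 0.3                                 # network output y(x_tilde)

# Derivative of  sum_k (t_k - y)^2 P_k  with respect to y.
grad_expected = np.sum(-2.0 * (T - y) * P)

# Derivative of  (y - sum_k t_k P_k)^2  with respect to y.
grad_effective = 2.0 * (y - np.dot(T, P))

# The gradients coincide, so gradient descent on the jittered error acts
# as if the target were the effective target sum_k t_k P_k.
assert np.isclose(grad_expected, grad_effective)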

In contrast to conventional training where the target is defined only at the training points, the effective target when training with jittered inputs is a function defined for all inputs x.

Figure 17.2 illustrates the point. Figure 17.2(a) shows the Voronoi map of a set of points in two dimensions, the basis for a nearest neighbor classifier. Figure 17.2(b) plots expression (17.5), the expected target given the noisy input. Figure 17.2(c) shows the convolution of the sampled target function with a Gaussian function; the convolution smooths the nearest neighbor decision surface and removes small features. Note that the zero contours in figures 17.2(b) and 17.2(c) coincide.

Figure 17.2: A nearest neighbor classification problem: (a) the Voronoi map for 14 points; (b) the expected value of the classification given the noisy input, as calculated in (17.5) for a Gaussian noise distribution with σ = 0.1; and (c) the convolution of the training set with the same Gaussian noise. The zero contours of (b) and (c) coincide. (1s and 0s indicate the classes; values 1 and -1 were actually used.)