
Chapter 15 - Generalization Prediction and Assessment

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

15.2 The Bayesian Approach

Bayesian methods provide ways to describe the effects of biases, sampling distributions, noise, and other uncertainties. The Bayesian approach incorporates external knowledge (or biases) about the target function in the form of prior probabilities of different hypothesis functions [251], [253]. The data set D = {(x_i, t_i), i = 1, ..., m}, where the x_i are the inputs and the t_i the targets, is typically modeled as the sum of a deterministic function f and additive perturbations (noise) n representing prediction errors

$$t_i = f(x_i) + n_i \qquad (15.1)$$
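As a concrete illustration of this data model, the following sketch draws a small training set with Gaussian noise (the target f, noise level σ, and sample size are all invented for the example):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Hypothetical target function; stands in for the unknown f of (15.1)
        return np.sin(2 * np.pi * x)

    m, sigma = 20, 0.1                         # assumed sample size and noise level
    x = rng.uniform(0.0, 1.0, size=m)          # inputs x_i
    t = f(x) + rng.normal(0.0, sigma, size=m)  # targets t_i = f(x_i) + n_i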

Assuming the network output y(x) is correct, the probability of the observed data is the probability that the error t_i - y(x_i) is due entirely to the noise

$$P[t_i \mid f = y] = P_n\left(t_i - y(x_i)\right) \qquad (15.2)$$

where P_n denotes the probability density of the noise.

If the training cases are independent and the noise is independent and identically distributed, the probability of the entire training set given the assumption f = y is

$$P[D \mid f = y] = \prod_{i=1}^{m} P_n\left(t_i - y(x_i)\right) \qquad (15.3)$$

If the noise is assumed to be Gaussian N(0, σ), then

$$P[D \mid f = y] = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(t_i - y(x_i))^2}{2\sigma^2}\right) \qquad (15.4)$$

$$= \left(2\pi\sigma^2\right)^{-m/2} \exp\left(-\frac{E}{2\sigma^2}\right) \qquad (15.5)$$

where $E = \sum_{i=1}^{m} (t_i - y(x_i))^2$ is the usual sum of squared errors. Minimization of the mean squared error is thus equivalent to selection of a maximum likelihood model under the assumption that the errors are Gaussian. Other error functions are appropriate under different assumptions about the error distribution; a number are reviewed by Rumelhart et al. [328].
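The equivalence can be checked numerically: the negative logarithm of (15.5) is E/(2σ²) plus a term that does not depend on the model, so any y that minimizes E also maximizes the likelihood. A minimal sketch (targets, outputs, and σ invented for the example):

    import numpy as np

    def neg_log_likelihood(t, y, sigma):
        # -log of (15.5): (m/2) log(2*pi*sigma^2) + E / (2*sigma^2)
        m = len(t)
        E = np.sum((t - y) ** 2)             # sum of squared errors
        return 0.5 * m * np.log(2 * np.pi * sigma ** 2) + E / (2 * sigma ** 2)

    t = np.array([0.9, 0.1, 0.5])            # targets (made up)
    y = np.array([1.0, 0.0, 0.4])            # network outputs (made up)
    sigma = 0.2

    E = np.sum((t - y) ** 2)
    const = 0.5 * len(t) * np.log(2 * np.pi * sigma ** 2)  # model-independent
    assert np.isclose(neg_log_likelihood(t, y, sigma), E / (2 * sigma ** 2) + const)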

By Bayes' rule, the evidence for a model y(x) given the data is

$$P[f = y \mid D] = \frac{P[D \mid f = y]\; P[f = y]}{P[D]} \qquad (15.6)$$

External constraints (such as a bias toward smooth solutions) are reflected in the choice of model prior probabilities P[f = y]. The denominator P[D] is the same for all models and can be ignored in comparing models.
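Because the denominator cancels, candidate models can be ranked by the unnormalized product of likelihood and prior. A toy numerical comparison (both probabilities invented for the example):

    # Rank two candidate models by likelihood * prior; the shared
    # denominator P[D] never needs to be computed.
    candidates = {
        "smooth fit": {"likelihood": 0.02, "prior": 0.8},   # made-up values
        "wiggly fit": {"likelihood": 0.05, "prior": 0.1},
    }
    scores = {name: c["likelihood"] * c["prior"] for name, c in candidates.items()}
    print(max(scores, key=scores.get))   # "smooth fit": the prior outweighs
                                         # the wiggly fit's higher likelihood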

Different model configurations can be compared by decomposing y into a choice of weights w and a network architecture H. (H specifies the number of layers, number of nodes, and so on, and w specifies a set of weights in the given architecture.) The probability that a given set of weights is the correct choice given the data and the model H is

$$P[w \mid D, H] = \frac{P[D \mid w, H]\; P[w \mid H]}{P[D \mid H]} \qquad (15.7)$$

(This is different from the probability that some learning algorithm will produce a particular set of weights.) For a given H, the prior P[w | H] can reflect a bias in favor of small weight values, for example. The probability of different models H_i is given by [251]

$$P[H_i \mid D] = \frac{P[D \mid H_i]\; P[H_i]}{P[D]} \qquad (15.8)$$

The priors P[H_i] can reflect a bias in favor of models with small numbers of parameters, for example.
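To see how a small-weights prior acts in practice, note that a Gaussian prior P[w | H] ∝ exp(-α‖w‖²/2) adds a quadratic penalty to the negative log posterior, which is exactly weight decay. A minimal sketch, assuming a linear model for brevity (α, σ, and the model are illustrative, not from the text):

    import numpy as np

    def neg_log_posterior(w, x, t, sigma, alpha):
        # -log P[w | D, H] up to constants: E/(2*sigma^2) + (alpha/2) ||w||^2.
        # Hypothetical linear model y = w[0] + w[1] * x.
        y = w[0] + w[1] * x
        E = np.sum((t - y) ** 2)                   # data misfit, from (15.5)
        return E / (2 * sigma ** 2) + 0.5 * alpha * np.sum(w ** 2)

Minimizing this quantity is least squares plus weight decay; the prior term trades data fit against weight magnitude.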

If all prior probabilities P[H_i] are approximately equal, then the models can be compared based on the evidence [251]

$$P[D \mid H_i] = \int P[D \mid w, H_i]\; P[w \mid H_i]\; dw \qquad (15.9)$$

If w is k-dimensional and if the posterior distribution is approximately Gaussian, then [251], [253]

$$P[D \mid H_i] \approx P[D \mid w_{mp}, H_i]\; P[w_{mp} \mid H_i]\; (2\pi)^{k/2} \left(\det A\right)^{-1/2} \qquad (15.10)$$

where w_mp is the maximum likelihood set of weights found by minimizing E and A = -∇∇ log P[w | D, H_i] is the Hessian of E with respect to w evaluated at w_mp. It has been argued [251], [253] that this approach has a built-in bias for simple models because the Occam factor P[w_mp | H_i] (2π)^{k/2} (det A)^{-1/2} is smaller for more complex models.
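For a one-parameter model (k = 1), both the evidence integral (15.9) and its Gaussian approximation (15.10) are easy to compute, which makes the quality of the approximation visible. A sketch under assumed data, noise level, and prior (all values invented; the model y(x) = wx is hypothetical):

    import numpy as np

    x = np.array([0.1, 0.4, 0.7, 1.0])           # made-up inputs
    t = np.array([0.2, 0.5, 0.9, 1.1])           # made-up targets
    sigma, alpha = 0.1, 1.0                      # assumed noise level and prior scale

    def likelihood(w):                           # P[D | w, H], as in (15.4)
        E = np.sum((t - w * x) ** 2)
        return (2 * np.pi * sigma ** 2) ** (-len(t) / 2) * np.exp(-E / (2 * sigma ** 2))

    def prior(w):                                # Gaussian P[w | H], favors small w
        return np.sqrt(alpha / (2 * np.pi)) * np.exp(-0.5 * alpha * w ** 2)

    # Evidence (15.9) by brute-force quadrature over a grid of w values.
    w_grid = np.linspace(-5.0, 5.0, 20001)
    integrand = np.array([likelihood(w) * prior(w) for w in w_grid])
    evidence_exact = np.sum(integrand) * (w_grid[1] - w_grid[0])

    # Laplace approximation (15.10): locate w_mp, estimate A by finite differences.
    w_mp = w_grid[np.argmax(integrand)]
    h = 1e-4

    def log_post(w):
        return np.log(likelihood(w) * prior(w))

    A = -(log_post(w_mp + h) - 2 * log_post(w_mp) + log_post(w_mp - h)) / h ** 2
    evidence_laplace = likelihood(w_mp) * prior(w_mp) * np.sqrt(2 * np.pi / A)

    print(evidence_exact, evidence_laplace)

Here the two numbers agree closely because the posterior is exactly Gaussian (linear model, Gaussian noise and prior); for a nonlinear network the approximation holds only near w_mp.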

Remarks Perhaps in part because of its widespread success, the Bayesian approach has attracted criticism. From the viewpoint of the prediction system, approximation errors are random and unpredictable (otherwise the system would be able to eliminate them), so errors are treated like noise and usually assumed to be independent and identically distributed. All network functions and most real target functions have structure, however, so errors may not be independent. The errors are often assumed to have some tractable distribution such as the Gaussian (justified by appeal to the central limit theorem), but approximations are often made that hold only for large sample sizes. A common criticism of Bayesian approaches in general is that the prior probabilities may be subjective (i.e., biases rather than measured probabilities). This is not a major problem in cases where all the probabilities can be measured, or when the analysis is used for qualitative understanding, but it may be a problem in quantitative prediction. Many of these criticisms are objections to the way the theory is applied rather than to defects of the theory itself.