
Chapter 14 - Factors Influencing Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

14.1 Definitions

How is it possible to derive general rules from specific cases? Much of science is the search for simple rules to explain observed events. Generalization is an ancient philosophical problem that has been studied from many angles.

In the context of artificial neural networks and supervised learning, generalization is often viewed as an interpolation or approximation problem. That is, the examples are seen as points in a space and the goal is to find a function that interpolates between them in a reasonable way. If the data are noisy or uncertain, constraints may be relaxed to require only that the surface be 'near' the training points.
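The interpolation view can be made concrete with a small sketch. In the following (hypothetical) one-dimensional illustration, five noisy samples of an assumed target function are fit two ways: a degree-4 polynomial interpolates all five points exactly, while a lower-degree polynomial is only required to pass 'near' them, as the relaxed-constraint case above describes.

```python
import numpy as np

# Hypothetical 1-D illustration of the interpolation view of generalization.
# The target function (a sine) and noise level are assumptions for this sketch.
rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 5)
t = np.sin(x) + rng.normal(scale=0.05, size=x.shape)

# Exact interpolation: a degree-4 polynomial passes through all 5 points.
exact = np.polyfit(x, t, deg=4)

# Relaxed fit: a degree-2 polynomial is only required to stay 'near' the
# noisy training points.
relaxed = np.polyfit(x, t, deg=2)

# The exact interpolant has (near-)zero training error; the relaxed fit
# trades training error for a smoother surface.
err_exact = np.mean((np.polyval(exact, x) - t) ** 2)
err_relaxed = np.mean((np.polyval(relaxed, x) - t) ** 2)
```

When the data are noisy, the relaxed fit is usually preferable: the exact interpolant reproduces the noise as well as the signal.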

Other definitions of generalization are also possible. Generalization can be said to occur in associative memories or clustering systems that learn ideal class prototypes after training on specific instances of each class. (In some applications, this may be considered a defect because details of the individual patterns are forgotten.) The interpolation viewpoint also ignores the perhaps more realistic case of reinforcement learning where data occurs in the form of input-response-consequence triplets and there is no teacher or single-valued target function.

The following definition of generalization is used below. The training data consist of examples of the desired input-output relationship, {(x_k, t_k)}, k = 1, ..., M, where M is the number of training samples, x_k is the kth input pattern, and t_k is the corresponding desired output, or target. Usually t_k is considered to be a value generated by some unknown target function f(x_k) or a sample drawn from an unknown joint distribution p(t, x). For input x_k the network produces an output y_k = y(x_k) in response. Differences between the network output and the desired target are measured by some error function E, often the mean squared error (MSE)

E = E[(t - y(x))^2]    (14.1)

where E [.] denotes expectation. Usually, the true error cannot be measured exactly. When only samples are available, an approximation based on the training set error

E_train = (1/M) * sum_{k=1}^{M} (t_k - y_k)^2    (14.2)

is often used. Training consists of selecting a set of parameters that (hopefully) minimize the error on all future tests.
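The training-set error of equation 14.2 can be sketched directly. The helper below works for any predictor y; the linear model and toy data are assumptions for illustration only.

```python
import numpy as np

def training_mse(y, X, T):
    """Training-set error of Eq. 14.2: (1/M) * sum_k (t_k - y(x_k))^2."""
    Y = np.array([y(x) for x in X])
    return np.mean((T - Y) ** 2)

# Toy data generated by a known linear rule (an assumption for this sketch).
X = np.array([0.0, 1.0, 2.0, 3.0])
T = 2.0 * X + 1.0

# A predictor that happens to match the rule exactly.
y = lambda x: 2.0 * x + 1.0

mse = training_mse(y, X, T)   # zero training error for this predictor
```

A zero value of equation 14.2, of course, says nothing by itself about the true error of equation 14.1, which is the point of the paragraphs that follow.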

With no restrictions on the learning system, we can always find a function that fits any finite data set exactly. If nothing else, we can simply store the training patterns in a look-up table. The problem is that although the look-up table "learns" the training data perfectly, it cannot cope with novel patterns. What we really want is for the system to generalize from the training examples to the underlying target function so it produces correct (or at least reasonable) outputs in response to new patterns that have not been seen before. A system that learns the training data and also does well on new data is said to generalize well. It fails to generalize when it performs poorly on new data.
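The look-up table mentioned above is easy to realize. The sketch below memorizes a few training pairs of an assumed toy target (squaring); it "learns" the training data perfectly yet produces nothing useful for a novel input.

```python
# A look-up table "learner": perfect recall of the training data,
# but no response at all to patterns it has never seen.
# The squaring target is an assumption for illustration.
train = [(1, 1), (2, 4), (3, 9)]

table = {}
for x, t in train:
    table[x] = t                    # memorize each training pair verbatim

def lookup(x):
    return table.get(x)             # returns None for novel inputs

train_outputs = [lookup(x) for x, _ in train]   # exact recall
novel_output = lookup(4)                        # novel pattern: no answer
```

A system that generalizes would instead infer the underlying rule and return a reasonable output (here, 16) for the novel input.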