Chapter 14 - Factors Influencing Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

14.6 Other Factors

Many other factors have strong effects on the difficulty of a learning task and thus on how well a system can be expected to generalize. The following lists a few items that involve higher-level decisions and generally fall outside the scope of designing a network to fit a given data set. Most are basic principles of good system design.

14.6.1 Choice of Error Function

It hardly needs to be said that the way in which errors are measured has a direct effect on the errors observed. It is generally assumed that identical measures are used for training and test errors, but it is common to choose an error function (e.g., mean squared error) because it is simple and convenient to use even though the real performance may be measured differently (misclassification rate, efficiency, etc.). Poor generalization due to training on one task and testing on another would not be surprising.

Poor performance might result because an inappropriate function is used to measure the error. From a Bayesian viewpoint, different error functions reflect different assumptions about the distribution of the model errors. The mean-squared-error function corresponds to selection of a maximum likelihood model under the assumption that the errors have a Gaussian distribution and is appropriate in ordinary linear regression where the errors are expected to cluster around zero with large errors less likely than small ones. For classification tasks with {0, 1} targets in which the network outputs are viewed as probabilities that the input belongs to a particular class, the cross-entropy error function is generally appropriate. Other error functions are appropriate under different assumptions about the error distribution.
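To make the distinction concrete, the following sketch (illustrative, not from the text; the target and output values are arbitrary) computes both error measures for a small set of {0, 1} targets. Minimizing mean squared error corresponds to maximum likelihood under a Gaussian error model, while cross-entropy treats the outputs as class probabilities:

```python
import numpy as np

def mse(y, t):
    """Mean squared error: maximum likelihood under a Gaussian error model."""
    return np.mean((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    """Cross-entropy for {0, 1} targets, reading outputs as class probabilities."""
    y = np.clip(y, eps, 1 - eps)  # guard against log(0)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1.0, 0.0, 1.0, 1.0])  # targets
y = np.array([0.9, 0.2, 0.8, 0.6])  # network outputs
print(mse(y, t), cross_entropy(y, t))
```

Note how the two measures weight the same mistakes differently: cross-entropy penalizes a confident wrong probability far more heavily than squared error does.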

14.6.2 Variable Selection

The selection of input and output variables (i.e., the choice of what information to provide, apart from how it is represented) is an extremely important factor in the difficulty of a learning problem. Certain pieces of information may make a problem very easy. The lack of crucial information may make a problem very difficult or impossible, or change it from a logical problem to a statistical problem. Of course, this is completely problem dependent and falls more in the realm of problem design than network training.

When the chosen variables do not supply needed information, identical inputs may have different targets due to differences in unsupplied variables. Noise in the input patterns can have a similar effect if it destroys necessary information and causes classes to overlap. In either case, the target is not a single-valued function and some error will always remain for any function the network chooses.
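The irreducible error left by a missing variable can be demonstrated numerically. In this sketch (an illustration, not from the text), the target depends on two binary variables but only one is supplied; the best single-valued function of the supplied input is the conditional mean, and a residual error remains no matter what the network does:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.integers(0, 2, 10000)   # supplied input
x2 = rng.integers(0, 2, 10000)   # crucial but *unsupplied* variable
t = x1 ^ x2                      # target depends on both (XOR)

# With x2 hidden, each value of x1 carries targets 0 and 1 about equally
# often.  The best single-valued function of x1 is the conditional mean
# (about 0.5 here), leaving an irreducible MSE near the conditional
# variance 0.25.
cond_mean = np.array([t[x1 == v].mean() for v in (0, 1)])
residual_mse = np.mean((t - cond_mean[x1]) ** 2)
print(cond_mean.round(2), round(residual_mse, 3))
```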

Even when the choice of variables does not introduce ambiguity, it still influences the complexity of the learning task and, when training data are limited, may determine if the data are sufficient to describe the target. If the target function is so complex when expressed in terms of the given variables that the available data are insufficient to describe it, then poor generalization could result. Another choice of variables might make the problem simple enough so that the data are adequate and good generalization is possible. The two-spirals problem [233] (illustrated in figure 12.3) is a hard benchmark problem for MLP networks in Cartesian coordinates, but easy in cylindrical coordinates.
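The coordinate effect can be seen directly. The sketch below uses one common parameterization of the benchmark (the exact constants vary between sources): in Cartesian coordinates the two classes interleave tightly, but after converting to cylindrical coordinates a one-line rule separates them perfectly:

```python
import numpy as np

def two_spirals(n=96):
    """One common parameterization of the two-spirals benchmark:
    class 1 is class 0 rotated by 180 degrees."""
    i = np.arange(n)
    r = 6.5 * (104 - i) / 104        # radius shrinks along the spiral
    phi = i * np.pi / 16             # angle grows along the spiral
    x0 = np.c_[r * np.cos(phi), r * np.sin(phi)]
    return np.vstack([x0, -x0]), np.r_[np.zeros(n), np.ones(n)]

X, y = two_spirals()

# In cylindrical coordinates the class is a simple function of (r, theta):
# along each spiral, theta + pi*r is constant modulo 2*pi.
r = np.hypot(X[:, 0], X[:, 1])
theta = np.arctan2(X[:, 1], X[:, 0])
pred = (np.sin(theta + np.pi * r) < 0).astype(float)
print((pred == y).mean())  # every point classified correctly
```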

It is also possible to confuse a network by supplying too much information in the form of redundant or irrelevant inputs. These increase the number of parameters in the system without supplying much usable information. Irrelevant inputs supply no useful information by definition, but when sample sizes are small the irrelevant inputs may have spurious correlations with the targets. More data will be required to demonstrate that they actually are irrelevant.
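The small-sample effect is easy to demonstrate numerically (an illustrative sketch, not from the text): with few examples, some purely random input will usually show a sizable correlation with the target, and the same inputs only reveal their irrelevance as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_corr(n_samples, n_irrelevant=20, trials=200):
    """Average, over random data sets, of the largest |correlation|
    between any purely random input and a fixed balanced target."""
    t = np.arange(n_samples) % 2          # balanced {0, 1} target
    best = []
    for _ in range(trials):
        X = rng.normal(size=(n_samples, n_irrelevant))
        c = np.corrcoef(np.c_[X, t], rowvar=False)[-1, :-1]
        best.append(np.abs(c).max())
    return float(np.mean(best))

print(mean_max_corr(10))    # small sample: large spurious correlations
print(mean_max_corr(1000))  # more data: the same inputs look irrelevant
```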

14.6.3 Variable Representation

The choice of how variables are represented to the network is also an important factor in learning difficulty. In cases where the network must interface with an external system, the choice may be fixed; in other cases, the representation is a free parameter. There is often a trade-off between economy of representation and decoding complexity. A one-dimensional variable could be coded by the activity on a single input unit; this is economical, but may make learning difficult if the function depends on it in a complex way. The use of a fine-grained "thermometer" code, on the other hand, might make learning easy, but cause generalization to suffer because the number of weights increases while the number of training samples stays fixed.
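A small sketch makes the trade-off concrete (illustrative; the unit count and thresholds are arbitrary choices, not from the text). The same scalar can be presented as one input value or as a bank of threshold units:

```python
import numpy as np

def thermometer(x, lo=0.0, hi=1.0, n_units=8):
    """Encode scalar x in [lo, hi] as an n_units thermometer code:
    unit i turns on once x exceeds the i-th threshold, so larger x
    lights up a longer run of units."""
    thresholds = lo + (hi - lo) * (np.arange(n_units) + 1) / (n_units + 1)
    return (x >= thresholds).astype(float)

# Economical coding: one input unit carries the value directly.
single_unit = 0.5

# Fine-grained coding: eight units, each with its own incoming weights.
print(thermometer(0.5))  # [1. 1. 1. 1. 0. 0. 0. 0.]
```

The thermometer code hands the network ready-made threshold features, which can make learning easier, but it multiplies the number of trainable weights per input variable.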

It is often desirable to choose representations that are invariant to certain irrelevant transformations; for example, invariance to shifts, scale, color, small rotations, and so on can be useful in character recognition. Of course, pre- or postprocessing may be needed to connect the network to the raw data and the cost has to be balanced against how much it simplifies the learning problem. Improvement in generalization due to the use of error-correcting output representations is suggested by Dietterich and Bakiri [107].
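As a toy illustration of invariant preprocessing (the function and its details are this sketch's own, not from the text), a 2-D point pattern can be normalized so the network never sees absolute position or size:

```python
import numpy as np

def center_and_scale(points):
    """Shift- and scale-invariant preprocessing: translate a 2-D point
    set to zero mean and rescale to unit RMS radius."""
    p = points - points.mean(axis=0)
    rms = np.sqrt((p ** 2).sum(axis=1).mean())
    return p / rms

shape = np.array([[0., 0.], [2., 0.], [2., 2.], [0., 2.]])
moved = 3.0 * shape + np.array([5., -1.])   # shifted and scaled copy

# Both versions map to the same normalized pattern:
print(np.allclose(center_and_scale(shape), center_and_scale(moved)))  # True
```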

The choice of internal representation is also important and is determined in a broad way by the selection of the network structure. Local internal representations (as in radial basis functions, Kohonen maps, etc.) often make learning easy, but often do not generalize as well as global internal representations (e.g., sigmoidal hidden units). Most of these notes apply to any network architecture, but the focus here is on layered sigmoidal networks, so these differences in architecture will not be considered.
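The local/global distinction is visible in the unit response functions themselves. In this sketch (illustrative; the center value is arbitrary), a radial basis unit responds only near its center while a sigmoidal unit responds over an entire half-space of the input:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)
center = 1.0

# Local response: a Gaussian RBF unit is near zero far from its center.
local_unit = np.exp(-(x - center) ** 2)

# Global response: a sigmoid unit saturates across a whole half-space.
global_unit = 1.0 / (1.0 + np.exp(-(x - center)))

print(local_unit.round(3))
print(global_unit.round(3))
```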

14.6.4 Modularity

Many practical problems can be partitioned into independent subproblems. If the system designer knows this, then the information should be incorporated in the network structure rather than requiring the network to learn it from the examples. Separate networks can then be trained independently for each subproblem and combined. The result is (1) shorter training times because each subnetwork is smaller, and (2) better generalization because each subnetwork is better constrained by the available examples. Say, for example, that a problem has two input variables, x1 and x2, and can be separated into two independent subproblems y1(x1) and y2(x2). If each input can take m values, then O(m) examples describe each function adequately. To train a single network to solve both problems at once, O(m^2) examples would be needed to describe the system adequately. If there are few examples and spurious correlations exist between y2 and x1, for example, the network is likely to take advantage of them and generalize poorly as a result.
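The counting argument can be checked with a back-of-the-envelope sketch (the value of m is an arbitrary illustrative choice):

```python
m = 10   # possible values per input variable (arbitrary choice)

# A single network for y(x1, x2) must be pinned down over the joint
# input space: on the order of one example per (x1, x2) combination.
joint_examples = m * m        # O(m^2)

# Two subnetworks y1(x1) and y2(x2) each need only O(m) examples.
modular_examples = m + m      # O(m)

print(joint_examples, modular_examples)  # 100 vs. 20
```

Even at m = 10 the modular design needs a fifth of the data; the gap widens rapidly as m grows.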