
Chapter 16 - Heuristics for Improving Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

16.7 Replicated Networks

Another idea for improving generalization is to combine the outputs of several systems that differ in how they classify novel examples [245], [298], [189], [238], [63], [36], [111]. (Though not about neural networks per se, [83] surveys many methods for combining forecasts.) The subsystems may differ because of variations in configuration, size, or initialization, differences in the learning algorithm or the training data, and so on, or because they use completely different approximation models. The important factor is that they represent a variety of solutions to the same problem; there is no benefit in evaluating multiple models that all predict the same thing.

With a mean-square error function, the best generalization would be expected when the system generates the expected value of all possible consistent functions, weighted by their probability of occurrence. That is,

    F(x) = Σ_f f(x) p_f(f).    (16.6)

Averaging the output of different systems is a simple approximation to this expected value and tends to damp out extreme behaviors that might not be justified by the data. Additional advantages are improved fault-tolerance and the ability to retrain poorly performing subsystems using the ensemble average as the target.
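The book gives no code for this; the following sketch (all names illustrative, with polynomial fits of different degrees standing in for independently configured networks) shows output averaging and illustrates that the ensemble's mean-square error can be no worse than the average error of its members:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy 1-D regression problem.
def target(x):
    return np.sin(x)

x_train = rng.uniform(-3, 3, size=50)
y_train = target(x_train) + rng.normal(0, 0.2, size=x_train.shape)

# Stand-ins for subsystems with different biases: polynomial fits of
# different degrees play the role of networks that differ in size or
# configuration.
degrees = [3, 5, 7, 9]
models = [np.polynomial.Polynomial.fit(x_train, y_train, d) for d in degrees]

x_test = np.linspace(-3, 3, 200)
preds = np.stack([m(x_test) for m in models])   # shape (K, 200)

# Ensemble output: a plain average of the subsystem outputs.
ensemble = preds.mean(axis=0)
```

By convexity of the squared error, the averaged output's MSE is at most the mean of the individual MSEs, so averaging can only damp, never amplify, the members' average error.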

Although more sophisticated combination methods are possible, a simple average may do as well as other methods in many cases [83]. A weighted average is often suggested:

    F(x) = Σ_k c_k f_k(x).    (16.7)

The weighting factors c_k may be determined by a linear regression or may depend on how well each subsystem performs on its training data, among other possibilities. Because similar systems trained on similar data are likely to make similar predictions, collinearity of the f_k(x) could make the linear regression ill-conditioned and result in a bad choice of c values. (This is one suggested reason why a simple average often does as well as more complicated methods.) Use of a convex linear combination, in which Σ_k c_k = 1, is suggested in [60] for this reason.
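A minimal sketch of both options, with synthetic stand-in predictions (the data and variable names are illustrative, not from the book): an unconstrained least-squares fit for the c_k, and a simple convex alternative that weights each subsystem by its inverse validation error and normalizes so the weights sum to 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# preds: K x N matrix of subsystem outputs f_k(x) on a validation set;
# y: the N validation targets. Here the "subsystems" are just the
# target plus noise of different strengths.
y = np.sin(np.linspace(-3, 3, 100))
preds = np.stack([y + rng.normal(0, s, size=y.shape) for s in (0.1, 0.2, 0.4)])

# Option 1: unconstrained least squares for c in y ~ c^T preds.
# Collinear f_k can make this ill-conditioned; lstsq at least handles
# the rank-deficient case gracefully.
c_ls, *_ = np.linalg.lstsq(preds.T, y, rcond=None)

# Option 2: a convex combination (c_k >= 0, sum c_k = 1), here chosen
# by inverse validation MSE. This is a robust heuristic, not the
# regression solution.
mse = ((preds - y) ** 2).mean(axis=1)
c = 1.0 / mse
c /= c.sum()
combined = c @ preds
```

The convex weights cannot blow up under collinearity the way unconstrained regression coefficients can, which is the motivation attributed to [60] above.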

Stacked generalization [409] is a related method for improving generalization. Rather than simply averaging the outputs of several systems, it combines the outputs in more complex ways to maximize generalization.
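A sketch of the stacking idea under illustrative assumptions (polynomial fits as level-0 models, a linear model as the level-1 combiner; Wolpert's method is more general than this): level-0 models predict on held-out folds, and the combiner is trained on those out-of-fold predictions rather than on raw inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(0, 0.2, size=x.shape)

degrees = [2, 5, 9]          # three level-0 models (polynomial fits)
folds = np.array_split(rng.permutation(n), 5)

# Out-of-fold level-0 predictions form the level-1 training inputs, so
# the combiner sees each model's behavior on data it was not fit to.
Z = np.zeros((n, len(degrees)))
for idx in folds:
    mask = np.ones(n, dtype=bool)
    mask[idx] = False
    for j, d in enumerate(degrees):
        m = np.polynomial.Polynomial.fit(x[mask], y[mask], d)
        Z[idx, j] = m(x[idx])

# Level-1 combiner: here, an ordinary linear fit on the stacked
# predictions (a weighted average is the special case w >= 0, sum w = 1).
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
stacked_fit = Z @ w
```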

The idea that replicating networks could help generalization might seem counterintuitive because N replicated networks have N times as many weights and thus might seem to need many more examples to constrain them. The networks are trained independently, however, so the number of examples needed to train each does not change. If identical networks are trained on different subsets of the data (each net having a different holdout set used to control overfitting) and their outputs are averaged to obtain the global output, this is similar to doing k-fold cross-validation or bootstrapping in parallel.
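The parallel k-fold arrangement just described can be sketched as follows (again with polynomial fits standing in for identical networks; names are illustrative): each model is fit with a different fold held out, and the global output is the average over the resulting models.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(0, 0.2, size=x.shape)

# Partition the data into 5 folds; each "network" trains on the data
# with a different fold held out, as in k-fold cross-validation.
folds = np.array_split(rng.permutation(n), 5)

models = []
for idx in folds:
    mask = np.ones(n, dtype=bool)
    mask[idx] = False          # this fold is the model's holdout set
    models.append(np.polynomial.Polynomial.fit(x[mask], y[mask], 7))

# Global output: the average of the independently trained models.
x_new = np.linspace(-3, 3, 50)
avg_pred = np.mean([m(x_new) for m in models], axis=0)
```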

In general, a training set can contain regularities on many scales. Different subsystems with different biases, but trained with the same goals, are likely to agree about the large scale regularities that are obviously "supported by the data" while disagreeing mostly on smaller factors. An overtrained subsystem could choose a very idiosyncratic solution that is unlikely to match the real target function, but there are a huge number of ways to overfit the data and independent subsystems are likely to choose different ones. By averaging many responses, the total system expresses the consensus about obvious regularities recognized by most subsystems while avoiding extreme solutions in areas where there is disagreement.

A problem with this approach is that the number of systems that may need to be averaged in order to improve generalization significantly could be very large, particularly when the systems are complex; that is, the estimated mean in equation 16.6 could have a high variance. There is still a need for external information to bias the learning algorithm to produce subnetworks that share the bias p_f(f).