Chapter 15 - Generalization Prediction and Assessment

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

15.1 Cross-Validation

A rather direct way to estimate the generalization ability of a system is to measure the error on a separate data set that the network has not seen during training. In simple cross-validation, that is, the holdout method, the available data are divided into two subsets: a training set used to train the network and a test set used to estimate the true error rate. To avoid obvious bias, both sets should be random samples of the same population.
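A minimal sketch of such a holdout split, assuming the data are held in NumPy arrays (the function and parameter names are illustrative, not from the text; the default equal-sized split reflects the compromise discussed in the next paragraph):

    import numpy as np

    def holdout_split(X, y, test_fraction=0.5, seed=0):
        # Randomly partition the data so that the training and test sets
        # are both random samples of the same population.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_test = int(len(X) * test_fraction)
        return (X[idx[n_test:]], y[idx[n_test:]],   # training set
                X[idx[:n_test]], y[idx[:n_test]])   # held-out test set

The network would then be trained on the first pair of arrays and its error measured on the second, which serves as the estimate of the true error rate.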

Ideally, both sets should be large because the larger the training set, the more accurate the approximation learned by the trained system, and the larger the test set, the more accurate the estimate of the true error rate. When data are limited, these goals conflict. As a compromise, sets of roughly equal size are usually chosen.

With large sample sizes the holdout method can be accurate, but it has limitations when sample sizes are small. The test samples are unavailable for training, so the network must be trained on less data with a greater risk of overtraining or overfitting. The validation set is used to guard against this, but with only a small amount of validation data the error estimate has a large variance and may be unreliable; an uncharacteristic test set could give a bad estimate of the error. If the training error surface is distorted because of sampling deficiencies, the validation error surface is likely to be similarly distorted when the data sets have similar sizes. For the validation set to be a better predictor of the true generalization error than the training set, it will usually have to be several times larger, but this limits the amount of data that can be used for training.

Another problem is that, depending on the training algorithm, the solution may be indirectly biased toward the validation set, so a third, completely different data set is needed to form an unbiased estimate of the error. It is common, for example, to train a number of networks and choose the one that performs best on the validation set. If thousands of networks were generated, a few might coincidentally have low errors on the validation set and yet not generalize well. Because the validation set is used, albeit indirectly, as part of the training process, there is a danger of obtaining a biased solution.
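The point can be made concrete with a small illustrative sketch; all names here (candidate_nets, error_fn, X_val, X_test, and so on) are hypothetical placeholders rather than anything defined in the text:

    def select_and_assess(candidate_nets, error_fn, X_val, y_val, X_test, y_test):
        # Choose the network with the lowest validation error, then report its
        # error on a third set that played no role in training or selection.
        best_net = min(candidate_nets, key=lambda net: error_fn(net, X_val, y_val))
        return best_net, error_fn(best_net, X_test, y_test)

The validation error of best_net is optimistically biased by the selection step; only the error on the untouched third set is an unbiased estimate.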

Simple cross-validation as described here uses a single holdout set to estimate the generalization error. Resampling techniques such as leave-one-out, k-fold cross-validation, and bootstrapping [389], [116] address limitations of the single holdout method by averaging over multiple holdout experiments using different partitions of the data into training and test sets. Once impractical, these sorts of methods have become feasible due to increasing computer processing power. Advantages are that the error estimates are generally more accurate, the network can be trained on almost all the data, and it can be tested on all of it. The drawback is an increased computational burden. It should be noted that these are nonparametric methods that do not make restrictive assumptions about data distributions and are not restricted to linear models.
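As an illustration of how k-fold cross-validation reuses the data, the sketch below partitions the samples into k disjoint folds, holds each fold out once, and averages the k holdout errors. The callables train_fn and error_fn are hypothetical stand-ins for whatever network training and error-measurement routines are in use:

    import numpy as np

    def k_fold_error(X, y, train_fn, error_fn, k=10, seed=0):
        # train_fn(X, y) -> model and error_fn(model, X, y) -> scalar
        # are supplied by the caller.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_fn(X[train_idx], y[train_idx])               # train on k-1 folds
            errors.append(error_fn(model, X[test_idx], y[test_idx]))   # test on the held-out fold
        return float(np.mean(errors))   # every sample is tested exactly once

Leave-one-out cross-validation is the special case in which k equals the number of samples, so each network is trained on all but one example.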

Bootstrapping is one of the most accurate techniques, in general, but also one of the slowest. As noted, the estimate obtained from a single holdout set may have a large variance. Bootstrapping lowers the variance (at the expense of a slight increase in the bias) by averaging estimates obtained from many different partitions of the data [389]. It is common to use hundreds or thousands of subset estimates. If the method is used to obtain a more accurate one-time estimate of the generalization ability of a trained network, the computational burden may not be a critical factor because network training times are already long in most cases. This probably is not a practical way of comparing network architectures if each subset sample requires the training of a new network.
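The basic bootstrap idea can be sketched as follows; this illustrates the resampling principle rather than the specific estimator of [389], and train_fn and error_fn are again hypothetical. Each replicate is drawn with replacement from the original data, a network is trained on it, the error is measured on the samples left out of that replicate, and the estimates are averaged:

    import numpy as np

    def bootstrap_error(X, y, train_fn, error_fn, n_boot=200, seed=0):
        # Average the out-of-sample error over many bootstrap replicates.
        rng = np.random.default_rng(seed)
        n = len(X)
        errors = []
        for _ in range(n_boot):
            boot_idx = rng.integers(0, n, size=n)        # draw n samples with replacement
            in_bag = np.zeros(n, dtype=bool)
            in_bag[boot_idx] = True
            oob_idx = np.flatnonzero(~in_bag)            # samples never drawn form the test set
            if oob_idx.size == 0:
                continue                                 # rare: every sample was drawn
            model = train_fn(X[boot_idx], y[boot_idx])   # a new network for each replicate
            errors.append(error_fn(model, X[oob_idx], y[oob_idx]))
        return float(np.mean(errors))

Because each replicate trains a fresh network, the cost grows with the number of replicates, which is the computational burden noted above.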