15.1 Cross-Validation
A rather direct way to estimate the generalization ability
of a system is to measure the error on a separate data set that the network has
not seen during training. In simple cross-validation, that is, the holdout
method, the available data is divided into two subsets: a training set used to
train the network and a test set used to estimate the true error rate. To avoid
obvious bias, both sets should be random samples of the same population.
Ideally, both sets should be large because the larger the training
set, the more accurate the approximation learned by the trained system, and the
larger the test set, the more accurate the estimate of the true error rate. When
data are limited, these goals conflict. As a compromise, sets of roughly equal
size are usually chosen.
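As a minimal sketch of the holdout method described above, the following Python function randomly partitions a data set into training and test subsets; the names and the `test_fraction` parameter are illustrative choices, not part of the text.

```python
import random

def holdout_split(data, test_fraction=0.5, seed=0):
    """Randomly partition `data` into training and test subsets.

    Shuffling first makes both subsets random samples of the same
    population, avoiding obvious bias. test_fraction controls the
    compromise between training-set and test-set size; 0.5 gives the
    roughly equal split the text mentions for limited data.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_fraction)
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test
```

A model would then be trained on `train` only, and its error rate measured on `test` as the estimate of the true error rate.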
With large sample sizes the holdout method can be accurate, but it
has limitations when sample sizes are small. The test samples are unavailable
for training, so the network must be trained on less data with more risk of
overtraining or overfitting. The validation set is used to guard against this,
but, with just a small amount of validation data, the error estimate has a large
variance and may be unreliable; an uncharacteristic holdout sample could give a poor
estimate of the error. If the training error surface is distorted because of
sampling deficiencies, the validation error surface is likely to be similarly
distorted when the data sets have similar sizes. In order for the validation set
to be a better predictor of the true generalization error than the training set,
it will usually have to be several times larger, but this limits the amount of
data that can be used for training.
Another problem is that, depending on the training algorithm, the
solution could be indirectly biased toward the validation set, so a third,
completely different, data set is needed to form an unbiased estimate of the
error. For example, it is common to train a number of networks and choose the one
that performs best on the validation set. If thousands of networks were
generated, a few might coincidentally have low errors on the validation set but
still not generalize well. Because the validation set is used, albeit
indirectly, as part of the training process, there is a danger of obtaining a
biased solution.
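The selection bias described above can be illustrated with a small simulation. This is not a method from the text, just a sketch: each "network" below is a coin-flip classifier whose true error rate is 0.5, yet the best measured validation error over many such networks looks much better than 0.5 purely by chance.

```python
import random

def selection_bias_demo(n_networks=1000, n_val=20, seed=0):
    """Show the optimistic bias of picking the best of many models
    on a small validation set.

    Every simulated network has true error 0.5; its measured error on
    n_val validation examples varies by chance. The minimum over many
    networks is far below 0.5 even though none generalizes better,
    which is why an independent third set is needed for an unbiased
    error estimate.
    """
    rng = random.Random(seed)
    val_errors = [
        sum(rng.random() < 0.5 for _ in range(n_val)) / n_val
        for _ in range(n_networks)
    ]
    return min(val_errors)
```

With a thousand candidate networks and only twenty validation examples, the winner's validation error is typically well under the true 0.5, demonstrating the danger of reusing the validation set for selection.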
Simple cross-validation as described here uses a single holdout
set to estimate the generalization error. Resampling techniques such as
leave-one-out, k-fold cross-validation, and bootstrapping [389], [116] address limitations of the single holdout method by
averaging over multiple holdout experiments using different partitions of the
data into training and test sets. Once impractical, such methods have become
feasible with the growth of computing power. Advantages are that
the error estimates are generally more accurate, the network can be trained on
almost all the data, and it can be tested on all of it. The drawback is an
increased computational burden. It should be noted that these are nonparametric
methods that do not make restrictive assumptions about data distributions and
are not restricted to linear models.
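The k-fold scheme mentioned above can be sketched as follows. The `train_and_test` callback is hypothetical: it stands in for training a fresh network on one partition and returning its error rate on the held-out fold.

```python
def k_fold_estimate(data, k, train_and_test):
    """Average the test error over k holdout experiments.

    The data are split into k disjoint folds; each fold serves as the
    test set once while the model trains on the remaining (k-1)/k of
    the data. Every example is thus used for testing exactly once, and
    the network is trained on almost all the data, as the text notes.
    Leave-one-out is the special case k = len(data).
    """
    folds = [data[i::k] for i in range(k)]  # k disjoint partitions
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_test(train, test))
    return sum(errors) / k
```

The increased computational burden is visible directly: `train_and_test` is invoked k times, so k networks must be trained instead of one.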
Bootstrapping is one of the most accurate techniques, in
general, but also one of the slowest. As noted, the estimate obtained from a
single holdout set may have a large variance. Bootstrapping lowers the variance
(at the expense of a slight increase in the bias) by averaging estimates
obtained from many different partitions of the data [389]. It is common to use hundreds or thousands of subset
estimates. If the method is used to obtain a more accurate one-time estimate of
the generalization ability of a trained network, the computational burden may
not be a critical factor because network training times are already long in most
cases. This probably is not a practical way of comparing network architectures
if each subset sample requires the training of a new network.
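A common variant of the bootstrap procedure described above can be sketched as follows; the out-of-bag testing scheme and the `train_and_test` callback are illustrative assumptions, not details given in the text.

```python
import random

def bootstrap_error(data, train_and_test, n_resamples=200, seed=0):
    """Bootstrap estimate of generalization error.

    Each resample draws len(data) examples with replacement as the
    training set; the examples left out ("out-of-bag") form the test
    set. Averaging the estimates from many resamples lowers the
    variance of the estimate, at the cost of retraining the model
    once per resample.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]
        train = [data[i] for i in sample]
        chosen = set(sample)
        oob = [x for i, x in enumerate(data) if i not in chosen]
        if oob:  # skip the rare resample that happens to cover every example
            estimates.append(train_and_test(train, oob))
    return sum(estimates) / len(estimates)
```

With hundreds or thousands of resamples, each requiring a full training run, the computational cost makes this more suitable for a one-time estimate of a chosen architecture than for comparing many architectures, as the text observes.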