
Chapter 15 - Generalization Prediction and Assessment

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

15.4 PAC Learning and the VC Dimension

Valiant's PAC (probably approximately correct) learning model and the Vapnik-Chervonenkis (VC) dimension have been used to study the problem of learning binary-valued functions from examples, for example, [376], [49], [1]. These relate the complexity of a learning system to the number of examples required for it to learn a particular function from a given class of functions. Briefly, if the number of examples is small relative to the complexity of the system, the generalization error is expected to be high.

Abu-Mostafa [1] provides a brief tutorial on which the following paragraphs are based. More complete descriptions can be found in several texts, [282] for example. A learning algorithm samples points x ∈ X, observes the target values f(x), and tries to find a hypothesis function g(x) that matches f everywhere. The examples are assumed to be drawn independently from some fixed distribution, which can be arbitrary as long as the same distribution is used for learning and testing. The hypothesis functions are drawn from a restricted class G. The algorithm chooses among hypothesis functions based on their performance on the samples. If the number of samples is too small, the estimated performance ν_g (the frequency of error on the samples) could differ significantly from the actual performance π_g and the algorithm could be fooled. A condition for uniform convergence [379] is

P\left[ \sup_{g \in G} |\nu_g - \pi_g| > \epsilon \right] \le 4\, m(2N)\, e^{-\epsilon^2 N/8} \qquad (15.16)

where m(N) is the growth function of G (defined below) and N is the number of samples. For ν_g to approach π_g as N becomes large, the right-hand side must approach 0. The e^{-ε²N/8} factor decays exponentially with N, so convergence is possible if m(2N) does not grow too fast. This is satisfied when m(N) is polynomial in N, for example [1].
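To make the convergence concrete, here is a minimal simulation (an illustration, not from the book) using a simple class of threshold functions g_t(x) = [x > t] on uniform inputs, for which the true error π_t = |t − 0.5| is known in closed form. The grid of thresholds and the sample sizes are arbitrary example choices; the worst-case gap between ν_g and π_g shrinks as N grows:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothesis class G: threshold functions g_t(x) = [x > t] on a grid of t.
thresholds = np.linspace(0.0, 1.0, 101)

def max_gap(N):
    """Largest |nu_g - pi_g| over G for one random sample of size N.
    Target f(x) = [x > 0.5] with x uniform on [0, 1], so pi_t = |t - 0.5|."""
    x = rng.uniform(0.0, 1.0, N)
    f = x > 0.5
    nu = np.array([np.mean((x > t) != f) for t in thresholds])
    pi = np.abs(thresholds - 0.5)
    return np.max(np.abs(nu - pi))

for N in (10, 100, 1000, 10000):
    print(N, max_gap(N))  # worst-case gap over G shrinks as N grows
```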

The growth function m(N) measures the number of ways that functions in G can label N arbitrary but independent points. The VC dimension d of class G is the maximum number of points N for which a function in G can always be found that fits the points no matter how they are labeled. For N ≤ d, m(N) grows exponentially with N (i.e., m(N) = 2^N). For N > d, G cannot realize some labelings of the points and m(N) ceases to grow exponentially; thus m(N) ≤ N^d + 1.
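The interplay between this polynomial bound and the exponential decay in equation 15.16 can be checked numerically. The sketch below (an illustration, not from the book) plugs m(N) = 2^N for N ≤ d and m(N) ≤ N^d + 1 for N > d into the right-hand side of equation 15.16; with d fixed, the bound eventually collapses toward zero as N grows:

```python
import math

def growth_bound(N, d):
    """m(N) = 2^N for N <= d; beyond the VC dimension, m(N) <= N^d + 1."""
    return 2.0 ** N if N <= d else float(N) ** d + 1.0

def vc_bound(N, d, eps):
    """Right-hand side of equation 15.16: 4 m(2N) exp(-eps^2 N / 8)."""
    return 4.0 * growth_bound(2 * N, d) * math.exp(-eps ** 2 * N / 8.0)

for N in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(N, vc_bound(N, d=10, eps=0.1))  # polynomial loses to exponential
```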

The importance of this to learning is that if the number of examples is large compared to the VC dimension of the target function class, then equation 15.16 promises uniform convergence. The estimated error rates will then be close to the actual rates and the learning algorithm has a reliable method to choose the best hypothesis.

The sample complexity m(ε, δ) of a class G is the smallest sample size that guarantees uniform convergence for all target concepts in G and all sampling distributions. An upper bound is [49]

m(\epsilon, \delta) \le \max\!\left( \frac{4}{\epsilon} \log_2 \frac{2}{\delta},\ \frac{8d}{\epsilon} \log_2 \frac{13}{\epsilon} \right) \qquad (15.17, 15.18)

This is relevant to neural network training in that a network is capable of representing a certain class of concepts and so has some particular VC dimension (e.g., the VC dimension of a simple perceptron with k inputs is k + 1). If the network can be trained on a number m(ε, δ) of examples achieving an error no greater than ε on the training set, then, with probability 1 − δ, one expects that the true error is no greater than 2ε and similar average error can be expected on novel examples drawn from the same distribution.
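As a numerical illustration (assuming the max-of-two-terms form of equations 15.17-15.18 given above), the bound can be evaluated for a simple perceptron, whose VC dimension is k + 1; the input count and tolerances below are arbitrary example values:

```python
import math

def sample_complexity_upper(d, eps, delta):
    """Blumer et al. upper bound (eqs. 15.17-15.18) on PAC sample complexity."""
    return max(4.0 / eps * math.log2(2.0 / delta),
               8.0 * d / eps * math.log2(13.0 / eps))

k = 20  # perceptron inputs, so VC dimension d = k + 1
m = sample_complexity_upper(d=k + 1, eps=0.1, delta=0.05)
print(math.ceil(m))  # on the order of 10^4 examples for this small model
```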

By assuming a uniform sampling distribution, the VC dimension of a feedforward network with N nodes and W weights has been estimated as [33], [32], [34]

d \le 2W \log_2(eN) \qquad (15.19)

This has been used to put an upper bound on the number of examples that might be needed to achieve a given generalization error rate. If the network can be trained with

m \ge \frac{32W}{\epsilon} \ln \frac{32N}{\epsilon} \qquad (15.20)

randomly selected training examples achieving an error rate of less than ε/2, then a generalization error rate of at most ε can be expected for examples drawn from the same distribution (for 0 < ε ≤ 1/8). This agrees with the rule of thumb that roughly O(W/ε) examples are needed to achieve a generalization error less than ε.
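A quick calculation (an illustrative sketch, assuming the (32W/ε) ln(32N/ε) form of equation 15.20 given above; the network sizes are made-up examples) shows how demanding the bound is even for a small network:

```python
import math

def baum_haussler_examples(W, N, eps):
    """Equation 15.20: training-set size sufficient for generalization
    error below eps, given W weights and N nodes (0 < eps <= 1/8)."""
    return 32.0 * W / eps * math.log(32.0 * N / eps)

# Example: 10 inputs, 20 hidden units, 1 output, with biases:
# W = 10*20 + 20 + 20 + 1 = 241 weights, N = 21 computational nodes.
print(math.ceil(baum_haussler_examples(W=241, N=21, eps=0.1)))  # ~680,000
```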

For a network with N inputs and one hidden layer of H units [33],

d \ge 2 \left\lfloor H/2 \right\rfloor N \qquad (15.21)

This is approximately equal to the number of weights W for large H, also suggesting that Ω(W/ε) examples are needed to achieve a generalization error less than ε.
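A small comparison (illustrative, assuming the 2⌊H/2⌋N form of equation 15.21; the sizes are arbitrary) makes the near-equality with W concrete:

```python
def vc_lower_bound(N, H):
    """Equation 15.21: VC dimension lower bound for N inputs, H hidden units."""
    return 2 * (H // 2) * N

def weight_count(N, H):
    """Weights incl. biases: N*H input-to-hidden, H hidden biases,
    H hidden-to-output, 1 output bias."""
    return N * H + H + H + 1

N, H = 50, 40
print(vc_lower_bound(N, H), weight_count(N, H))  # 2000 vs. 2081
```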

Problems

Because the theory is very general, the estimated bounds are loose. Its main value to neural network design seems to be to indicate that if there are enough examples, then the training error should be a good predictor of the generalization error. This allows broad statements to be made about the appropriate size of a network given a particular amount of training data. These bounds, however, do not apply to networks with multiple continuous outputs, and they do not say how to choose a suitable network given a particular set of examples to be learned. Other concerns are that the analysis is asymptotic, whereas available data are finite, and the bounds are worst-case (over any data distribution) and appear to be overly pessimistic; the number of examples required to satisfy PAC requirements is often very high. For practical problems, the average-case behavior may be more important. Numerical tests [85], [172] show that the average behavior can be better than the VC bounds in many cases.

The basic theory also ignores peculiarities of specific learning algorithms. Techniques such as pruning and regularization may add constraints that prevent full exploitation of the intrinsic network complexity. Large networks can realize complex functions, but they can also mimic simple (e.g., linear) functions. A network is usually initialized with small weights and the resulting input-output relationship is very smooth, almost linear. As the network is trained, the weights become larger and the transfer function more complex. The complexity of the transfer function thus depends on other factors in addition to network size. Some work has been done to estimate the effective number of parameters based on the network response function [378], [147]. Normally, the effective dimension cannot be calculated analytically but it may be estimated from network performance. This approach may be able to account for dynamic changes in network complexity as a result of training.
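The claim that a small-weight network starts out nearly linear is easy to check numerically. The sketch below (an illustration, not from the book; the weight scales and layer sizes are arbitrary) fits a straight line to the output of a random one-hidden-layer tanh network and reports the relative residual, which is near zero for small weights and grows as the weights get larger:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)

def random_tanh_net(x, scale, n_hidden=20):
    """Random one-hidden-layer tanh network; 'scale' sets weight magnitudes."""
    W1 = scale * rng.standard_normal((1, n_hidden))
    b1 = scale * rng.standard_normal(n_hidden)
    w2 = scale * rng.standard_normal(n_hidden)
    return np.tanh(x @ W1 + b1) @ w2

for scale in (0.1, 5.0):
    y = random_tanh_net(x, scale)
    A = np.hstack([x, np.ones_like(x)])  # design matrix for a line fit
    _, res, *_ = np.linalg.lstsq(A, y, rcond=None)
    rel = float(np.sqrt(res[0]) / np.linalg.norm(y)) if res.size else 0.0
    print(f"scale={scale}: relative deviation from linearity {rel:.3f}")
```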

This is an active research area, however, and new results continue to appear. It is known, for example, that networks with continuous activation functions are more powerful than networks of threshold units. Recent work suggests that networks with continuous unit activations can have VC dimension at least as large as W², where W is the number of weights [213]. This means that it may be very hard to constrain a network with a reasonable number of examples.