
Chapter 3 - Single-Layer Networks

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

3.3 Hyperplane Capacity

What is the probability that a random set of points with random binary labels is linearly separable? Cover [87] provided the following result. Given N points in a d-dimensional input space, there are 2^N possible ways of labeling the points 0 or 1. Each of these forms a dichotomy, a division of the N points into two classes. A dichotomy is linearly separable if all the 0s can be separated from all the 1s by a hyperplane. It is homogeneously linearly separable if the points can be separated by a hyperplane passing through the origin. A linear separation of N points in d dimensions is a homogeneous linear separation in d + 1 dimensions because the offset of a hyperplane that does not pass through the origin can be absorbed into an extra bias weight.
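Written out, the bias absorption is the identity

\[ w^\top x + b = \begin{bmatrix} w \\ b \end{bmatrix}^\top \begin{bmatrix} x \\ 1 \end{bmatrix}, \]

so a hyperplane with bias in d dimensions acts exactly like a hyperplane through the origin in d + 1 dimensions on the augmented points (x, 1).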

Cover [87] defines the capacity of a hyperplane as the number of dichotomies it can separate. For N points in general position (defined below) in a Euclidean space of dimension d, the number C(N, d) of linearly separable dichotomies (equivalently, the number of homogeneously linearly separable dichotomies in d + 1 dimensions) is

\[ C(N, d) = 2 \sum_{k=0}^{d} \binom{N-1}{k} \tag{3.7} \]

This result is for N points in general position. If the N points are not in general position, the number of linearly separable dichotomies may be much lower. A set of points in d dimensions is in general position if every subset of d + 1 or fewer points is affinely independent. For N > d, this requires that no set of d + 1 points lie on a (d - 1)-dimensional hyperplane. In d = 3 dimensions, for example, no set of 4 points may be coplanar. For N ≤ d, N points are in general position if no (N - 2)-dimensional hyperplane contains them all.
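As a quick numerical sketch (my own illustration, not from the text; the name count_dichotomies is invented), equation 3.7 can be evaluated directly in Python:

    from math import comb

    def count_dichotomies(N: int, d: int) -> int:
        # C(N, d): linearly separable dichotomies of N points in
        # general position in d dimensions (equation 3.7).
        return 2 * sum(comb(N - 1, k) for k in range(d + 1))

    # All 2^N dichotomies are separable while N <= d + 1:
    assert count_dichotomies(3, 2) == 8    # 2^3
    assert count_dichotomies(4, 2) == 14   # 14 of 2^4 = 16

For 4 points at the corners of a square, for example, the two missing dichotomies are the two exclusive-OR labelings.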

C(N, d) can be computed recursively with the following relations:

\[ C(1, d) = 2, \qquad C(N, 1) = 2N, \qquad C(N + 1, d) = C(N, d) + C(N, d - 1). \tag{3.8} \]

The first relation says that a single point can be classified in two ways. The second says that N points on a line can be classified by a linear separator in 2N ways: the separator can fall in any of the N - 1 intervals between points, and it can be oriented in two ways so that the zeros fall on one side or the other. Adding the two cases where all points are labeled zero or all are labeled one gives the total 2N.

The recurrence relation is obtained as follows. Suppose N points in d dimensions form C(N, d) dichotomies and we add another point p. This divides the existing dichotomies into two classes:

1. dichotomies that cannot be realized by any separating hyperplane passing through p: each extends to exactly one dichotomy of the N + 1 points, because every hyperplane realizing it places p on the same fixed side, and

2. dichotomies that can be realized by a hyperplane passing through p: each extends to two dichotomies of the N + 1 points, because a small shift of that hyperplane can place p on either side.

If M1 and M2 are the numbers of dichotomies in the two classes, then the number of dichotomies in the new system is C(N + 1, d) = M1 + 2M2. But M1 + M2 = C(N, d), so C(N + 1, d) = C(N, d) + M2. However, M2 = C(N, d - 1), because constraining the hyperplane to pass through p removes one degree of freedom and so is equivalent to reducing the dimension d to d - 1. Substitution gives the recurrence relation C(N + 1, d) = C(N, d) + C(N, d - 1).
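The recurrence and its base cases are easy to check against the closed form; a small sketch (my own, not from the text):

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def C_rec(N: int, d: int) -> int:
        # C(N, d) computed from the relations of equation 3.8.
        if N == 1:
            return 2
        if d == 1:
            return 2 * N
        return C_rec(N - 1, d) + C_rec(N - 1, d - 1)

    # Agreement with equation 3.7 over a small grid:
    for N in range(1, 12):
        for d in range(1, 8):
            assert C_rec(N, d) == 2 * sum(comb(N - 1, k) for k in range(d + 1))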

Repeated application of the recurrence to the terms on the right yields

\[ C(N, d) = \sum_{k=0}^{N-1} \binom{N-1}{k}\, C(1, d - k) \tag{3.9} \]

and (3.7) follows on noting that [87]

\[ C(1, m) = \begin{cases} 2 & m \ge 0 \\ 0 & m < 0. \end{cases} \tag{3.10} \]

Returning to Equation 3.7, all dichotomies on N ≤ d + 1 points in general position are linearly separable in d dimensions. (All possible labelings of 3 points in 2 dimensions are linearly separable, for example.) For N > d + 1, only C(N, d) of the 2^N possible dichotomies are linearly separable. The probability that a randomly chosen dichotomy is linearly separable is then

\[ f(N, d) = \frac{C(N, d)}{2^N} = 2^{1-N} \sum_{k=0}^{d} \binom{N-1}{k} \tag{3.11} \]
Figure 3.8: Capacity of a hyperplane. Shown is the probability that a random dichotomy on N points in d dimensions is linearly separable, for various values of d. The probability is greater than 1/2 for N < 2(d + 1) and less than 1/2 for N > 2(d + 1). As d increases, the transition from high to low probability becomes steeper and approaches a step function.

This probability is plotted as a function of N/(d + 1) in figure 3.8 for a few values of d. It can be shown that C(N, d) = 2^(N-1) at N = 2(d + 1), so f(N, d) = 1/2 there. The probability is greater than 1/2 as long as N < 2(d + 1) and less than 1/2 for N > 2(d + 1). As d becomes large, the transition becomes steeper: almost all dichotomies of N < 2(d + 1) points are linearly separable, while almost all dichotomies of more points are not. At N = 2(d + 1), one-half of the dichotomies are linearly separable, which suggests N = 2(d + 1) points as the separating capacity of the hyperplane.
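The crossing at N = 2(d + 1) and the sharpening transition can be reproduced directly (a sketch under the same notation as equation 3.11):

    from math import comb

    def f(N: int, d: int) -> float:
        # Equation 3.11: probability that a random dichotomy of N points
        # in general position in d dimensions is linearly separable.
        return 2.0 ** (1 - N) * sum(comb(N - 1, k) for k in range(d + 1))

    for d in (1, 5, 25):
        N_cap = 2 * (d + 1)
        # Sample at N/(d + 1) = 1.5, 2, and 2.5:
        print(d, f(int(0.75 * N_cap), d), f(N_cap, d), f(int(1.25 * N_cap), d))

For d = 1 this prints 0.75, 0.5, and 0.3125; for d = 25 it prints roughly 0.98, 0.5, and 0.05, showing the step sharpening as a function of N/(d + 1).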

On average, a single linear threshold unit with d weights and a threshold can be made to correctly classify up to 2(d + 1) randomly labeled patterns before the probability that it can classify all of them correctly falls to 1/2 [87]. For large N, only a tiny fraction of the possible mappings are linearly separable. In terms of generalization, this means that a d-input linear threshold unit can implement any dichotomy on d + 1 or fewer patterns (in general position) and, when d is large, has a high probability of being able to learn a random dichotomy on up to about 2(d + 1) patterns. When N < 2(d + 1), generalization may be poor because there is a high probability that the training points will be linearly separable even if the function generating the labels is not; the network is underconstrained and the solution is unlikely to generalize well to new patterns. In linear regression, a common heuristic is to require N ≥ 3d training patterns to avoid overfitting.
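The 2(d + 1) capacity is easy to observe empirically. The sketch below is my own construction (not from the book): it draws Gaussian points with random ±1 labels and tests separability as a linear-programming feasibility problem using scipy.optimize.linprog; the fraction found separable should be near 1, 1/2, and 0 at N = d + 1, 2(d + 1), and 4(d + 1), respectively.

    import numpy as np
    from scipy.optimize import linprog

    def is_separable(X, y):
        # Feasibility LP: find w, b with y_i * (w . x_i + b) >= 1 for all i.
        # Such (w, b) exists iff the labeled points are linearly separable.
        N, d = X.shape
        signed = y[:, None] * np.hstack([X, np.ones((N, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=-signed, b_ub=-np.ones(N),
                      bounds=[(None, None)] * (d + 1), method="highs")
        return res.status == 0  # 0 = solved, 2 = infeasible

    rng = np.random.default_rng(0)
    d, trials = 10, 200
    for N in (d + 1, 2 * (d + 1), 4 * (d + 1)):
        hits = sum(is_separable(rng.standard_normal((N, d)),
                                rng.choice([-1.0, 1.0], size=N))
                   for _ in range(trials))
        print(N, hits / trials)   # approximately 1.0, 0.5, 0.0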

The assumptions behind this result should be noted: the points are presumed to be in general position with randomly chosen target labels. In real problems, the points will not necessarily be in general position, and if the labels are determined by a physical process there is an assumption that the label of a point can be inferred from its position. Obviously there are sets of N ≫ 2d points that are linearly separable (two well-separated clusters, for example), and there are sets of as few as 4 points, not in general position, that are not linearly separable (e.g., four coplanar points in three dimensions labeled in an exclusive-OR configuration).
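Both caveats can be checked with the same LP feasibility test used above (repeated here so the sketch is self-contained):

    import numpy as np
    from scipy.optimize import linprog

    def is_separable(X, y):
        N, d = X.shape
        signed = y[:, None] * np.hstack([X, np.ones((N, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=-signed, b_ub=-np.ones(N),
                      bounds=[(None, None)] * (d + 1), method="highs")
        return res.status == 0

    rng = np.random.default_rng(1)

    # Two well-separated clusters: N = 200 >> 2d = 4, yet trivially separable.
    X = np.vstack([rng.standard_normal((100, 2)) + 10,
                   rng.standard_normal((100, 2)) - 10])
    y = np.array([1.0] * 100 + [-1.0] * 100)
    print(is_separable(X, y))   # True

    # Four coplanar points in 3-D (not in general position) with XOR labels.
    X = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 0]], dtype=float)
    y = np.array([-1.0, 1.0, 1.0, -1.0])
    print(is_separable(X, y))   # False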