
Chapter 3 - Single-Layer Networks

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

3.3 Hyperplane Capacity

What is the probability that a random set of points with random binary labels is linearly separable? Cover [87] provided the following result. Given N points in a d-dimensional input space, there are 2^N possible ways of labeling the points 0 or 1. Each of these forms a dichotomy, a division of the N points into two classes. A dichotomy is linearly separable if all the 0s can be separated from all the 1s by a hyperplane. It is homogeneously linearly separable if the points can be separated by a hyperplane passing through the origin. A linear separation of N points in d dimensions is a homogeneous linear separation in d + 1 dimensions because the offset of a hyperplane that does not pass through the origin can be absorbed into an extra bias weight.
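Written out, the bias absorption is the identity

\[ w^\top x + b = \begin{bmatrix} w \\ b \end{bmatrix}^\top \begin{bmatrix} x \\ 1 \end{bmatrix}, \]

so a hyperplane with bias in d dimensions acts exactly like a hyperplane through the origin in d + 1 dimensions on the augmented points (x, 1).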

Cover [87] defines the capacity of a hyperplane as the number of dichotomies it can separate. For N points in general position (defined below) in a Euclidean space of dimension d, the number C(N, d) of linearly separable dichotomies (equivalently, the number of homogeneously linearly separable dichotomies in d + 1 dimensions) is

\[ C(N, d) = 2 \sum_{k=0}^{d} \binom{N-1}{k} \tag{3.7} \]

This result is for N points in general position. If the N points are not in general position, the number of linearly separable dichotomies may be much lower. A set of points in d dimensions is in general position if every subset of d + 1 or fewer points is affinely independent. For N > d, this requires that no set of d + 1 points lie on a (d - 1)-dimensional hyperplane. In d = 3 dimensions, for example, no set of 4 points may be coplanar. For N ≤ d, N points are in general position if no (N - 2)-dimensional hyperplane contains them all.
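As a quick numerical sketch (my own illustration, not from the text; the name count_dichotomies is invented), equation 3.7 can be evaluated directly in Python:

    from math import comb

    def count_dichotomies(N: int, d: int) -> int:
        # C(N, d): linearly separable dichotomies of N points in
        # general position in d dimensions (equation 3.7).
        return 2 * sum(comb(N - 1, k) for k in range(d + 1))

    # All 2^N dichotomies are separable while N <= d + 1:
    assert count_dichotomies(3, 2) == 8    # 2^3
    assert count_dichotomies(4, 2) == 14   # 14 of 2^4 = 16

For 4 points at the corners of a square, for example, the two missing dichotomies are the two exclusive-OR labelings.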

C(N, d) can be computed recursively with the following relations:

\[ C(1, d) = 2, \qquad C(N, 1) = 2N, \qquad C(N + 1, d) = C(N, d) + C(N, d - 1). \tag{3.8} \]

The first relation says that a single point can be classified in two ways. The second says that N points on a line can be classified by a linear separator in 2N ways: the separator can fall in any of the N - 1 intervals between points, and it can be oriented in two ways so that the zeros fall on one side or the other. Adding the two cases where all points are labeled zero or all are labeled one gives the total 2N.

The recurrence relation is obtained as follows. Suppose N points in d dimensions form C(N, d) dichotomies and we add another point p. This divides the existing dichotomies into two classes:

1. dichotomies that cannot be realized by any separating hyperplane passing through p: each extends to exactly one dichotomy of the N + 1 points, because every hyperplane realizing it places p on the same fixed side, and

2. dichotomies that can be realized by a hyperplane passing through p: each extends to two dichotomies of the N + 1 points, because a small shift of that hyperplane can place p on either side.

If M1 and M2 are the numbers of dichotomies in the two classes, then the number of dichotomies in the new system is C(N + 1, d) = M1 + 2M2. But M1 + M2 = C(N, d), so C(N + 1, d) = C(N, d) + M2. However, M2 = C(N, d - 1), because constraining the hyperplane to pass through p removes one degree of freedom and so is equivalent to reducing the dimension d to d - 1. Substitution gives the recurrence relation C(N + 1, d) = C(N, d) + C(N, d - 1).
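The recurrence and its base cases are easy to check against the closed form; a small sketch (my own, not from the text):

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def C_rec(N: int, d: int) -> int:
        # C(N, d) computed from the relations of equation 3.8.
        if N == 1:
            return 2
        if d == 1:
            return 2 * N
        return C_rec(N - 1, d) + C_rec(N - 1, d - 1)

    # Agreement with equation 3.7 over a small grid:
    for N in range(1, 12):
        for d in range(1, 8):
            assert C_rec(N, d) == 2 * sum(comb(N - 1, k) for k in range(d + 1))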

Repeated application of the recurrence to the terms on the right yields

\[ C(N, d) = \sum_{k=0}^{N-1} \binom{N-1}{k}\, C(1, d - k) \tag{3.9} \]

and (3.7) follows on noting that [87]

\[ C(1, m) = \begin{cases} 2 & m \ge 0 \\ 0 & m < 0. \end{cases} \tag{3.10} \]

Returning to Equation 3.7, all dichotomies on N ≤ d + 1 points in general position are linearly separable in d dimensions. (All possible labelings of 3 points in 2 dimensions are linearly separable, for example.) For N > d + 1, only C(N, d) of the 2^N possible dichotomies are linearly separable. The probability that a randomly chosen dichotomy is linearly separable is then

\[ f(N, d) = \frac{C(N, d)}{2^N} = 2^{1-N} \sum_{k=0}^{d} \binom{N-1}{k} \tag{3.11} \]
Figure 3.8: Capacity of a hyperplane. Shown is the probability that a random dichotomy on N points in d dimensions is linearly separable, for various values of d. The probability is greater than 1/2 for N < 2(d + 1) and less than 1/2 for N > 2(d + 1). As d increases, the transition from high to low probability becomes steeper and approaches a step function.

This probability is plotted as a function of N/(d + 1) in figure 3.8 for a few values of d. It can be shown that C(N, d) = 2^(N-1) at N = 2(d + 1), so f(N, d) = 1/2 there. The probability is greater than 1/2 as long as N < 2(d + 1) and less than 1/2 for N > 2(d + 1). As d becomes large, the transition becomes steeper: almost all dichotomies of N < 2(d + 1) points are linearly separable, while almost all dichotomies of more points are not. At N = 2(d + 1), one-half of the dichotomies are linearly separable, which suggests N = 2(d + 1) points as the separating capacity of the hyperplane.
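The crossing at N = 2(d + 1) and the sharpening transition can be reproduced directly (a sketch under the same notation as equation 3.11):

    from math import comb

    def f(N: int, d: int) -> float:
        # Equation 3.11: probability that a random dichotomy of N points
        # in general position in d dimensions is linearly separable.
        return 2.0 ** (1 - N) * sum(comb(N - 1, k) for k in range(d + 1))

    for d in (1, 5, 25):
        N_cap = 2 * (d + 1)
        # Sample at N/(d + 1) = 1.5, 2, and 2.5:
        print(d, f(int(0.75 * N_cap), d), f(N_cap, d), f(int(1.25 * N_cap), d))

For d = 1 this prints 0.75, 0.5, and 0.3125; for d = 25 it prints roughly 0.98, 0.5, and 0.05, showing the step sharpening as a function of N/(d + 1).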

On average, a single linear threshold unit with d weights and a threshold can be made to correctly classify up to 2(d + 1) randomly labeled patterns before the probability that it can classify all of them correctly falls to 1/2 [87]. For large N, only a tiny fraction of the possible mappings are linearly separable. In terms of generalization, this means that a d-input linear threshold unit can implement any dichotomy on d + 1 or fewer patterns (in general position) and, when d is large, has a high probability of being able to learn a random dichotomy on up to about 2(d + 1) patterns. When N < 2(d + 1), generalization may be poor because there is a high probability that the training points will be linearly separable even if the function generating the labels is not; the network is underconstrained and the solution is unlikely to generalize well to new patterns. In linear regression, a common heuristic is to require N ≥ 3d training patterns to avoid overfitting.
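The 2(d + 1) capacity is easy to observe empirically. The sketch below is my own construction (not from the book): it draws Gaussian points with random ±1 labels and tests separability as a linear-programming feasibility problem using scipy.optimize.linprog; the fraction found separable should be near 1, 1/2, and 0 at N = d + 1, 2(d + 1), and 4(d + 1), respectively.

    import numpy as np
    from scipy.optimize import linprog

    def is_separable(X, y):
        # Feasibility LP: find w, b with y_i * (w . x_i + b) >= 1 for all i.
        # Such (w, b) exists iff the labeled points are linearly separable.
        N, d = X.shape
        signed = y[:, None] * np.hstack([X, np.ones((N, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=-signed, b_ub=-np.ones(N),
                      bounds=[(None, None)] * (d + 1), method="highs")
        return res.status == 0  # 0 = solved, 2 = infeasible

    rng = np.random.default_rng(0)
    d, trials = 10, 200
    for N in (d + 1, 2 * (d + 1), 4 * (d + 1)):
        hits = sum(is_separable(rng.standard_normal((N, d)),
                                rng.choice([-1.0, 1.0], size=N))
                   for _ in range(trials))
        print(N, hits / trials)   # approximately 1.0, 0.5, 0.0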

The assumptions behind this result should be noted: the points are presumed to be in general position with randomly chosen target labels. In real problems, the points will not necessarily be in general position, and if the labels are determined by a physical process there is an assumption that the label of a point can be inferred from its position. Obviously there are sets of N ≫ 2d points that are linearly separable (two well-separated clusters, for example), and there are sets of as few as 4 points, not in general position, that are not linearly separable (e.g., four coplanar points in three dimensions labeled in an exclusive-OR configuration).
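Both caveats can be checked with the same LP feasibility test used above (repeated here so the sketch is self-contained):

    import numpy as np
    from scipy.optimize import linprog

    def is_separable(X, y):
        N, d = X.shape
        signed = y[:, None] * np.hstack([X, np.ones((N, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=-signed, b_ub=-np.ones(N),
                      bounds=[(None, None)] * (d + 1), method="highs")
        return res.status == 0

    rng = np.random.default_rng(1)

    # Two well-separated clusters: N = 200 >> 2d = 4, yet trivially separable.
    X = np.vstack([rng.standard_normal((100, 2)) + 10,
                   rng.standard_normal((100, 2)) - 10])
    y = np.array([1.0] * 100 + [-1.0] * 100)
    print(is_separable(X, y))   # True

    # Four coplanar points in 3-D (not in general position) with XOR labels.
    X = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 0]], dtype=float)
    y = np.array([-1.0, 1.0, 1.0, -1.0])
    print(is_separable(X, y))   # False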