
Chapter 4 - MLP Representational Capabilities

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

4.4 Capacity versus Size

Another major question in designing a network is how many nodes to place in each layer. Universal approximation results say that a sufficiently large network can fit almost any function we are likely to want to approximate, but they do not deal with the problem of training a network to fit a finite set of discrete data points. Obviously, we cannot use infinitely many nodes in practice, so it is useful to have bounds on the number needed to fit a particular function. Even if arbitrarily large networks were allowed, it might not be possible to use them effectively given the limited amount of information contained in the training samples. Once the network grows past a certain size, generalization criteria become the limiting factor on performance; a large network can often fit the training data exactly, but it is unlikely to do so in a way that also fits the underlying function that generated the data.

Obviously, if we have m training points, a network with a single layer of m - 1 hidden units can learn the data exactly, since a line can always be found onto which the points project uniquely [80]. (Set each weight vector parallel to this line, make the magnitudes large so the sigmoids approach step functions, adjust the threshold of node k, k = 1, ..., m - 1, so it separates points k and k + 1 in order of their projections, and assign the output weights based on the difference in target values for points k + 1 and k.) Of course, this is inefficient and generalizes very badly; it uses as much storage as a nearest neighbor classifier, takes about the same time to evaluate on a serial computer, and probably generalizes worse. (Poor generalization could be expected simply on the grounds that the number of weights would be much larger than the number of training samples.)
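The construction can be made concrete with a short sketch in Python (numpy). The function name, the random choice of projection direction, and the steepness and clipping constants are implementation choices for illustration, not part of the original argument; this is a minimal sketch of the memorization construction, not an efficient learning procedure.

import numpy as np

def memorize(X, y, seed=None):
    """Build a one-hidden-layer net of m - 1 near-step sigmoid units that
    reproduces the m targets y exactly on the m distinct input points X."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    # Find a direction onto which the m points project uniquely (a random
    # direction works with probability 1 if the points are distinct).
    while True:
        w = rng.standard_normal(d)
        p = X @ w
        if len(np.unique(p)) == m:
            break
    order = np.argsort(p)                        # index points by their projections
    p_sorted, y_sorted = p[order], y[order]
    # Threshold of hidden unit k lies halfway between projections k and k + 1, so
    # (with a steep sigmoid) unit k turns on exactly for points k + 1, ..., m - 1.
    thresholds = 0.5 * (p_sorted[:-1] + p_sorted[1:])
    steepness = 50.0 / np.diff(p_sorted).min()   # steep enough to act as a step
    # Output weight k is the difference in targets for points k + 1 and k;
    # the output bias is the target of the first point.
    v = np.diff(y_sorted)
    b_out = y_sorted[0]

    def net(x):
        z = np.clip(steepness * (x @ w - thresholds), -60.0, 60.0)
        h = 1.0 / (1.0 + np.exp(-z))             # hidden layer: near-step sigmoids
        return b_out + h @ v                     # output: running sum of target differences

    return net

# The network reproduces the training targets, but it needs m - 1 units to do so.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((10, 3)), rng.standard_normal(10)
net = memorize(X, y, seed=1)
print(np.allclose([net(x) for x in X], y, atol=1e-6))   # True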

In general, more efficient solutions are sought. Most interesting functions have structure so each node should be able to account for more than just one training sample.

The sections that follow summarize some results on the number of patterns representable by multilayer networks of a given size. Results are simply listed in most cases; the reader should consult the references for details. Although these results may be used as guidelines in selecting a network, they should not be interpreted as inviolable rules. In many cases, they put bounds on the number of independent samples a given net can represent. The bounds may be loose, however, and actual data may be correlated, so smaller networks may suffice for particular problems. Furthermore, we do not need the network to be able to fit any function on the samples, just the particular function that generated the data. We would like the network to fit the data and approximate the true target function, but the true target function is usually unknown, and a network powerful enough to represent every possible function on the data is probably more powerful than necessary. If the results say a given network can fit the m training points exactly, there is a danger of overfitting, so smaller networks should probably be considered. Generalization depends on many factors, so formulas based only on network size and number of samples cannot predict generalization performance.

4.4.1 Number of Cells Created by m Hyperplanes

For a layered network to realize an arbitrary dichotomy on m points in a d-dimensional space, each point must be uniquely represented in terms of the activities of the first hidden layer units.

Consider a network of linear threshold units. Each unit in the first hidden layer has an associated hyperplane that partitions the input space into two half-spaces. Two hyperplanes can divide the space into four quadrants, three hyperplanes may produce eight octants, and so on. In general, h hyperplanes could produce up to 2^h different regions, or cells (assuming h ≤ d, the input dimension). The hyperplanes form the cell boundaries. Within each cell, the vector of hidden layer outputs is constant (since the input must cross at least one hyperplane boundary for the output to change), so two points in the same cell are indistinguishable in terms of the hidden layer outputs. Thus, to realize an arbitrary dichotomy on m points, each point must lie in a different cell. If the network does not have enough hidden units to create at least m cells, it will not be able to realize some dichotomies.
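A small numerical illustration of this point follows (Python with numpy; the dimensions, the number of hyperplanes, and the sampling ranges are arbitrary choices). Each input point is mapped to its vector of hidden-unit signs; points that share the same sign vector lie in the same cell and cannot be separated by any output layer.

import numpy as np

rng = np.random.default_rng(0)
d, h, m = 2, 4, 200          # input dimension, hidden units (hyperplanes), points

# Random hyperplanes: hidden unit i is active where W[i] @ x + b[i] > 0.
W = rng.standard_normal((h, d))
b = rng.standard_normal(h)

# Random input points.
X = rng.uniform(-3, 3, size=(m, d))

# Hidden-layer output pattern (the "cell code") for every point.
codes = (X @ W.T + b > 0).astype(int)          # shape (m, h)

# Points with identical codes lie in the same cell; a dichotomy that must
# separate two such points cannot be realized by any output layer.
# Note: with h = 4 hyperplanes in d = 2 dimensions, only C(4, 2) = 11 of the
# 2**4 = 16 possible sign patterns can occur as cells (see the formula below).
n_cells_occupied = len(np.unique(codes, axis=0))
print(f"{m} points fall into {n_cells_occupied} distinct cells")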

An expression for the number of cells created by intersections of hyperplanes is derived by Makhoul, El-Jaroudi, and Schwartz [255]. The results are similar to Cover's [87] formula for the number of linearly separable dichotomies (section 3.3). Let C(h, d) be the number of cells formed by h planes in general position in the input space of d dimensions. (h planes are in "general position" in d space if none of them are parallel and no more than d planes intersect at a point.) A recursive formula is

$C(h, d) = C(h - 1, d) + C(h - 1, d - 1)$    (4.3)

with boundary conditions

$C(1, d) = 2, \quad d \ge 1$    (4.4)
$C(h, 1) = h + 1, \quad h \ge 1$    (4.5)

This gives [255]

$C(h, d) = \sum_{i=0}^{d} \binom{h}{i}$    (4.6)

where $\binom{h}{i} = 0$ for $i > h$.

For h ≤ d, each new hyperplane can be positioned to split all existing cells, so up to 2^h cells can be created. For h > d, C(h, d) grows more slowly because each new hyperplane cannot split every existing cell. For h >> d, the last term $\binom{h}{d}$ dominates and

$C(h, d) \approx \frac{h^d}{d!}$    (4.7)

The number of cells obtained by adding a new hyperplane (in terms of the number of existing cells) is

$C(h + 1, d) = C(h, d) + C(h, d - 1)$    (4.8)

and the number of cells obtained when adding a new dimension is

$C(h, d + 1) = C(h, d) + \binom{h}{d + 1}$    (4.9)

Expressions for the numbers of open (unbounded) and closed (bounded) cells are also given in [255]. Two hidden layers are required to implement all of the 2^C(h,d) possible binary functions on the C(h, d) cells [255].
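The cell count is easy to check numerically. The sketch below (Python) evaluates the closed form (4.6) as reconstructed above, verifies it against the recursion (4.3)-(4.5), and illustrates the h^d/d! approximation (4.7); since the equations here are reconstructions, treat the code as a sanity check of that reading rather than a transcription of [255].

from math import comb, factorial
from functools import lru_cache

def cells_closed_form(h, d):
    """C(h, d) = sum_{i=0}^{d} (h choose i); comb(h, i) is 0 for i > h."""
    return sum(comb(h, i) for i in range(d + 1))

@lru_cache(maxsize=None)
def cells_recursive(h, d):
    """Recursion C(h, d) = C(h-1, d) + C(h-1, d-1) with
    boundary conditions C(1, d) = 2 and C(h, 1) = h + 1."""
    if h == 1:
        return 2
    if d == 1:
        return h + 1
    return cells_recursive(h - 1, d) + cells_recursive(h - 1, d - 1)

# The two agree, and C(h, d) = 2**h whenever h <= d.
assert all(cells_closed_form(h, d) == cells_recursive(h, d)
           for h in range(1, 12) for d in range(1, 12))
print(cells_closed_form(3, 3), 2**3)      # 8 octants from 3 planes in 3-D
print(cells_closed_form(3, 2))            # 3 lines create only 7 cells in 2-D

# For h >> d the last term dominates: C(h, d) ~ h**d / d!.
h, d = 1000, 3
print(cells_closed_form(h, d), h**d // factorial(d))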

Although this puts bounds on the number of nodes (hyperplanes) needed to form an arbitrary dichotomy on m points, in practical problems we have one specific dichotomy to implement. The data are likely to have structure and large savings can often be obtained because a single cell can contain entire clusters of points belonging to the same class. Also, when the input data are highly correlated, the effective dimension d' may be less than d.

4.4.2 MLP Capacity I

Baum [33], [32] studied the number of examples needed to train a network with N nodes and W weights for a given error rate ε. These results are independent of the training algorithm and are based on an estimate of the VC dimension of the network, a relation between a system's complexity and the amount of information that must be provided to constrain it (see section 15.4).

If a network can be trained with

$m \ge \frac{32W}{\epsilon} \ln \frac{32N}{\epsilon}$

random examples of the desired mapping so that at least a fraction 1 - ε/2 are correctly classified, then it will almost certainly correctly classify a fraction 1 - ε of test examples drawn from the same distribution (for 0 < ε ≤ 1/8) [33], [32]. This is an upper bound on the number of training examples needed. A lower bound for MLPs with one hidden layer is Ω(W/ε); that is, a network trained on fewer than Ω(W/ε) examples will fail (some fraction of the time) to find a set of weights that classifies a fraction 1 - ε of future examples correctly.

This agrees with the W/ε rule of thumb given by Widrow for Adaline networks. Similar results are also obtained in linear regression. Roughly, if the network has W adjustable weights and an error rate ε is desired, it is necessary to have on the order of W/ε samples.

Assuming that the function can be learned, this gives an estimate of the network size necessary to learn m training patterns. For input patterns chosen randomly from the domain {-1, +1}^n, a network with one hidden layer of linear threshold units can learn m ≤ 2^(n/3) training patterns.
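To give a feel for the numbers, here is a small calculation (Python) of the sample-size estimates above. The constants follow the bound quoted earlier in this section, the bias-counting convention in the example network is an assumption, and both figures should be read as order-of-magnitude guides rather than exact requirements.

from math import log, ceil

def baum_haussler_samples(W, N, eps):
    """Sufficient number of examples m >= (32 W / eps) ln(32 N / eps) for
    generalization error eps, for a net with W weights and N threshold units."""
    return ceil((32 * W / eps) * log(32 * N / eps))

def rule_of_thumb_samples(W, eps):
    """Widrow-style rule of thumb: on the order of W / eps examples."""
    return ceil(W / eps)

# Example: 10 inputs, 20 hidden units, 1 output, 10% error rate.
n_in, n_hid, n_out, eps = 10, 20, 1, 0.10
W = (n_in + 1) * n_hid + (n_hid + 1) * n_out   # weights, counting bias weights
N = n_hid + n_out                              # threshold units
print(W, baum_haussler_samples(W, N, eps), rule_of_thumb_samples(W, eps))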

4.4.3 MLP Capacity II

Baum [30] makes the following points:

  • An MLP with one hidden layer of N - 1 nodes is capable of computing an arbitrary dichotomy on N points [284].

  • A network that implements f(x) on N points in general position in d dimensions must have at least N/d units in the first hidden layer [30]. If the points are in nongeneral position (structured data), many fewer units may be necessary.

  • A one-hidden-layer net with ⌈N/d⌉ hidden units can compute an arbitrary f(x) on N points in general position [30]. Small changes in the input vector may cause large changes in the output, however, so good generalization is not guaranteed.

  • The number of linearly separable dichotomies of N points in d dimensions, for N ≥ 3d, is less than 4N^d/d! [30].

  • No feedforward net (of the type considered) can compute an arbitrary map from N d-dimensional vectors into the e-dimensional hypercube unless it has a number of connections Nc on the order of Ne/log2 N or more [30].

  • For a one-hidden-layer network with d inputs, G hidden units, and e outputs, the number of connections is Nc = G(e + d). Thus, a one-hidden-layer network cannot compute arbitrary functions on a set of N vectors in d dimensions unless it has on the order of Ne/((d + e) log2 N) hidden units.

  • No MLP can compute an arbitrary map of this kind, no matter how many layers it has, unless it has on the order of √(Ne/log2 N) units. As for all the items in this section, many fewer units may be needed if the training data is structured.

  • A one-hidden-layer net with N hidden units can represent an arbitrary mapping of N points into the e-dimensional hypercube.

  • A one-hidden-layer net with hidden units is capable of arbitrary binary mappings.

  • There are 2^(Ne) mappings of N d-dimensional vectors into the e-dimensional hypercube; comparing this count with the number of functions a given network can realize on the N points yields the necessary conditions above (see the short example following this list).
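The counting flavor of these results can be illustrated with a short sketch (Python). It encodes the order-of-magnitude necessary conditions quoted above, as reconstructed here from the counting argument comparing the 2^(Ne) target mappings against the roughly N^Nc functions a threshold network with Nc connections can realize on N points, together with the ⌈N/d⌉ figure that suffices for a single dichotomy. The function names and the neglect of constant factors are illustrative, not Baum's precise statements.

from math import ceil, log2

def min_connections(N, e):
    """Necessary (order of magnitude): a threshold net realizes at most roughly
    N**Nc functions on N points, so computing all 2**(N*e) mappings requires
    Nc >= N*e / log2(N) connections."""
    return ceil(N * e / log2(N))

def min_hidden_one_layer(N, d, e):
    """With one hidden layer, Nc = G*(d + e), so G >= N*e / ((d + e) * log2(N))."""
    return ceil(N * e / ((d + e) * log2(N)))

def sufficient_hidden_dichotomy(N, d):
    """Sufficient for a single dichotomy on N points in general position: ceil(N/d)."""
    return ceil(N / d)

N, d, e = 1000, 10, 1
print(min_connections(N, e))              # roughly 101 connections needed
print(min_hidden_one_layer(N, d, e))      # roughly 10 hidden units needed
print(sufficient_hidden_dichotomy(N, d))  # 100 hidden units suffice for any dichotomy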

4.4.4 MLP Capacity III

Widrow and Lehr [403] consider a fully connected feedforward network of linear threshold units with Nx inputs (excluding the bias), Nh hidden nodes in a single layer, and Ny outputs. There are Nw weights and Np patterns in general position to be learned.

A bound on the number of weights needed is

(4.10)

A loose upper bound for the number of hidden nodes Nh required is

(4.11)

When Nx, Nh > 5Ny (that is, when there are considerably more inputs and hidden nodes than outputs), the deterministic capacity (the number of patterns that can certainly be stored) is bounded below by Nw/Ny. When Nw >> Ny (e.g., by a factor of 1000), the capacity is bounded above by

Thus

(4.12)

where k1 and k2 are small constants when Nx + Nh >> Ny.

For good generalization, the number of training patterns should be at least several times larger than the capacity of the network. Otherwise, the amount of data will be insufficient to constrain the network.
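As a rough illustration (Python), the lower bound Nw/Ny and this recommendation can be turned into a simple sizing heuristic. The bias-counting convention and the safety factor of 4 are illustrative assumptions, not values from [403]; this is a sketch of the reasoning, not a prescription.

def n_weights(n_x, n_h, n_y):
    """Weights in a fully connected one-hidden-layer net, counting bias weights."""
    return (n_x + 1) * n_h + (n_h + 1) * n_y

def capacity_lower_bound(n_x, n_h, n_y):
    """The deterministic capacity is bounded below by Nw / Ny patterns."""
    return n_weights(n_x, n_h, n_y) / n_y

def recommended_training_set(n_x, n_h, n_y, factor=4):
    """Training patterns should be several times the capacity (factor is illustrative)."""
    return int(factor * capacity_lower_bound(n_x, n_h, n_y))

print(n_weights(16, 10, 2))                 # 192 weights
print(capacity_lower_bound(16, 10, 2))      # at least 96 patterns storable
print(recommended_training_set(16, 10, 2))  # aim for roughly 384 or more patterns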

4.4.5 MLP Capacity IV

The results of section 3.3 show that, on average, a single linear threshold unit with n inputs can be made to correctly classify up to m = 2n random binary patterns before the probability of error reaches 1/2. In other words, a linear threshold unit has a probabilistic capacity of m = 2n patterns.
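This probabilistic capacity can be checked numerically. The sketch below (Python with numpy and scipy) estimates the probability that m random ±1 patterns with random binary labels are linearly separable; the fraction drops to roughly 1/2 near m = 2n. The sample sizes and the use of a linear-programming feasibility test for separability are implementation choices.

import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Feasibility LP: is there (w, b) with y_i (w . x_i + b) >= 1 for all i?"""
    m, n = X.shape
    # Variables: w (n values) and b (1 value); minimize 0 subject to
    # -y_i (w . x_i + b) <= -1, with all variables free.
    A = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0    # status 0: a feasible (hence separating) solution exists

rng = np.random.default_rng(0)
n, trials = 20, 200
for m in (n, 2 * n, 3 * n):
    hits = 0
    for _ in range(trials):
        X = rng.choice([-1.0, 1.0], size=(m, n))
        y = rng.choice([-1.0, 1.0], size=m)
        hits += separable(X, y)
    print(f"m = {m:3d}: fraction separable ~ {hits / trials:.2f}")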

Mitchison and Durbin [269] study the capacity of an MLP with one hidden layer. As above, the capacity is defined as the number of random input/output patterns that can be stored before the probability of error on recall reaches 1/2. They find that for an MLP with n inputs, one layer of h hidden units, and s output units, where s ≤ h ≤ n, the capacity m satisfies

(4.13)

where t = 1 + h/s, n ≥ 2, and t ≥ 2 [269].

If there is a single output that is a fixed Boolean function of the hidden units, then m is O(nh log h). Comparing this to the case above when s = 1 and thus t = 1 + h shows that allowing the output to be a variable function of the hidden units has an effect on the capacity equivalent to adding one hidden unit. Because a one-hidden-layer network can be connected in such a way that it has the same response as a single-layer network with the same number of inputs and outputs, the lower bound of m ≥ 2n still applies. For the special case s = h = n, they find 2n ≤ m ≤ 9.329n.

Note that the capacity bounds are defined probabilistically for random functions on a randomly chosen set of points; the actual number of example pairs of a particular function that can be learned by a particular network will depend heavily on the function and how the examples are chosen. These are limiting bounds for the capacity; m certainly exceeds the lower bound and is certainly less than the upper bound. The actual capacity that can be achieved is less than the upper bound, possibly by a large amount.