Chapter 4 - MLP Representational Capabilities

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

4.3 Size versus Depth

Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words "sufficiently large." Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer solution is very inefficient compared to solutions with more layers.

Certain functions containing disjoint decision regions cannot be realized exactly with only a single hidden layer of threshold units [255]. Certain functions can be implemented exactly by small networks with two hidden layers but require an infinite number of nodes to approximate with a single-hidden-layer network [255], [254]. Figure 4.6 shows an example from [255]. Gibson and Cowan [137] provide another example. Chester [80] describes a "pinnacle" function (1 at the origin and 0 elsewhere) for which a single-hidden-layer network needs O(1/ε) nodes to achieve O(ε) maximum error, whereas a two-hidden-layer network needs only 4 nodes. This uses a maximum error criterion, however, rather than mean squared error. For certain functions, single-hidden-layer networks may require very large weights or extreme precision [180].
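
A minimal sketch of why the second hidden layer helps here (this is not Chester's exact four-node construction; it uses five threshold units and a square bump, and the function names are illustrative): the first layer can only carve half-planes, and reading those half-plane indicators out linearly leaves nonzero residue away from the peak, whereas one additional threshold unit in a second hidden layer can AND them into a bump that is exactly zero elsewhere.

```python
import numpy as np

def step(z):
    """Heaviside threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(float)

def pinnacle_two_hidden(x, y, eps=0.1):
    """Square 'pinnacle': 1 on a small square around the origin, 0 elsewhere.
    First hidden layer: four half-plane detectors.  Second hidden layer: one
    threshold unit that fires only when all four agree (an AND); the output
    is then read off linearly."""
    h1 = np.stack([
        step( x + eps),   # x >= -eps
        step(-x + eps),   # x <= +eps
        step( y + eps),   # y >= -eps
        step(-y + eps),   # y <= +eps
    ])
    return step(h1.sum(axis=0) - 3.5)          # exactly 0 away from the square

def ridge_readout_one_hidden(x, y, eps=0.1):
    """The same four threshold units read out linearly (one hidden layer,
    no second nonlinearity).  This readout is 1 on the square but leaves a
    nonzero residue in the corner regions."""
    h1 = np.stack([
        step( x + eps), step(-x + eps),
        step( y + eps), step(-y + eps),
    ])
    return h1.sum(axis=0) - 3.0

xs = np.array([0.0, 5.0, -7.0])
ys = np.array([0.0, 0.0,  3.0])
print(pinnacle_two_hidden(xs, ys))       # [1. 0. 0.]
print(ridge_readout_one_hidden(xs, ys))  # [ 1.  0. -1.]
```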

Although these results suggest that two-hidden-layer networks are more powerful, one-hidden-layer networks may be adequate for many functions encountered in practice. One- and two-hidden-layer networks are compared empirically by de Villiers and Barnard [104]. For a fair comparison, the networks are configured to have approximately the same number of weights. Using conjugate gradient training and artificial clustered two-dimensional data (mixtures of Gaussians), no significant difference was found between the best solutions obtained by the two architectures. The one-hidden-layer nets were reported to have lower average error and perhaps to generalize better. For two-hidden-layer nets, the same authors report that training is easier if the hidden layers have approximately equal sizes. Similar results are reported by Huang and Lippmann [180]; no significant difference was seen in error rate or convergence time.
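
A rough sketch of how such a weight-matched comparison might be set up (the counting convention and function names below are illustrative, not taken from [104]; the equal hidden-layer sizes follow the report that equal sizes train more easily):

```python
import math

def weights_one_hidden(n_in, h, n_out):
    """Weight-and-bias count for an n_in - h - n_out network."""
    return (n_in + 1) * h + (h + 1) * n_out

def weights_two_hidden(n_in, h1, h2, n_out):
    """Weight-and-bias count for an n_in - h1 - h2 - n_out network."""
    return (n_in + 1) * h1 + (h1 + 1) * h2 + (h2 + 1) * n_out

def match_two_hidden(n_in, n_out, budget):
    """Pick equal hidden-layer sizes h1 = h2 = h whose weight count best
    approximates the given budget."""
    best = min(range(1, budget),
               key=lambda h: abs(weights_two_hidden(n_in, h, h, n_out) - budget))
    return best, best

# Example: 2-D inputs, 1 output, one hidden layer of 30 units ...
budget = weights_one_hidden(2, 30, 1)      # 121 weights
h1, h2 = match_two_hidden(2, 1, budget)    # ... vs. two hidden layers of ~9 units
print(budget, (h1, h2), weights_two_hidden(2, h1, h2, 1))
```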

Figure 4.6: Certain functions can be implemented exactly by small networks with two hidden layers but require an infinite number of nodes to approximate if constrained to a single hidden layer [255], [254]: (a) a simple function that can be implemented with two hidden layers, each containing two threshold units, (b) an approximation by a network with just one hidden layer. (From [255].)

 

4.3.1 Size versus Depth for Boolean Functions

Many results exist for Boolean logic functions, e.g., [347]. (See figure 4.7 for the elementary gates.) It is obvious that a net with one hidden layer of 2^n threshold units can compute any Boolean function of n binary inputs, since any Boolean function can be expressed in conjunctive (or disjunctive) normal form. Each hidden unit computes one of the 2^n product terms and the output unit ORs selected product terms to produce the desired result. An AND-OR array has this structure. This amounts to exhaustive enumeration of all positive cases and becomes impractical for large n.
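
A sketch of this exhaustive construction, assuming 0/1 inputs and the step convention step(z) = 1 for z >= 0 (helper names are illustrative):

```python
import itertools
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def build_dnf_network(f, n):
    """One-hidden-layer threshold network computing an arbitrary Boolean
    function f of n binary inputs.  Each hidden unit detects one minterm
    (one input pattern with f = 1); the output unit ORs the detectors."""
    minterms = [x for x in itertools.product([0, 1], repeat=n) if f(x)]
    # Hidden unit for pattern p: weight +1 where p_i = 1, -1 where p_i = 0;
    # the weighted sum reaches its maximum sum(p) only at x = p.
    W = np.array([[1 if b else -1 for b in p] for p in minterms], dtype=float)
    b = -np.array([sum(p) for p in minterms], dtype=float)
    def net(x):
        h = step(W @ np.asarray(x, dtype=float) + b)   # minterm detectors
        return step(h.sum() - 0.5)                     # OR of the detectors
    return net, len(minterms)

# Example: 3-input parity (odd number of ones) uses 4 of the 2^3 minterms.
parity = lambda x: sum(x) % 2
net, n_hidden = build_dnf_network(parity, 3)
print(n_hidden)                                        # 4
print([int(net(x)) for x in itertools.product([0, 1], repeat=3)])
# [0, 1, 1, 0, 1, 0, 0, 1]  -- matches parity on all 8 inputs
```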

More economical solutions can often be obtained by adding more layers. The number of nodes needed can be reduced by a factor of roughly 2^(n/2), because any Boolean function of n variables can be computed by a network with two hidden layers and O(2^(n/2)) threshold gates [347]. This is nearly optimal, as even a circuit of unbounded depth must still have exponential size [347], [125]. Notice that this is an exponential reduction in size.
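
To make the size reduction concrete (ignoring the constant hidden in the O(2^(n/2)) bound):

```python
# Exhaustive one-hidden-layer count vs. the two-hidden-layer bound:
# the saving itself grows exponentially with n.
for n in (10, 20, 30, 40):
    one_layer = 2 ** n
    two_layer = 2 ** (n // 2)
    print(f"n={n:2d}:  2^n = {one_layer:>17,}   2^(n/2) = {two_layer:>9,}   "
          f"ratio = {one_layer // two_layer:,}")
```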

Because a one-hidden-layer solution may be inefficient, there may be functions that cannot be implemented by a network of limited size using just one hidden layer. It is known that there are Boolean functions which cannot be computed by threshold networks with one hidden layer and a constant number of nodes, but which can be computed by threshold networks with two hidden layers and a constant number of nodes ([250]; cited in [99]).

Figure 4.7: AND and OR logic functions. Any logic function can be implemented by a circuit of AND, OR, and NOT gates. Threshold units are at least as powerful as logic circuits because they can implement all the necessary elementary functions, as well as others: (a) the AND function, (b) linear threshold unit implementation of AND, (c) the OR function, and (d) linear threshold unit implementation of OR. The NOT function can be obtained from a single-input unit with a large negative weight and a zero threshold.
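
A minimal sketch of the threshold-unit gates in figure 4.7; the particular weight and threshold values below are one common choice, not the only one:

```python
import numpy as np

def ltu(weights, threshold):
    """Linear threshold unit: outputs 1 when w.x >= threshold, else 0."""
    w = np.asarray(weights, dtype=float)
    return lambda *x: int(w @ np.asarray(x, dtype=float) >= threshold)

AND = ltu([1, 1], 1.5)   # both inputs must be 1 to reach the threshold
OR  = ltu([1, 1], 0.5)   # a single active input is enough
NOT = ltu([-1], 0.0)     # negative weight, zero threshold (as in the caption)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))    # 1 0
```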

"Symmetric" functions depend only on the sum of the inputs, Σixi; parity is one example. For arbitrary symmetric Boolean functions of n inputs, O (n) units are needed in depth-2 (one hidden layer) networks of linear threshold units (LTU), but only units are needed in depth-3 (two hidden layers) networks [348], [347]. For periodic symmetric functions such as parity, a depth/size tradeoff can be obtained at all depths. For parity (n), there is a depth (d + 1) LTU circuit of size O(dn1/d) [347]. But increasing the depth beyond 3 does not decrease size much; LTU gates are needed to compute general symmetric functions if there is no restriction on depth [347].

Another reason to use more than one hidden layer is to decrease the weight magnitudes. A one-hidden-layer implementation may require arbitrarily large weights. If permissible weight values are bounded (e.g., by hardware limitations), networks with more hidden layers may be more efficient. A depth-d circuit with exponential weights can be simulated by a depth-(d + 1) circuit with polynomially bounded weights at the cost of a polynomial increase in network size [347], [40]. A single linear threshold unit with exponential weights can be exchanged for a depth-3, polynomial-size circuit with polynomial weights [347], [41]. Anything computable by a depth-2 threshold circuit of polynomial size can be computed by a depth-3 threshold circuit of polynomial size with small weights ([143]; cited in [142]). Thus, if the range of weights is limited, networks with more than one hidden layer may be able to realize functions that cannot be realized by single-hidden-layer networks.
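
As a concrete illustration of a single threshold unit whose natural weights are exponentially large, consider comparing two n-bit binary numbers; the results cited above say such a unit can, in principle, be traded for a deeper circuit with polynomially bounded weights. The sketch below (illustrative names, not from the references) shows only the exponential-weight side.

```python
import numpy as np

def compare_ltu(n):
    """Single linear threshold unit computing COMPARISON(x, y) = 1 iff the
    n-bit binary number x is greater than the n-bit number y.  The natural
    weights are +2^i for the bits of x and -2^i for the bits of y, i.e.,
    exponentially large in n; the cited results imply the same function is
    computable by a deeper circuit with only polynomially bounded weights."""
    powers = 2.0 ** np.arange(n - 1, -1, -1)     # 2^(n-1), ..., 2, 1
    w = np.concatenate([powers, -powers])        # weighted sum = value(x) - value(y)
    return lambda x, y: int(w @ np.concatenate([x, y]) >= 1.0)

cmp4 = compare_ltu(4)
x = np.array([1, 0, 1, 0.])    # binary 1010 = 10
y = np.array([1, 0, 0, 1.])    # binary 1001 = 9
print(cmp4(x, y), cmp4(y, x))  # 1 0
```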

Caveats The results just stated say there are functions for which single-hidden-layer networks are less efficient than networks with more hidden layers. Of course, there are still functions where small single-hidden-layer networks are optimal and additional hidden layers are not useful. Single-hidden-layer networks may need large numbers of nodes to compute arbitrary functions, but small networks may suffice for particular functions.

Many of these results are statements about the power of a class of networks rather than guarantees that a particular network will be able to learn a particular set of data and generalize accurately. Many are based on asymptotic analyses valid only for large data sets or large input dimensions. Many are statements about the existence of a solution rather than constructive statements about how to find the necessary set of weights. Most depend on particular assumptions that may be violated in a given problem.

In real problems, training sets may be rather small and nonrandomly sampled, data distributions may have arbitrary forms, and the target function is unknown. It may be problematic just to determine if the data fits the assumptions. Although representational capability results can be helpful in putting bounds on the size and configuration of a network, they do not guarantee that a particular network and training algorithm will be able to learn a particular set of samples of a given function. Even if it is guaranteed that a particular network can exactly classify N training points (using some set of weights), this does not necessarily imply that the system will generalize well to new points. Finally, many of the results for Boolean functions are for exact implementation and may not hold when approximation is allowed.