
Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

Chapter 12: Constructive Methods

Overview

One of the major tasks in the design of a network is the selection of an architecture and its configuration. How many layers should the network have, with how many nodes in each layer? Should every node in a layer connect to every node in the following layer? Obviously, the best structure depends on the problem, and performance may be poor if the structure is inappropriate. If there are too few units, the final error is likely to be large; training may not converge or may be very sensitive to initial conditions. If there are too many units, training times may be long and generalization may suffer.

Normally we just want to find a structure that works for the given problem. Although this is much easier than determining the theoretically optimum architecture, it may still be very difficult to decide a priori what architecture and size are appropriate for a problem. Some heuristics are available, but there are no dependable general rules for choosing a structure before training. A common ad hoc approach is to experiment with different configurations until one is found that works well. Unfortunately, this may be time consuming if many networks have to be tested before an adequate one is found.

Constructive methods attempt to adapt the size of the network to the problem by starting with a small network and adding layers and units as needed until a solution is found. The major advantage is that there is no need to make an a priori estimate of the correct network size. An unfortunate choice will not immediately condemn the network to failure and the trial-and-error search for a good size is avoided.
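To make the basic loop concrete, the following Python sketch grows a one-hidden-layer tanh network trained by plain gradient descent, adding one hidden unit at a time until the training error falls below a target. It is only a schematic illustration of the grow-until-good-enough idea, not one of the published algorithms discussed later in this chapter; the learning rate, epoch count, and error goal are arbitrary placeholders.

    import numpy as np

    # Minimal sketch of a generic constructive loop: start with one tanh hidden
    # unit, train by plain gradient descent, and add units until the training
    # error is acceptably small or a size limit is reached.

    rng = np.random.default_rng(0)

    def init_net(n_in, n_hidden):
        return {"W": 0.1 * rng.standard_normal((n_hidden, n_in)),
                "v": 0.1 * rng.standard_normal(n_hidden)}

    def forward(net, X):
        return np.tanh(X @ net["W"].T) @ net["v"]

    def mse(net, X, y):
        return np.mean((forward(net, X) - y) ** 2)

    def train(net, X, y, epochs=2000, lr=0.05):
        for _ in range(epochs):
            H = np.tanh(X @ net["W"].T)              # hidden activations
            err = H @ net["v"] - y                   # output error per example
            grad_v = H.T @ err / len(y)
            grad_W = ((err[:, None] * net["v"]) * (1 - H ** 2)).T @ X / len(y)
            net["v"] -= lr * grad_v
            net["W"] -= lr * grad_W
        return net

    def add_unit(net, n_in):
        # The new unit starts with a zero output weight, so the current mapping
        # (and everything already learned) is preserved when it is added.
        net["W"] = np.vstack([net["W"], 0.1 * rng.standard_normal((1, n_in))])
        net["v"] = np.append(net["v"], 0.0)
        return net

    def constructive_fit(X, y, error_goal=0.01, max_hidden=20):
        net = init_net(X.shape[1], 1)                # start with a small network
        while True:
            train(net, X, y)
            if mse(net, X, y) <= error_goal or net["v"].size >= max_hidden:
                return net                           # acceptable error or size limit
            add_unit(net, X.shape[1])                # grow and continue training

Because each new unit enters with a zero output weight, adding it leaves the current input-output mapping unchanged, so whatever the smaller network has already learned is retained.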

Theoretical Support There are good theoretical reasons for considering constructive algorithms. As noted in section 5.7, a common complaint about back-propagation is that it is too slow. The training time also appears to grow quickly as the problem size increases. Judd [204], [202] has shown that the loading problem is NP-complete. That is, the training time scales exponentially with network size in the worst case. (The loading problem is the task of finding weight values such that the network imitates the training data, i.e., the task of "loading" the data into the network.) Blum and Rivest [35] showed that the problem is NP-complete even for networks containing as few as three neurons.

These results are an indication of the intrinsic difficulty of the computational problem, independent of the training algorithm. All algorithms that deal with the task as it is posed, no matter how efficient, face the same exponential scaling behavior and become impractical on very large problems. These results make it appear "unlikely that any algorithm which simply varies weights on a net of fixed size and topology can learn in polynomial time" [31], [201]. Baum [31], however, suggests that the difficulty is due to the constraint of a fixed network structure that only allows the algorithm to adjust existing weights. Algorithms with the freedom to add units and weights "can solve in polynomial time any learning problem that can be solved in polynomial time by any algorithm whatever. In this sense, neural nets are universal learners, capable of learning any learnable class of concepts" [31], [201].

A trivial example is an algorithm that simply allocates a node for every example in the training set to create a network that functions as a lookup table. Of course, more efficient solutions are generally preferred. Several types of non-MLP neural networks, for example, ART and radial basis function networks, can be thought of as adding new units when necessary to fit new data. Most learn much faster than MLPs trained by back-propagation.
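The lookup-table extreme is easy to make concrete. The Python sketch below simply stores one "unit" (a center and its target) per training example and answers a query with the target of the nearest stored example; it reproduces the training data exactly but makes no attempt to generalize, and it is meant only to illustrate the counting argument, not as a practical model.

    import numpy as np

    # Toy illustration of the lookup-table extreme: one stored "unit" (a center
    # and its target) per training example; a query returns the target of the
    # nearest stored example. Training-set error is zero by construction, but
    # no real generalization takes place.

    class LookupTableNet:
        def fit(self, X, y):
            self.centers = np.asarray(X, dtype=float)
            self.targets = np.asarray(y)
            return self

        def predict(self, X):
            X = np.asarray(X, dtype=float)
            # distance from each query point to every stored training example
            d = np.linalg.norm(X[:, None, :] - self.centers[None, :, :], axis=2)
            return self.targets[np.argmin(d, axis=1)]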

Constructive Methods vs. Pruning Methods Constructive methods complement pruning methods (chapter 13), which train a larger-than-necessary network and then remove unneeded elements. Both are means of adapting the network size to the problem at hand. Although pruning methods can be effective, they require an estimate of what size is "larger than necessary." Constructive methods can build a network without this estimate.

Because constructive methods sometimes add more nodes than necessary, it is often useful to follow with a pruning phase. In some algorithms the processes compete simultaneously, one attempting to add nodes while the other tries to remove them. At some point, the processes balance and the structure stabilizes.

When to Stop Adding Units An issue that must be considered with constructive methods is when to stop adding new units. In general, the training-set error can be made as small as desired by adding more units, but the law of diminishing returns predicts that each additional unit will produce less and less benefit. The question is whether the incremental error reduction is worth the cost of the additional units in processing time, storage requirements, and hardware costs. For continuous problems, an infinite number of units might be needed to achieve zero error. One generally must declare some nonzero error to be acceptably small and stop when it is achieved.
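A simple way to encode this cost-benefit reasoning is to track the training error as a function of the number of units and stop growing either when an acceptable error has been reached or when the last unit bought less than some minimum relative improvement. The Python sketch below is generic; the two thresholds are arbitrary placeholders, not values suggested by the text.

    # Simple stop-growing test in the spirit of the discussion above: accept a
    # nonzero error target, and also stop when the most recently added unit
    # bought less than a minimum relative improvement. Both thresholds are
    # arbitrary placeholders.

    def keep_growing(error_history, error_goal=0.01, min_relative_gain=0.02):
        """error_history[k] is the training error after the k-th unit was added."""
        current = error_history[-1]
        if current <= error_goal:
            return False                     # acceptably small error reached
        if len(error_history) >= 2:
            previous = error_history[-2]
            gain = (previous - current) / max(previous, 1e-12)
            if gain < min_relative_gain:
                return False                 # last unit was not worth its cost
        return True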

Aside from the question of efficiency, there is the problem of overfitting and generalization. Chapter 14 discusses a number of factors affecting generalization. Briefly, the problem is that when training on sampled data (which may contain noise and have other imperfections), the error on the training set is only an estimate of the true error. The two error functions tend to be similar but slightly different, so a change that reduces one will not always reduce the other. Usually, they have large-scale similarities with small-scale differences. In the initial stages of training, as the network fits the large-scale features of the training set, both errors tend to decrease together. At some point, however, the network starts to fit small-scale features where the two functions differ, and additional training starts to have a detrimental effect on the true error. Improvements in the training error no longer correspond to improvements in the generalization error and the network begins to overfit the data. Thus, for good generalization, it is often desirable to stop training before the training-set error reaches zero. Some implementations avoid the problem by passing it to the pruning algorithm; the constructive phase is allowed to continue well past the point of overfitting and is then followed by pruning to satisfy generalization criteria.
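One generic way to act on this in a constructive setting is to hold out a validation set and stop adding units once the validation error has failed to improve for several successive additions, keeping the best network seen so far. The Python sketch below assumes hypothetical train_fn, add_unit_fn, and error_fn hooks supplied by whatever constructive algorithm is in use; it illustrates the idea rather than a method prescribed by the text.

    import copy

    # Grow the network, but judge each addition on a held-out validation set and
    # keep the best network seen so far. `train_fn`, `add_unit_fn`, and `error_fn`
    # are hypothetical hooks supplied by the constructive algorithm being used.

    def grow_with_validation(net, train_data, val_data, train_fn, add_unit_fn,
                             error_fn, max_units=50, patience=3):
        best_net, best_val, bad_steps = copy.deepcopy(net), float("inf"), 0
        for _ in range(max_units):
            train_fn(net, train_data)
            val_err = error_fn(net, val_data)
            if val_err < best_val:
                best_net, best_val, bad_steps = copy.deepcopy(net), val_err, 0
            else:
                bad_steps += 1
                if bad_steps >= patience:    # further units help only the training set
                    break
            add_unit_fn(net)
        return best_net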

Network Size versus Training Time A secondary advantage of constructive algorithms is that overall training times may be shorter because useful learning occurs when the network is still small. That is, even if a small network cannot satisfy the error criteria, it may learn the dominant characteristics of the target function and thereby simplify learning in later stages. With nonconstructive methods, an inadequate network would be abandoned and anything it learned would have to be relearned by the next network tested. With constructive methods, the learning is retained and finer details are picked up as more nodes are added.

There seems to be a trade-off between training time and network size, with fast algorithms tending to produce larger, less efficient networks. Many algorithms train the network until the error stops decreasing, then add more units and resume training, repeating until the error is acceptably small. The problem is that plateaus in the error-versus-time curve are common with back-propagation training of MLP networks. The E(t) error curve often has long flat intervals followed by a sharp drop. In a flat region, it may be difficult to tell whether the error has reached its final minimum or whether it will decrease further with more training. A constructive algorithm that does not wait long enough may add unnecessary units; one that waits too long simply wastes time. Thus, if one is impatient, the resulting network may be larger than necessary and may not generalize as well as possible. This is another reason for combining constructive and pruning methods.
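A common way to decide when training has stalled is a patience rule: declare a plateau only after the error has failed to improve by at least some minimum relative amount for a given number of consecutive checks. The Python sketch below implements such a rule; the patience and improvement thresholds are arbitrary placeholders and, as noted above, setting them too low risks adding unnecessary units while setting them too high wastes time.

    # Patience-based plateau test: report a stall only after the error has failed
    # to improve by at least `min_drop` (relative) for `patience` consecutive
    # checks. Both values are arbitrary placeholders.

    def make_plateau_detector(patience=10, min_drop=1e-3):
        state = {"best": float("inf"), "stalled": 0}

        def stalled(error_now):
            if error_now < state["best"] * (1.0 - min_drop):
                state["best"] = error_now    # genuine progress: reset the counter
                state["stalled"] = 0
            else:
                state["stalled"] += 1
            return state["stalled"] >= patience

        return stalled

The detector would be called once per epoch (or once per block of epochs); when it reports a stall, the constructive algorithm adds a unit and a fresh detector is created.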

There are, of course, more sophisticated methods of testing for (near) convergence than thresholding the slope of E(t). Second-derivative information in the form of the Hessian matrix (or at least its diagonal elements) may be useful, but will not entirely solve the problem because the underlying optimization is nonlinear.