Chapter 12 - Constructive Methods

12.7 Principal Components Node Splitting

A method of node splitting based on detection of oscillation in the weight update directions is described by Wynne-Jones [410]. The idea is that when a network is too small, the weight vectors of hidden units may oscillate between several competing solutions. Oscillation may occur because there are two clusters of data within a class or because a decision boundary is pulled one way by one set of patterns and the other way by another set of patterns. Figure 8.5 illustrates clustering of the weight update directions that could lead to oscillation.

A large amount of weight oscillation is taken as an indication that the network is too small. The nodes whose weights oscillate the most are identified and split in two. The "child" nodes are initialized based on a principal components analysis of the oscillation in the parent node, or by examination of the Hessian matrix of the network error with respect to the weights. The Hessian method has the advantage that it can also be applied to the input nodes to determine their relative importance.

This is usually better than initializing the child nodes with random weights because it exploits the information already in the existing weights and causes less perturbation of the error. In high dimensional spaces, randomly initialized weights are unlikely to be well placed, and considerable learning may be needed to move them to where they are useful.

Splitting The network is allowed to train until it stops making progress. The weights are then frozen and the training set is presented again to evaluate the oscillation by computing the principal components of the covariance matrix of the weight updates,

C = Σ_p δW_p δW_p^T.    (12.14)

C is the sum over the P training patterns of the outer products of the weight updates δW_p = ∂E_p/∂W, whose mean is assumed to be zero. The largest eigenvalue and corresponding eigenvector of C give the magnitude of the oscillation and its direction. The node is split into two and the child nodes are initialized with weight vectors one standard deviation on either side of the parent vector along the direction of oscillation. This usually results in minimal perturbation of the existing solution but gives enough separation to break symmetry and allow the child nodes to converge to different solutions. Since exact computation of the eigenvectors can be expensive, more practical iterative techniques are mentioned by Wynne-Jones [410]. After splitting, the weights are unfrozen and training resumes.
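
As a concrete illustration, a minimal NumPy sketch of this splitting step might look as follows. The function and variable names (split_node, parent_weights, delta_w) are hypothetical, the dense eigendecomposition stands in for the iterative techniques mentioned above, and the one-standard-deviation offset is computed under the zero-mean assumption of equation (12.14).

import numpy as np

def split_node(parent_weights, delta_w):
    """Split one hidden node along its dominant direction of weight oscillation.

    parent_weights : (d,) weight vector of the node to be split
    delta_w        : (P, d) per-pattern weight updates dE_p/dW for that node,
                     collected while the weights are frozen
    Returns the two child weight vectors.
    """
    # Covariance of the weight updates, C = sum_p dW_p dW_p^T (equation 12.14);
    # the mean of the updates is assumed to be zero.
    C = delta_w.T @ delta_w

    # Largest eigenvalue and eigenvector give the magnitude and direction
    # of the oscillation; eigh applies because C is symmetric.
    eigvals, eigvecs = np.linalg.eigh(C)
    lam, v = eigvals[-1], eigvecs[:, -1]

    # One standard deviation along the direction of oscillation (dividing by
    # the number of patterns converts the summed covariance to a per-pattern
    # variance).
    offset = np.sqrt(lam / len(delta_w)) * v
    return parent_weights + offset, parent_weights - offset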

Selecting Nodes for Splitting The nodes most likely to benefit from splitting are those with a very pronounced direction of weight oscillation. Nodes can be compared on this basis by computing the ratio of the largest eigenvalue to the sum of the eigenvalues. This ratio is highest for nodes with a single dominant direction of oscillation. In high dimensional spaces, however, a node may have several directions with significant eigenvalues (in which case the node could be split along each direction). The ratio will be lower in this case, so the ratio technique would not split these nodes until there are no other options. An alternative is to calculate the second derivative of the error with respect to a normalized parameter such as the node gating parameter α described by Mozer and Smolensky [275] (summarized in section 13.2.1). A high curvature of the error with respect to α_i indicates that the error is sensitive to the weights of node i. The node is a good candidate for splitting if the curvature of the error in weight space has a dominant direction, as indicated by the eigenvalues of the Hessian of E(W). Nodes with small second derivatives of the error with respect to α, on the other hand, have little differential effect on the error and are candidates for pruning. The same process can be used to estimate the sensitivity of the error to the presence or absence of input variables.
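
To make the eigenvalue-ratio criterion concrete, the following sketch ranks hidden nodes by that ratio. The names oscillation_ratio, rank_nodes_for_splitting, and per_node_updates are illustrative assumptions rather than part of the published method, and the Hessian-based alternative is not shown.

import numpy as np

def oscillation_ratio(delta_w):
    """Ratio of the largest eigenvalue of the update covariance to the sum of
    all eigenvalues; near 1 when a single direction of oscillation dominates."""
    C = delta_w.T @ delta_w              # covariance as in equation (12.14)
    eigvals = np.linalg.eigvalsh(C)      # returned in ascending order
    total = eigvals.sum()
    return eigvals[-1] / total if total > 0 else 0.0

def rank_nodes_for_splitting(per_node_updates):
    """per_node_updates maps a node id to its (P, d) array of weight updates.
    Returns node ids ordered from strongest to weakest splitting candidate."""
    ratios = {node: oscillation_ratio(dw) for node, dw in per_node_updates.items()}
    return sorted(ratios, key=ratios.get, reverse=True)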

Backsliding A potential problem with this method is that the child nodes often revert to the position of the parent node because of the global properties of the sigmoid activations. That is, a node that makes a strong contribution to part of the global decision boundary may be influenced by training patterns that are far from that boundary. If the node is split, the child nodes may feel similar influences from the distant patterns and settle on the same solution, leaving the global boundary unchanged. Node splitting may therefore be more successful in local networks, such as radial basis function networks, than in MLP networks. In that case, oscillation is detected in the Gaussian centers.