
Chapter 7 - Weight-Initialization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

7.2 Nonrandom Initialization

An alternative to random initialization is to base the initial state of the system on an approximate solution provided by another type of classifier or fitting system. The goal is to start the system at a reasonably good solution which can then be fine tuned by normal training methods. If things work as planned, the initial state will be in the basin of attraction of a good minimum so global search capability and convergence time are not critical issues and relatively simple training methods may therefore suffice. (However, techniques such as quasi-Newton methods or conjugate gradients may be preferred over back-propagation because final tuning is where they really outperform simpler methods.)

In many cases, the initialization time will be negligible compared to training times for randomly initialized networks. Even when the time is not negligible, it may be regained later because (1) the actual training time may be much shorter if the initial solution is good and (2) problems of convergence to poor local minima may be avoided because the system is already at an approximate solution when it starts. Often, only one fine-tuning run will be needed. With random initialization, in contrast, it is usually necessary to train a number of networks with different starting weights because of the possibility of poor local minima and the total training time for all networks should be considered when making comparisons.

Another reason for using nonrandom initializations is to embed domain-dependent information in a network. In many applications there is significant human knowledge that might be useful even though it might be incomplete or only partly reliable. Existing techniques may give reasonable but imperfect solutions or we may have partial knowledge about the form the solution should take. Pure training on examples in an unstructured network does not provide an efficient way to use this information. When training data are sparse, the external bias supplied by the initial solution may be a major factor in obtaining a solution that generalizes well.

7.2.1 Principal Components Initialization

Principal components analysis (PCA) attempts to identify the major axes of variation of a data set, the directions along which the data varies the most. The principal components of a data set are determined from the eigenvectors of the covariance matrix

C = E[(x - μ)(x - μ)^T],

where μ = E[x] is the mean of the input patterns.

The eigenvalues measure the variance of the projection of the data onto the corresponding eigenvectors. The principal eigenvector (having the largest eigenvalue) indicates the direction along which the data varies the most; lesser eigenvectors (having smaller eigenvalues) indicate directions with lesser amounts of variation. PCA is discussed further in appendix B.

Going on the assumption that these are important directions for the function to be learned, one can initialize the input-to-hidden weight vectors along these directions. In a network with H hidden nodes, the H eigenvectors with the largest eigenvalues would be selected. A two-phase algorithm is described in [136]. In phase one, the input-to-hidden weights are fixed to the principal component directions while the output weights are trained. In the second phase, all weights are allowed to learn. The authors report a large drop in error after the first few iterations of the second phase and an overall speed up in learning times.
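The first phase of this scheme, choosing input-to-hidden weight vectors along the principal directions, can be sketched in NumPy; the function and variable names here are illustrative, not taken from [136]:

```python
import numpy as np

def pca_init_weights(X, n_hidden):
    """Initialize input-to-hidden weight vectors along the principal
    component directions of the data X (rows are patterns)."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    # the H eigenvectors with the largest eigenvalues become weight vectors
    return eigvecs[:, order[:n_hidden]].T   # shape (n_hidden, n_inputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W = pca_init_weights(X, 3)
```

With the rows of W fixed, the output weights would be trained in phase one; in phase two all weights are released to learn.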

7.2.2 Discriminant Analysis Initialization

As an unsupervised method, principal components analysis ignores classification information when choosing its directions. Discriminant analysis, a related linear projection method, uses class information and so may yield a better set of directions for classification purposes (see section B.2).

7.2.3 Nearest Neighbor Classifier Initialization

A nearest neighbor classifier that performs well on a classification problem can be used to initialize a layered network solution. The following method is described by Smyth [349]. Similar ideas are suggested by Raudys and Skurikhina [303] and others.

The process starts with a set of cluster centers for each class. The k-means algorithm, or a similar method such as LVQ, may be used to select representatives. A bisecting hyperplane is then created for every pair of centers with different class memberships. For n cluster centers, there will be n(n - 1)/2 hyperplanes; a culling algorithm selects the fraction of these that actually separate adjacent centers. Each remaining hyperplane determines the weight vector for a node in the hidden layer.

Next, the cluster centers are applied as inputs to calculate the hidden-to-output weights. The hidden unit responses are treated as fixed in this phase so the hidden-to-output weights can be found by solving a system of linear equations (e.g., by a pseudoinverse solution). The response can then be fine tuned by normal back-propagation training.
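The two steps above can be sketched in NumPy, assuming sigmoid hidden units; the names are illustrative, and the culling of hyperplanes that do not separate adjacent centers is omitted:

```python
import numpy as np
from itertools import combinations

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nn_init(centers, labels, gain=4.0):
    # One hidden unit per bisecting hyperplane between centers of
    # different classes.  (A real implementation would also cull
    # hyperplanes that do not separate adjacent centers.)
    W, b = [], []
    for i, j in combinations(range(len(centers)), 2):
        if labels[i] == labels[j]:
            continue
        w = centers[j] - centers[i]          # normal to the bisector
        m = 0.5 * (centers[i] + centers[j])  # midpoint of the pair
        W.append(gain * w)
        b.append(-gain * (w @ m))            # plane through the midpoint
    W, b = np.array(W), np.array(b)

    # Hidden responses at the centers are treated as fixed, so the
    # hidden-to-output weights follow from a least-squares solution.
    H = sigmoid(centers @ W.T + b)
    H1 = np.hstack([H, np.ones((len(H), 1))])   # bias column
    V, *_ = np.linalg.lstsq(H1, labels, rcond=None)
    return W, b, V

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0.0, 1.0, 1.0])
W, b, V = nn_init(centers, labels)
```

The resulting weights would then serve as the starting point for normal back-propagation fine tuning.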

It was noted that the method does not work for the XOR problem because of limitations in the algorithm used to cull redundant hyperplanes [349]. This is a limitation in the specific culling algorithm, however, not a fundamental problem.

The algorithm may allocate more hidden units than are actually needed because it ignores the fact that hyperplanes can often be shared to separate more than one pair of clusters. A post-training pruning step might address this problem.

Nearest neighbor initialization methods should work well if a nearest neighbor classifier performs well on the classification problem. There is a question of how many cluster centers to choose; the choice is presumably based on the performance of the nearest neighbor classifier. It should be noted that simple unsupervised clustering procedures such as the k-means algorithm use only the input patterns when selecting cluster centroids; the output classifications are ignored. When the data are naturally clustered in the input space and these clusters correspond to output classifications, unsupervised clustering methods may select an adequate set of centroids. If not, then other centroid selection methods will be required.

Also, nearest neighbor classifiers often require large data sets to achieve good accuracy so random initialization may give better results when data are limited. In an empirical test, randomly initialized nets generalized better in the sense that the difference between the training and test set errors was smaller. Because the nearest-neighbor initialization starts the network at a reasonably good solution, it may be easier to overtrain.

7.2.4 Prototype Initialization

Another nearest-neighbor initialization is discussed by Denoeux and Lengelle [105]. There is one hidden node for each prototype. Input patterns are normalized so each input vector has unit length, ||x|| = 1. This can be done without loss of information by increasing the input dimension by 1 to encode the (suitably scaled) magnitude. When the weight vectors are also unit length, ||w|| = 1, then 2w^T x = 2 - ||w - x||^2 and a sigmoid unit has properties similar to the Gaussian units normally used in radial basis functions. (The normalized inputs lie on the surface of an n-dimensional sphere which is divided into two sections by the hyperplane defined by w.) The hidden units thus have localized responses in the input space and can be viewed as recognizing prototype patterns.
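The normalization trick can be sketched as follows: choose a scale no smaller than the largest input magnitude, append sqrt(scale^2 - ||x||^2) as an extra coordinate, and divide by the scale (the function name is illustrative):

```python
import numpy as np

def normalize_augment(X, scale):
    # Append a coordinate encoding the (suitably scaled) magnitude,
    # then divide by `scale` so every row has unit length; no
    # information is lost because the magnitude is recoverable.
    extra = np.sqrt(np.maximum(scale**2 - (X**2).sum(axis=1), 0.0))
    Xa = np.hstack([X, extra[:, None]])
    return Xa / scale

# `scale` must be at least the largest input magnitude:
X = np.array([[3.0, 4.0], [0.0, 1.0]])
Xa = normalize_augment(X, scale=5.0)
```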

When gains are high, the hidden layer response approximates a nearest neighbor selector. That is, for nonborderline cases, the hidden unit whose weight vector is closest to the input pattern will respond strongly while more distant hidden units respond weakly, if at all. The response at the hidden layer thus approximates a 1-of-N representation so hidden-to-output weights are independent and can be set directly from the value of the target function at the prototype location. If hidden unit j responds to the jth prototype p_j, for example, then hidden-to-output weight w_j = f^-1(t_j), where f is the node nonlinearity function and t_j is the target value for the prototype. When prototype p_j is presented, hidden unit j outputs a 1, the other hidden unit outputs are approximately 0, and the output is f(w_j) ≈ t_j.
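A minimal sketch of this construction, assuming sigmoid units, orthonormal prototypes for the demonstration, and an illustrative activation threshold theta (none of these names come from [105]):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def inv_sigmoid(y):
    # f^-1 for the logistic sigmoid
    return np.log(y / (1.0 - y))

def prototype_net(P, targets, gain=20.0, theta=0.5):
    """One hidden unit per (unit-length) prototype row of P; the
    hidden-to-output weight for unit j is set to f^-1(t_j)."""
    w_out = inv_sigmoid(targets)               # w_j = f^-1(t_j)
    def forward(x):
        h = sigmoid(gain * (P @ x - theta))    # near 1-of-N at high gain
        return sigmoid(h @ w_out)              # output is f(w_j), about t_j
    return forward

# Demonstration with three orthonormal prototypes:
P = np.eye(3)
targets = np.array([0.2, 0.5, 0.9])
net = prototype_net(P, targets)
```

Presenting prototype p_j reproduces the target t_j to within the precision allowed by the gain; lowering the gain yields smoother interpolation between prototypes.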

Several methods are suggested for choosing prototypes. For continuous target functions, the best prototypes are at local extremes (peaks and dips) of the function. An algorithm producing prototypes at these locations is described.

When gains are relaxed, the network interpolates between the prototypes more smoothly so the method can be used for continuous regression problems. As in the method of section 7.2.3 there is a question of how many prototypes to choose.

Successful learning of the two-spirals problem (illustrated in figure 12.3) in about 1000 epochs was demonstrated in a 3/20/1 network. (The easy success may be attributable to the change in input representation, however, since the normalization provides a simple function of the radius r as an additional input. This makes the problem simple enough to be solved easily by a randomly initialized network.)

7.2.5 Initialization from a Decision Tree Classifier

Initialization from a decision tree is considered by Sethi [340]. A decision tree classifier performs a hierarchical series of tests on an input pattern to determine its classification. This can be represented as a tree (figure 7.1). Each node represents a test that can have one of several outcomes. The result determines which outgoing path is taken and which test is performed next. The path terminates in a leaf node when the result is an unambiguous classification. For a given input pattern, evaluation starts at the root node and follows a path to the leaf node with the appropriate classification.

Figure 7.1: Initialization from a decision tree. The function computed by the decision tree can be duplicated by a network of linear threshold units [340]. The tree computes a Boolean function of the propositions evaluated at its decision nodes so a two-hidden-layer network is sufficient if the tree's node decisions can be computed by linear threshold elements. The network can be built by inspection of the tree. Training is not required; (a) is a decision tree and (b) the equivalent network.

Since a decision tree computes a Boolean function of the propositions evaluated at its decision nodes, its function can be duplicated by a network of linear threshold units [340]. A two-hidden-layer network is sufficient if the tree's node decisions can be computed by linear threshold elements. The network can be built by inspection of the given decision tree (see [340] for details). Decision trees can usually be constructed quickly, so the overall time needed to design a tree classifier, construct the equivalent network, and fine tune it by training on examples may be much shorter than the time needed to train a network from random weights.
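As a concrete illustration, consider a hypothetical two-decision tree (class 1 iff x0 > 0.5 AND NOT x1 > 0.5). It translates by inspection into threshold-unit layers: one first-layer unit per decision node, one second-layer unit ANDing each class-1 leaf's path conditions, and an output unit ORing the class-1 leaves. The tree and all weight values here are invented for the example:

```python
import numpy as np

def step(a):
    return (a > 0).astype(float)

# First hidden layer: one threshold unit per decision node.
W1 = np.array([[1.0, 0.0],   # d0: x0 > 0.5
               [0.0, 1.0]])  # d1: x1 > 0.5
b1 = np.array([-0.5, -0.5])

# Second hidden layer: one unit per class-1 leaf, ANDing the
# conditions along its path -- here AND(d0, NOT d1).
W2 = np.array([[1.0, -1.0]])
b2 = np.array([-0.5])

# Output: OR over the class-1 leaves (only one here).
W3 = np.array([[1.0]])
b3 = np.array([-0.5])

def tree_net(x):
    d = step(x @ W1.T + b1)          # decision-node results
    leaves = step(d @ W2.T + b2)     # path conjunctions
    return step(leaves @ W3.T + b3)  # class-1 disjunction
```

For fine tuning by gradient descent, the step functions would be replaced by high-gain sigmoids.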

7.2.6 Initialization from Symbolic Rule Systems

Another method of initialization constructs a network from a rule-based solution. In [342], [375] symbolic rules are embedded in the initial structure of a neural network by translating the AND, OR, and NOT terms into corresponding network structures with appropriate weights. (See figure 4.7 for implementation of AND and OR functions.) Note that this may produce an irregular nonlayered network because the rules in the original system can have arbitrary dependencies. Additional links with small random weights are provided to let the system add new terms that may be useful. Small values are used so the embedded rules dominate initially. The network is then trained from examples to improve its performance. Because the embedded symbolic rules are often classifications, the cross-entropy error function may be preferable to the mean-squared-error function [342]. The embedded rules represent the default knowledge of the system, so if weight-decay is used it may be desirable to let the weights decay to their initial values rather than zero.

7.2.7 Initialization from Fuzzy Rule Systems

For problems where the desired input-output relationship is well understood and expressible by a small set of rules, initialization based on a fuzzy logic implementation has been suggested, e.g., [291]. This is similar to the method of section 7.2.6, but the original rules are fuzzy.

As a very simple example, if we know that the output should be large when two input variables, A and B, are large, we can prewire one of the hidden nodes to implement (A AND B) and connect it to the output node with a positive weight. (Simple logical functions such as AND, OR, and NOT are easy to implement in a single node by appropriate selection of the weights and thresholds. A AND B, for example, can be realized by connecting large positive weights from A and B and selecting the threshold so that the node is active only when both A and B are active.) The information in the rules initializes the network near a reasonably good solution which is then improved by further training on examples. Even when we don't have complete knowledge about the desired solution, we may have knowledge about certain aspects of it that can be "injected" into the network in this way.
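The prewired (A AND B) node described above might look like the following with a sigmoid unit; the weight and bias values are illustrative, chosen so the node is active only when both inputs are near 1:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Large positive weights from A and B; the bias (threshold) is set
# between the one-input-active and both-inputs-active net sums, so
# the node approximates (A AND B).
w = np.array([5.0, 5.0])
bias = -7.5

def and_node(a, b):
    return sigmoid(w @ np.array([a, b]) + bias)
```

Connecting such a node to the output with a positive weight embeds the rule; subsequent training on examples sharpens or revises it.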

7.2.8 Remarks

Bias It should be noted that basing the initial state on a solution provided by another method introduces bias which may remain in the final network. Subsequent training may amount to mere second-order adjustments of the initial solution so a network initialized from a nearest-neighbor classifier, for example, may end up with classification properties similar to the nearest-neighbor classifier.

This is desirable in some cases because it provides some assurance that the network will behave properly in critical situations and it helps in understanding the trained network. When the initial solution is satisfactory, implementation as a neural network may be useful simply to gain properties such as parallel operation and some fault tolerance. In cases where the objective is simply to accelerate learning, however, the bias may be a problem if the initial solution is only a poor guess.

Effect of the Optimization Procedure The training procedure has some effect on how much of the initial bias will be retained in the final network. In cases where it is desirable to preserve the basic form of the initial solution, small learning rates are appropriate to avoid leaving the initial basin of attraction. Local search methods (such as gradient descent) are appropriate in this case rather than truly global search techniques which ignore the initial state. If the initial solution is good, more powerful gradient techniques such as quasi-Newton methods or conjugate gradient descent may be preferable to back-propagation since their convergence behavior in the neighborhood of a solution is much better.