Chapter 12 - Constructive Methods

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

12.5 Marchand's Algorithm

Marchand, Golea, and Ruján [256] describe a method for constructing a one-hidden-layer network of linear threshold units to solve binary mapping problems (figure 12.6). It is always possible to classify N input patterns by creating N hidden nodes, each of which recognizes one of the patterns. The network would act like a look-up table and the number of nodes needed would grow linearly with the size of the problem. But this fails to capture correlations in the training data and the resulting network does not generalize well. Usually it is better if each hidden unit recognizes as many patterns as possible.

The algorithm described by Marchand, Golea, and Ruján adds hidden units sequentially. The weights of each new unit are chosen to split a group of patterns with like targets from the rest of the data. On one side of the hyperplane defined by the unit, all patterns have the same target value; on the other side, target values may be mixed. The separated patterns are then removed from the working set and the procedure repeated with additional hidden units. Each new hidden unit slices off another set of training patterns that share the same target. The procedure stops when all remaining patterns have the same target.
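
In outline, the construction reduces to a loop that repeatedly searches for a separating hyperplane and shrinks the working set. The Python sketch below is illustrative only; the helper find_best_slice, which must return a threshold unit whose positive side contains patterns of a single target class, is a hypothetical placeholder for the search described under "Selection of Hidden Unit Weights" below.

    import numpy as np

    def build_hidden_layer(X, T, find_best_slice):
        """Marchand-style construction (sketch).

        X : (N, d) array of bipolar input patterns
        T : (N,) array of +1/-1 targets
        find_best_slice(X, T) -> (w, b) such that every pattern with
            w @ x + b > 0 shares the same target (hypothetical helper).
        """
        weights, biases, cluster_targets = [], [], []
        work = np.arange(len(T))                  # indices not yet sliced off
        while len(set(T[work].tolist())) > 1:     # stop when remaining targets agree
            w, b = find_best_slice(X[work], T[work])
            side = X[work] @ w + b > 0            # patterns cut off by this unit
            cluster_targets.append(int(T[work][side][0]))
            weights.append(w)
            biases.append(b)
            work = work[~side]                    # remove the sliced-off patterns
        last_target = int(T[work][0]) if len(work) else 0
        return np.array(weights), np.array(biases), cluster_targets, last_target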

Figure 12.6: The algorithm of Marchand, Golea, and Ruj´an [256] creates a single-hidden-layer network of linear threshold units by adding hidden units sequentially. Each new unit slices off a group of training patterns that share the same target value. Filled and empty circles represent training points with positive and negative targets. Lines show the hidden unit decision surfaces; the numbers to the side indicate the order in which the hidden units were created. All points are classified correctly, although some are very close to the boundary.

It is always possible to separate at least one pattern from the rest, so in the worst case no more than N - 1 nodes are needed to recognize N patterns. In practice, though, the patterns are often correlated because of regularities in the target function and clustering in the input distribution. Most slices can then remove more than one pattern, so fewer than N - 1 nodes are usually required.

The procedure sequentially creates h hidden units, which partition the input space into regions, each containing one or more training patterns that share the same target. The resulting internal representation is linearly separable in the sense that the desired target values can be generated from the hidden unit activities with no need for further hidden layers. The weight u_j from the jth hidden unit to the output unit can be found by the perceptron algorithm, or it can be set directly according to equations 12.6 and 12.7, where S_h is the target of the (h + 1)th cluster. These weights increase exponentially; the perceptron algorithm will generally find a different set.
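
One weighting with the exponential growth described above can be written as follows; this is an illustrative reconstruction under a stated sign convention, not necessarily the book's own equations 12.6 and 12.7. Assume hidden unit j outputs +1 on the cluster it slices off, -1 on every pattern still in the working set when it was created, and an arbitrary value on patterns removed earlier; write s_j for the target of the cluster sliced off by unit j and s_{h+1} for the target of the final cluster. Then one valid choice is

    u_j = s_j \, 2^{\,h-j}, \qquad j = 1, \ldots, h,
    \qquad u_0 = \tfrac{1}{2}\, s_{h+1} + \sum_{j=1}^{h} u_j .

The factor 2^{h-j} makes the decisive unit's weight larger in magnitude than the combined contribution of all units created after it, whose outputs on that pattern are unconstrained, while the bias settles the final cluster, on which every hidden unit outputs -1. The largest magnitude grows as 2^{h-1} with the number of hidden units, consistent with the exponential growth noted above.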

Selection of Hidden Unit Weights To obtain small networks, it is desirable for each hidden unit to slice off as many patterns as possible that share the same target. The following greedy procedure is simple, but not always optimal.

When adding hidden unit k, the working set contains N_+ patterns with target +1 and N_- patterns with target -1. The procedure in [256] tries to find two weight vectors: one that excludes the largest number M_+ of positive patterns and one that excludes the largest number M_- of negative patterns. The vector with the larger ratio M_+/N_+ (or M_-/N_-) is then chosen.
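
The selection step might be written as follows. The helper find_excluding_vector, which returns a candidate unit together with the number of patterns of the requested class it excludes, is hypothetical (one possible form is sketched after the next paragraph); this function plays the role of find_best_slice in the earlier sketch.

    def find_best_slice(X, T):
        """Pick the candidate unit that removes the larger fraction of its class.

        Assumes the working set still contains both classes, so neither count
        is zero.  find_excluding_vector is a hypothetical helper (see below).
        """
        n_pos, n_neg = int((T == +1).sum()), int((T == -1).sum())
        w_pos, b_pos, m_pos = find_excluding_vector(X, T, label=+1)
        w_neg, b_neg, m_neg = find_excluding_vector(X, T, label=-1)
        if m_pos / n_pos >= m_neg / n_neg:
            return w_pos, b_pos
        return w_neg, b_neg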

To find the weight vector that maximizes M_+, a pattern ξ^μ with a positive target is chosen and weights are selected so that the unit has a +1 output for this pattern and a -1 output for all other patterns: w_kj = ξ^μ_j for j = 1, ..., N_i, with bias w_k0 = 1 - N_i, where N_i is the number of inputs. At this point, all -1 patterns and pattern ξ^μ are correctly classified, but some of the other +1 patterns may be misclassified. One of the misclassified patterns is chosen and the perceptron rule is used to adjust the weights to classify it correctly as well, without causing ξ^μ or the -1 patterns to become misclassified. If this succeeds, the change is accepted and the algorithm goes on to another misclassified pattern; if it fails, the change is discarded and another pattern is tried. After all the misclassified +1 patterns have been considered, the vector separates some number v_1 of the +1 patterns, but others remain misclassified (unless the patterns are linearly separable). This weight vector is saved and the correctly classified +1 patterns are removed from the working set. The procedure is then repeated, starting from another still-misclassified pattern, to generate another weight vector. All +1 patterns, including those excluded by the first vector, are counted when computing the number v_2 excluded by the second vector, and v_2 is compared with v_1 to choose the better weight vector. This is repeated, removing the newly separated patterns each time, and the best weight vector found is retained. A similar procedure maximizes M_-.
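
A concrete reading of this search is sketched below. The bounded constrained perceptron run (max_epochs), the use of bipolar ±1 inputs for the initial bias 1 - N_i, and trying every positive pattern as the starting anchor (rather than only previously misclassified ones) are simplifying assumptions of the sketch, not details from [256].

    import numpy as np

    def perceptron_fit(X, T, w, b, max_epochs=100, eta=1.0):
        """Run the perceptron rule on (X, T); return (w, b, converged)."""
        for _ in range(max_epochs):
            errors = 0
            for x, t in zip(X, T):
                if t * (x @ w + b) <= 0:
                    w = w + eta * t * x
                    b = b + eta * t
                    errors += 1
            if errors == 0:
                return w, b, True
        return w, b, False

    def find_excluding_vector(X, T, label=+1):
        """Greedy search for a unit whose positive side contains only
        class-`label` patterns, capturing as many of them as possible.
        Returns (w, b, number_of_patterns_excluded)."""
        pos = X[T == label]            # patterns to slice off
        neg = X[T != label]            # patterns that must stay on the -1 side
        n_inputs = X.shape[1]
        best = (None, None, 0)
        for mu in range(len(pos)):
            # Initial unit accepts only the anchor pattern (bipolar inputs):
            # weights = anchor pattern, bias = 1 - N_i.
            w, b = pos[mu].astype(float), 1.0 - n_inputs
            captured = {mu}
            for nu in range(len(pos)):
                if nu in captured:
                    continue
                if pos[nu] @ w + b > 0:      # already on the +1 side
                    captured.add(nu)
                    continue
                # Try to pull pattern nu onto the +1 side while keeping the
                # captured patterns positive and all other-class patterns negative.
                idx = sorted(captured | {nu})
                Xc = np.vstack([pos[idx], neg])
                Tc = np.concatenate([np.ones(len(idx)), -np.ones(len(neg))])
                w_new, b_new, ok = perceptron_fit(Xc, Tc, w.copy(), b)
                if ok:
                    w, b = w_new, b_new
                    captured.add(nu)
            if len(captured) > best[2]:
                best = (w, b, len(captured))
        return best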

Remarks Simulation results for the parity function, random binary functions, and the mirror symmetry problem are discussed by Marchand, Golea, and Ruján [256]. (The target in the symmetry problem is +1 if a vector of binary bits is symmetric about its center.) The net found for the symmetry problem was optimal; a similar solution could not be found by the tiling algorithm.

The search for hidden unit weight vectors may take a long time because it tries to find the single vector that excludes the largest number of patterns. The procedure starts with one pattern, say p_start, and scans all the other (positive) patterns to see if they can be separated along with p_start; it then moves on to the next starting pattern. This is a loop over all patterns nested inside another loop over all patterns, so there are O(M^2) steps (where M is the number of patterns), each of which invokes the perceptron learning algorithm. For many problems, however, the running time seems reasonable compared to back-propagation.