
Chapter 12 - Constructive Methods

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

12.2 Cascade-Correlation

Fahlman and Lebiere [120] describe the cascade-correlation algorithm. New hidden units are added one at a time with each new unit receiving connections from the external inputs and all existing hidden units. Figure 12.2 shows the resulting (nonlayered) structure. Input weights to the new unit are chosen to maximize the correlation (covariance) of its response with the remaining output error. The unit is then added to the network and the weights from all the hidden units (new and old) to the output node are retrained to minimize the error. If the resulting error is acceptable, the network is complete; otherwise a new hidden unit is added and the process repeated.

Figure 12.2: Cascade-correlation adds hidden units one at a time. Each new unit receives connections (shown as solid dots) from the external inputs and all existing hidden units. Its input weights are chosen to maximize the covariance of its response with the residual output error. Weights into existing hidden units remain unchanged. Then the weights from all the hidden units (new and old) to the output node are retrained to minimize the error. If the resulting error is acceptable, the network is complete; otherwise a new hidden unit is added and the process repeated.
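
To make the nonlayered structure of figure 12.2 concrete, the following is a minimal sketch (in Python/NumPy, with our own array layout and names such as hidden_weights and output_weights, and tanh units assumed for illustration) of a forward pass in which each hidden unit receives the bias, the external inputs, and the outputs of all previously added hidden units.

```python
import numpy as np

def forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade network.

    x              : (n_inputs,) external input pattern
    hidden_weights : list of frozen weight vectors; the k-th has length
                     1 + n_inputs + k (bias, inputs, earlier hidden units)
    output_weights : (n_outputs, 1 + n_inputs + n_hidden)
    """
    z = np.concatenate(([1.0], x))      # bias plus external inputs
    for w in hidden_weights:            # each unit sees everything added before it
        v = np.tanh(w @ z)              # sigmoidal hidden unit
        z = np.append(z, v)             # its output feeds all later units and the outputs
    return output_weights @ z           # linear output nodes assumed here
```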

The Algorithm

The algorithm begins with a network containing no hidden units; the inputs are connected directly to the outputs. The output nodes are usually sigmoidal, but linear nodes may be used for continuous mapping problems. The following steps are repeated until the error is acceptably small (a code sketch of these steps appears after the list):

  1. Train the weights of the output node(s) using any appropriate algorithm. Single-layer training rules such as the Widrow-Hoff delta rule or the perceptron learning algorithm may be used. There is no need to back-propagate errors through the hidden units since their input weights are frozen.

  2. When no significant error improvement has occurred after a certain number of training cycles, evaluate the error. If it is small enough, stop.

  3. Otherwise create a new unit with connections from the inputs and all pre-existing hidden units. Select its weights to maximize S, the magnitude of the covariance of the new unit's response V with the residual error

    S = \sum_{o} \left| \sum_{p} \left( V_p - \bar{V} \right) \left( E_{p,o} - \bar{E}_o \right) \right|        (12.2)

    where o indexes the output units and p indexes the training patterns. The quantities V̄ and Ē_o are the averages of V and E_o over all patterns.

    S can be maximized iteratively by gradient ascent. The derivative of S with respect to the ith weight is

    \frac{\partial S}{\partial w_i} = \sum_{p,o} \sigma_o \left( E_{p,o} - \bar{E}_o \right) f'_p \, I_{i,p}        (12.3)

    where σ_o is the sign of the correlation between the candidate's value and output o, f'_p is the derivative of the candidate unit's activation function with respect to the sum of its inputs for pattern p, and I_{i,p} is the input the unit receives from unit i for pattern p.

    When S stops improving, the new unit is added to the network and its input weights are frozen.

  4. Go to step 1.
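
As a rough illustration of steps 1-3, here is a sketch in Python/NumPy under our own simplifying assumptions (tanh candidate units, linear output nodes trained with the Widrow-Hoff/LMS rule, and plain fixed-step gradient ascent on S rather than quickprop); the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def retrain_output(H, T, steps=500, lr=0.01):
    """Step 1: delta-rule (LMS) retraining of the output weights only.

    H : (n_patterns, n_features) bias + inputs + all frozen hidden activations
    T : (n_patterns, n_outputs) target values
    """
    W = np.zeros((T.shape[1], H.shape[1]))
    for _ in range(steps):
        E = T - H @ W.T                     # residual error, linear outputs
        W += lr * E.T @ H / len(H)          # Widrow-Hoff update
    return W, T - H @ W.T                   # weights and final residual error

def train_candidate(Z, E, steps=200, lr=0.1, seed=0):
    """Step 3: maximize S (eq. 12.2) by gradient ascent using eq. 12.3.

    Z : (n_patterns, n_features) inputs seen by the candidate (fixed)
    E : (n_patterns, n_outputs) residual output error (fixed)
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=Z.shape[1])
    Ec = E - E.mean(axis=0)                 # E_{p,o} - Ebar_o
    for _ in range(steps):
        V = np.tanh(Z @ w)                  # candidate response, per pattern
        corr = (V - V.mean()) @ Ec          # covariance with each output's error
        sigma = np.sign(corr)               # sigma_o in eq. 12.3
        fprime = 1.0 - V**2                 # tanh derivative f'_p
        # eq. 12.3: dS/dw_i = sum_{p,o} sigma_o (E_{p,o} - Ebar_o) f'_p I_{i,p}
        grad = ((Ec * sigma).sum(axis=1) * fprime) @ Z
        w += lr * grad                      # gradient ascent on S
    V = np.tanh(Z @ w)
    S = np.abs((V - V.mean()) @ Ec).sum()   # eq. 12.2
    return w, S
```

A growing loop would alternate the two: retrain the output weights on the cached feature matrix H, stop if the residual error is small enough, otherwise train a candidate on H and the residual error, append the frozen candidate's activations to H, and repeat.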

Only the magnitude of the new unit's correlation with the residual error is important, hence the absolute value in the formula for S. If the correlation is positive, a negative output weight can be chosen to decrease the error; if the correlation is negative, a positive output weight will do.

A variation of the algorithm is to allocate a pool of candidate units with random initial input weights. Let each maximize S individually and select the best for addition to the network. This decreases the possibility of adding a useless unit that got stuck during training and it explores more of the weight space simultaneously. Other types of units besides sigmoidal (e.g., radial Gaussian) may also be included in the candidate pool.
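
The pool variant is a small extension of the candidate-training sketch above (train_candidate and the pool size are, again, our own illustrative names and choices): train several randomly initialized candidates independently on the same residual error and keep the one with the largest S.

```python
def best_of_pool(Z, E, pool_size=8):
    """Train a pool of candidates independently; keep the one with the largest S."""
    candidates = [train_candidate(Z, E, seed=s) for s in range(pool_size)]
    return max(candidates, key=lambda ws: ws[1])    # (weights, S) with best score
```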

When the output weights are being trained, all other weights are frozen. Because the activations of the hidden units depend only on the input pattern and do not change when the output weights change, there is no need to recalculate their response to each input pattern. If sufficient memory is available, the hidden unit responses can be stored in an array for quick retrieval rather than recalculated with each pattern presentation. This can significantly speed up simulations of large networks.
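
A minimal caching scheme along these lines (array names are illustrative and follow the earlier sketches): because a frozen hidden unit's activation depends only on the input pattern, it can be computed once for the whole training set and stored as an extra column that output-weight training simply reuses.

```python
import numpy as np

def add_hidden_column(H, w_new):
    """Append the new frozen unit's activations (one per pattern) to the cache H.

    H     : (n_patterns, n_features) bias + inputs + earlier hidden activations
    w_new : (n_features,) frozen input weights of the newly added unit
    """
    v = np.tanh(H @ w_new)                  # computed once for all patterns
    return np.column_stack([H, v])          # reused during output-weight training
```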

Remarks

  • There is no need to guess the best architecture before training. Cascade-correlation builds reasonable, but not optimal, networks automatically.

  • Input weights for new hidden nodes are chosen to maximize S, the covariance of the node response with the remaining output error. This is not the same as minimizing the error and won't be optimal in general.

  • The procedure generally does not find the smallest possible network for a problem and tends to create deep networks, so a final pruning phase may be desirable.

  • Cascade-correlation learns quickly. Unlike back-propagation, training doesn't slow down dramatically as the number of hidden layers increases because only the output weights are retrained each time. For the problems studied, the learning time in epochs grows approximately as N log N where N is the number of hidden nodes finally needed to solve the problem [120]. On the two-spirals problem (figure 12.3), an average of just 1700 epochs was needed.

  • Although cascade-correlation learns quickly, it can overfit the data [88]. Pruning or early-stopping based on cross-validation may be necessary to avoid overfitting; however, reasonably good generalization as measured by insensitivity to input noise was found in [149]. There, nets created by cascade-correlation tended to have saturated hidden nodes whose values change little when small amounts of noise are added to the inputs.

Figure 12.3: The two-spirals problem [233] is sometimes used as a benchmark for constructive algorithms because it requires careful coordination of many hidden nodes and is difficult to learn with simple back-propagation in an MLP network. (It is not representative of most real-world problems, however.) In a single-hidden-layer architecture, 40 or more hidden nodes are generally needed and training times are long. Most successful solutions use more than one hidden layer; some use short-cut connections.

  • Hidden-unit input weights are frozen after the unit is added, so features detected by the unit will not be unlearned if the network is retrained with new data. This helps stabilize learning if the training data changes over time.

  • "Divide and conquer," a similar method, is described in [322]. Unlike some other algorithms, it can create multiple hidden layers. Unlike cascade-correlation, it can create hidden layers with more than one node and it doesn't use a correlation measure. Like cascadecorrelation, nodes are trained one-at-a-time with weights in other nodes held constant. As a result, back-propagation through hidden nodes is never necessary.

  • A cascade-correlation architecture for recurrent networks is described by Fahlman [119]. A procedure similar to cascade-correlation, but using error minimization rather than covariance maximization is described by Littmann and Ritter [248]; benchmarks of cascade-correlation are also included.