Books24x7 Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks

5.7 Training Time

Although back-propagation has been used successfully on a wide range of problems, one of the common complaints is that it is slow. Even simple problems may take hundreds of iterations to converge and harder problems may take many thousands of iterations. Training times of days or even weeks are not unusual for large practical applications.

Much work, therefore, has been done in search of faster methods. Effects of the learning rate and momentum are considered in chapter 6. Methods of weight initialization are considered in chapter 7. Chapter 9 summarizes a number of variations of the back-propagation algorithm intended to reduce training times and chapter 10 reviews some classical optimization methods that have better theoretical convergence properties.

The size of the training set obviously affects training times because each pass through the data takes twice as long when the data set is twice as big. Additional patterns that contain no new information may simply make learning slower. This depends on the training method and cost function, among other things. In batch-mode, each pass is slower but with a SSE cost function the weight changes will be twice as large so learning may converge in half the number of epochs if it remains stable. In on-line mode, each pass is slower but twice as many weight updates are done in each pass so the overall time may not change. Less obviously, the additional patterns may supply new information that restrict possible solutions and make the problem harder so that more iterations are required or they may supply missing information that makes the problem easier.

5.7.1 Scaling of Training Time

Hinton [169], section 6.10] argues that a network with W weights typically requires O(W³) training time on a serial machine. O(W) cycles are required for each forward and backward propagation of a single pattern, O W) training patterns are typically needed to achieve good generalization, and (perhaps) O(W) weight updates are required for each pattern. Implementation on parallel hardware would reduce this only by a factor of W resulting inO(W²)training time.

Tesauro and Janssens [367] observed that training time for the parity problem increases approximately as 4^d where d is the number of inputs (the predicate order). Although some problems are easy, Judd [202], [203] has shown that the general problem is NP-complete. That is, the training time scales exponentially with the problem size in the worst case. Networks of hard-limiting linear threshold elements are considered, but the results seem applicable to networks of sigmoidal units.

This problem is not unique to back-propagation, of course. Optimization methods cannot be held responsible for the rapid growth of the search space with the size of the problem. It does suggest, though, that a simple algorithm like back-propagation alone will not be adequate to find solutions for hard problems. Indeed, when solutions to interesting problems are found, the critical factor is often external information supplied by the network designer in selecting a network architecture, collecting and editing relevant data, choosing input-output representations, selecting an error function, and so forth.

Some possible causes of slow training include:

Overly restrictive convergence criteria. If the trained network will be used for classification purposes it usually isn't necessary to train every output to within 10^-6 of the target value.
"Paralysis" due to sigmoid saturation.
Flat regions in the error surface where the gradient is small.
Ill-conditioning of the Hessian matrix (the matrix of second derivatives of the error with respect to the weights).
A poor choice of parameters such as learning rate and momentum.
The simple-mindedness of gradient descent: more sophisticated algorithms may use available information more efficiently.
The global nature of sigmoid functions. A change in one weight may alter the network response over the entire input space. This changes the derivatives fed back to every other weight and produces further weight changes whose effects reverberate throughout the network. It takes time for these interactions to settle.
The size and distribution of the training set.
Poor representations, irrelevant inputs.
Delta attenuation in deep networks.
Poor network architectures. The minimal-size network just adequate to represent the data may require a very specific set of weights that may be very hard to find. Larger networks may have more ways to fit the data and so may be easier to train with less chance of convergence to poor local minima.

Many of these factors are related of course. Sigmoid saturation may cause flat spots in the error surface, for example, which are reflected in a poorly conditioned Hessian, which, in turn, makes it difficult to choose a good learning rate. Some of these factors are due to weaknesses of back-propagation and might be avoided by algorithmic improvements while others are more fundamental.

One way to reduce training times is to increase the efficiency of the optimization procedure. Chapter 6 discusses the effects of the learning rate and momentum parameters and gives hints for selecting reasonable values. Chapter 9 summarizes variations of the back-propagation algorithm intended to improve training times; many are techniques for adaptively controlling the learning rate. Chapter 10 reviews more sophisticated optimization methods that are reputed to have better convergence properties than gradient descent.

Another way to improve training times is to give the network a headstart on a good solution. Chapter 7 discusses some network initialization techniques based on this idea.

Yet another way to reduce training times is to identify and fix the problems that cause slow convergence. The structure of the E(w) landscape has a fundamental effect on training time; some common properties are discussed in chapter 8. Many of these are reflected in numerical properties of the Hessian matrix discussed in section 8.6. Delta-attenuation is described in section 6.1.8 in conjunction with learning-rate adjustment methods.

Finally, another way to speed things up is to change the problem. Part of the reason for slow learning is that the algorithm is limited to adjusting existing weights. If the network structure is poorly matched to the problem, this may be very difficult. Algorithms that are free to change the structure during learning are often able to learn much faster. Chapters 12 and Chapter 13 discuss constructive algorithms, which "grow" networks to fit the problem, and pruning algorithms, which reduce large networks to fit the problem.

Paralysis In some cases, long learning times can be attributed to paralysis due to sigmoid saturation. That is, the sigmoid and related functions have nearly flat tails where the derivative is approximately zero for large inputs (positive or negative). Because δ_i is proportional to the slope f'_i, this leads to small derivatives for weights feeding into the node and so on backward through the network. If many nodes are saturated, then weight derivatives may become very small and learning will be slow. In digital simulations, deltas may become so small that they are quantized to zero and learning stops (double precision arithmetic is sometimes recommended for this reason).

Large external inputs or large weights are a typical cause of saturation. Normalization of the inputs to a reasonable range and initialization with small random weights are standard remedies. The weight initialization range required to avoid saturation depends on the number of inputs to a node and their correlations so different ranges may be needed in different cases. Chapter 7 discusses some weight initialization techniques based on this idea. Gain-scaling (section 8.7) has also been suggested as a way to avoid and correct for saturation.

Regardless of how the network is initialized, the weights change during learning so saturation may become a problem at a later stage. Because paralysis can have such a strong affect on learning time, it is helpful to detect and correct it before it becomes serious. In software simulations, it is relatively easy to check for paralysis after each pattern presentation [382]. If a significant fraction (e.g., 1%) of nodes are near saturation (e.g., have absolute magnitude greater than 0.9 assuming tanh nodes), then steps can be taken to fix the problem (e.g., reduce the learning rate, reduce the sigmoid gain, or scale the weights). The computational cost of the test is insignificant so it can be done often to allow detection of imminent paralysis before it becomes a problem.

Chapter 5 - Back-Propagation

5.7 Training Time

5.7.1 Scaling of Training Time