Chapter 9 - Faster Variations of Back-Propagation

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

9.11 Remarks

All the methods summarized in this chapter were proposed to accelerate learning. It should be remembered that there are other factors affecting learning time that have not been considered here. As noted earlier, factors such as network structure, data representation, and choice of error function may have much stronger effects on performance and training time than the optimization method. Standard practices such as the use of momentum, the use of tanh rather than sigmoid nodes, centering and normalization of inputs and outputs, and the choice of on-line versus batch updates also affect training times. If training time is a concern, it is best to explore these options before looking for fast training methods. Still, it may be necessary to train candidate networks in the process of comparing these factors, and adaptive learning rate algorithms are a reasonable compromise between speed and robustness.
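
As an illustration of one of these standard practices, the following is a minimal sketch (not from the book; the function and variable names are ours) of centering and scaling the inputs to zero mean and unit variance before training:

    import numpy as np

    def standardize(data, eps=1e-8):
        """Center each column to zero mean and scale it to unit variance."""
        mean = data.mean(axis=0)
        std = data.std(axis=0) + eps            # eps guards against constant columns
        return (data - mean) / std, mean, std

    # Hypothetical usage: X holds one input pattern per row.
    X = np.random.randn(100, 4) * 5.0 + 3.0     # raw, uncentered inputs
    X_scaled, X_mean, X_sd = standardize(X)
    # The same mean and scale must be reused for validation and test inputs.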

Often, adaptive learning rate methods are not any faster than standard back-propagation with optimally tuned parameters [239]. Even so, they effectively automate the search for good parameters and so may be more reliable and much easier to use. When the time needed to select good parameters by hand is included, the adaptive methods may still come out ahead. In any case, they are usually much faster than back-propagation with poor parameter choices.

A potential problem with some adaptive methods is that they introduce additional parameters that need to be tuned. In the worst case, finding a good set of parameters for the adaptive method may be no easier than finding one for standard back-propagation. Another concern is that they may require more storage. Delta-bar-delta, for example, stores a separate learning rate and averaged gradient δ̄ for each weight. This is not a problem in small computer simulations, but it may be a factor in applications using limited hardware (e.g., custom integrated circuits). Standard on-line back-propagation requires the least amount of storage.
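
For concreteness, here is a minimal sketch of delta-bar-delta's per-weight bookkeeping, assuming the usual form of the rule (additive learning rate increases, multiplicative decreases, and an exponentially averaged gradient); the parameter names kappa, phi, and theta and their values are illustrative rather than taken from the book:

    import numpy as np

    def delta_bar_delta_step(w, grad, eta, delta_bar,
                             kappa=0.01, phi=0.1, theta=0.7):
        """One update; eta and delta_bar have the same shape as the weights w."""
        same_sign = delta_bar * grad > 0
        opposite = delta_bar * grad < 0
        eta = eta + kappa * same_sign                   # additive increase
        eta = eta * np.where(opposite, 1.0 - phi, 1.0)  # multiplicative decrease
        w = w - eta * grad                              # per-weight gradient step
        delta_bar = (1.0 - theta) * grad + theta * delta_bar  # running average
        return w, eta, delta_bar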

Alpsan et al. [8] asked if modified back-propagation algorithms were worth the effort and concluded that many were not. They note that optimally tuned back-propagation is often as fast as any other method and that many of the adaptive methods are sensitive to parameters and no easier to tune than standard back-propagation. They considered delta-bar-delta, superSAB, and Vogl's method, among others, but not Rprop or quickprop.

At this point, Rprop and quickprop seem to be the favored methods. Rprop has fewer critical parameters and may be more reliable in general. Other methods will often do better on specific problems, however, so it may be worth experimenting.
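
To illustrate why Rprop has few critical parameters, here is a minimal sketch of a sign-based, Rprop-style update (the variant without weight backtracking); the increase and decrease factors 1.2 and 0.5 and the step-size bounds are commonly quoted defaults and should be treated as illustrative:

    import numpy as np

    def rprop_step(w, grad, prev_grad, step,
                   inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
        """One update; step and prev_grad have the same shape as the weights w."""
        prod = grad * prev_grad
        step = np.where(prod > 0, np.minimum(step * inc, step_max), step)
        step = np.where(prod < 0, np.maximum(step * dec, step_min), step)
        grad = np.where(prod < 0, 0.0, grad)   # skip the step where the sign flipped
        w = w - np.sign(grad) * step           # only the sign of the gradient is used
        return w, step, grad                   # grad is reused as prev_grad next time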

When training time is very important, it is worth considering the standard optimization algorithms, some of which are reviewed in chapter 10. These may be much faster than simple variations of back-propagation in some cases, although this is somewhat problem-dependent. The second-order methods seem to be most helpful in the final stages of function approximation problems where it is necessary to reduce the error to very small values. Methods like conjugate gradient descent or Newton's method converge very quickly in the neighborhood of a local minimum, but they are not necessarily any faster (and may be slower) than simpler first-order methods in the initial search stages. For classification problems where training is stopped as soon as all outputs are correct within a tolerance, for example, 0.2, of the target values on all patterns, the methods of this chapter may be as fast as or faster than conventional second-order optimization methods. If it is necessary to continue the search and locate the minimum very precisely, then it may be worth switching to a more sophisticated second-order method for the final tuning.
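
As an illustration, the tolerance-based stopping test described above might be sketched as follows (the function and variable names are ours):

    import numpy as np

    def all_within_tolerance(outputs, targets, tol=0.2):
        """True when every output is within tol of its target on every pattern."""
        return np.all(np.abs(outputs - targets) <= tol)

    # e.g., inside a training loop:
    #     if all_within_tolerance(net_outputs, targets):
    #         break   # classification criterion met; no need for fine minimization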

If training speed is extremely critical, it may also be worth considering a completely different sort of approximation system, since back-propagation training of MLP networks is one of the slowest training methods among common approximation systems [239]. Many other approximation methods can achieve similar error rates (on suitable problems) with much shorter training times. Nearest-neighbor methods, for example, require almost no training time (simply store the patterns) but have longer recall times. Decision trees and parametric classifiers can also be developed quickly when they are applicable. Within neural network models, alternatives include radial basis function networks, LVQ, and ART networks.
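
The nearest-neighbor trade-off can be made concrete with a short sketch (illustrative only): "training" is just storing the patterns, while recall requires a search over all of them.

    import numpy as np

    class NearestNeighbor:
        def fit(self, X, y):            # training: simply store the patterns
            self.X, self.y = X, y
            return self

        def predict(self, x):           # recall: scan every stored pattern
            dists = np.sum((self.X - x) ** 2, axis=1)
            return self.y[np.argmin(dists)]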