
Chapter 10 - Classical Optimization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

10.8 Discussion

10.8.1 Are Assumptions Met?

When the objective function has special characteristics, specialized algorithms can often be developed to take advantage of them and perhaps obtain great efficiencies relative to more general algorithms. The problem is that although they may be very fast in the situations they were developed for, they may not be robust. That is, the assumptions they make may lead them astray and they may actually do worse than less specialized methods if their preconditions are not satisfied.

A number of methods such as conjugate gradient descent assume a quadratic function and exploit its special properties to achieve fast convergence. Unfortunately, neural network error surfaces may be highly nonlinear and are definitely not quadratic in the large scale. (Chapter 8 discusses properties of the E(w) error surface that may cause problems: the existence of large flat areas separated by steep cliffs, for example.) Quoting from [302], [313]: "Quadratic convergence is of no particular advantage to a program which must slalom down the length of a valley floor that twists one way and another (and another, and another, there are N dimensions!). Along the long direction, a quadratically convergent method is trying to extrapolate to the minimum of a parabola which just isn't (yet) there; while the conjugacy of the N - 1 transverse directions keeps getting spoiled by the twists."
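
For reference, these methods model the error locally by its second-order Taylor expansion and generate search directions that are mutually conjugate with respect to the Hessian H of that model. A minimal sketch of the assumed form (the notation is chosen here for illustration, not taken from the text):

    E(w + \Delta w) \approx E(w) + g^T \Delta w + \tfrac{1}{2} \Delta w^T H \Delta w,
    \qquad d_i^T H d_j = 0 \quad (i \neq j),

where g is the gradient and the d_i are the search directions. If E really were quadratic, N such conjugate steps would reach the minimum exactly; on a twisting valley floor, as the quotation notes, the conjugacy keeps getting spoiled.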

Another problem with methods based on a quadratic approximation is that the Hessian is often ill-conditioned in neural network training problems [37], [331], so higher-order methods may be no more efficient than simpler methods. When the number of weights exceeds the number of training samples, an outer-product Hessian approximation will be rank deficient. Sigmoid saturation may also lead to effective loss of rank. Ill-conditioning could lead to numerical instability or a large step that takes the system out of the region where the local approximation is valid. This is a well-known problem with Newton's method and there are standard fixes, but these add complications beyond the pure algorithm. The point is that a straightforward implementation of a pure algorithm will not always work better than a simpler method unless these sorts of complications are addressed.
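
To illustrate the rank-deficiency point, the sketch below (plain NumPy, with hypothetical sizes) forms the outer-product (Gauss-Newton) approximation H ≈ JᵀJ from the Jacobian of per-sample outputs with respect to the weights; with fewer samples than weights, the rank of H cannot exceed the number of samples, so H is singular and cannot be inverted directly.

    import numpy as np

    # Hypothetical sizes: more weights than training samples.
    n_samples, n_weights = 20, 50
    rng = np.random.default_rng(0)

    # J[i, :] = gradient of the i-th sample's output with respect to the weights.
    J = rng.normal(size=(n_samples, n_weights))

    # Outer-product (Gauss-Newton) approximation to the Hessian.
    H = J.T @ J                      # shape (n_weights, n_weights)

    print(np.linalg.matrix_rank(H))  # at most n_samples = 20, so H is singular
    print(np.linalg.cond(H))         # enormous condition number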

The quadratic approximation is usually valid in the neighborhood of a minimum, though, so it may be useful to use a more robust method for the initial optimization, followed by a few iterations of a fast second-order method to tune the solution. Some theoretical justification is given in [394].
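
One way to realize this two-phase strategy, assuming the error E(w) and its gradient are available as ordinary Python functions of a flattened weight vector, is to run a coarse gradient-descent phase and then hand the result to a canned quasi-Newton routine for a few refining iterations. The sketch below uses SciPy's L-BFGS-B purely as an example of such a routine; the function and parameter names are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def coarse_then_refine(error, grad, w0, coarse_steps=500, lr=0.1):
        """Robust first-order phase followed by a short second-order refinement."""
        w = np.asarray(w0, dtype=float)
        for _ in range(coarse_steps):       # simple, assumption-free descent
            w = w - lr * grad(w)
        # Quasi-Newton polishing near the (presumably) quadratic region.
        result = minimize(error, w, jac=grad, method="L-BFGS-B",
                          options={"maxiter": 20})
        return result.x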

10.8.2 Back-Propagation Is Sometimes Good Enough

It has been said that "back-propagation is the second-best method for everything." That is, there are many algorithms which are faster and give better results in particular situations, but back-propagation and simple variants often do surprisingly well on a wide range of neural network training problems.

Standard optimization methods have been considered for neural network training in many studies. Most report faster training times, smaller errors, and better chance of convergence. This might leave the impression that they are uniformly better. There are cases, however, where back-propagation and its variations are faster or less prone to convergence to poor local minima than more sophisticated algorithms which should be better in theory. In [77], for example, several second-order methods are compared to back-propagation on a simple problem (seven weights plus an additional scaling parameter; the target is a sine function). All achieved smaller training errors, but all took more time. Alpsan et al. [9] evaluated approximately 25 different optimization techniques, including numerous variations of back-propagation, on a single real-world classification problem (classification of brainstem auditory evoked potentials). Quoting from the abstract [9]:

It was found that, comparatively, standard BP was sufficiently fast and provided good generalization when the task was to learn the training set within a given error tolerance. However, if the task was to find the global minimum [i.e., reduce the error to a very small value], then standard BP failed to do so within 100000 iterations, but first order methods which adapt the stepsize were as fast as, if not faster than, conjugate gradient and quasi-Newton methods. Second order methods required the same amount of fine tuning of line search and restart parameters as did the first order methods of their parameters in order to achieve optimum performance.

In the same study, second-order methods showed a greater tendency to converge to local minima and the solutions found generalized worse than those found by first order methods. In other remarks, "None of the more sophisticated second order methods were able to learn [to classify] the training set faster than BP" [9]. Similar results have been reported elsewhere.

Part of the relative success of back-propagation may be due to its simple-mindedness; it makes very few assumptions. Part may also be due to special characteristics of the neural network training problem (for example, the shape of the typical error surface) that conflict with assumptions used by more sophisticated methods developed for other purposes. The remarks in section 10.5.3 about possible causes for the difference in performance on classification and continuous function approximation apply to most methods that assume a quadratic approximation. Basically, in classification, training is often stopped when all output errors are less than a certain value (e.g., 0.2), so the search may end before the quadratic approximation becomes valid and the advantages of the specialized algorithms don't have a chance to come into play. This may also be true, to a lesser degree, even for regression problems where the error tolerance is smaller. Early stopping based on cross-validation tests may also end the game before second-order methods become advantageous.
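
The classification stopping rule described above amounts to a check of the following kind (a sketch; the 0.2 tolerance is the example value used in the text):

    import numpy as np

    def all_within_tolerance(outputs, targets, tol=0.2):
        """Stop training once every output error is smaller than tol."""
        return bool(np.all(np.abs(outputs - targets) < tol))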

Part of back-propagation's success may also be due to network designers adapting to the algorithm. That is, it has been found that tricks such as input variable normalization, the use of tanh instead of sigmoid node functions, the use of {-0.9, 0.9} targets instead of {0, 1}, the use of larger-than-necessary networks with early stopping and/or pruning, and so on seem to make learning easier, and many of these have become standard practice. Some might view this as cheating because it changes the problem to fit the optimization method, but there is certainly no reason to intentionally design networks that are hard to train. Often, when such steps are taken, back-propagation or a simple variant may outperform more sophisticated methods. Some papers comparing conjugate gradients to back-propagation, for example, report on the order of 50,000 epochs to train a 4-4-1 network on the 4-input XOR problem. Back-propagation may actually take this long for the parameters used, but when simple fixes like those mentioned previously are applied, the problem becomes very easy and can generally be solved by back-propagation in a few hundred epochs; 40 to 50 epochs is typical for this problem with Rprop. On the simpler problem, conjugate gradient is not even as fast as back-propagation because it wastes evaluations in precise line searches.
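
A minimal sketch of the kind of preparation mentioned here, assuming a generic training setup with tanh output units (the names are illustrative):

    import numpy as np

    def preprocess(X, labels01):
        """Normalize inputs and recode 0/1 labels as -0.9/+0.9 targets."""
        # Normalize each input variable to zero mean and unit variance.
        Xn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
        # Targets of +/-0.9 keep the optimizer away from the tanh asymptotes.
        targets = np.where(labels01 > 0.5, 0.9, -0.9)
        return Xn, targets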

Finally, aside from performance issues, there are sometimes good reasons for preferring simple algorithms.

  • Simplicity may be important because computational resources are limited.

  • Local algorithms are preferable for integrated-circuit implementation. Complicated matrix manipulations are not feasible on analog retina chips, for example. Simple methods like on-line back-propagation use information that is locally available at the weights being modified.

  • Training time may not be a major consideration because it is usually a one-time procedure. The environment may change very slowly so frequent retraining is not necessary.

  • In some studies, the algorithm is intended to be feasible in terms of what real neurons can do. The fact that more sophisticated algorithms exist is not relevant if they are not used by the system under study.

  • Many sophisticated algorithms simply do not give better or faster solutions than simple variations of back-propagation on the problems considered.

  • Some specialized algorithms are not robust. They do not work in all situations and break down if assumptions are not met. Simple implementations that ignore complications such as round-off error may not work well.

  • More sophisticated algorithms often do not yield good generalization. Second-order methods can often achieve much smaller training set errors than back-propagation, but this may simply amount to overfitting when training data is limited.

10.8.3 Remarks

It is easy to get side-tracked into tinkering with optimization methods. It should be remembered that the optimization method is of secondary importance; factors such as representation, network structure, and choice of error function are more fundamental. If solutions are poor or training times are long, the problem could be more basic than the way the weights are tuned. A good optimization algorithm cannot fix problems introduced by a representation that hides needed information or an error function that does not measure the appropriate thing, after all. Sophisticated algorithms have their place, but they should generally be considered only after other options have been exhausted.

Still, efficiency is important, especially for large networks with large data sets or in cases where many nets will be trained on similar data, so there are situations where it pays to "optimize the optimization method." If sophisticated algorithms are needed, consider using one of the highly efficient "canned" optimization programs that are available rather than writing from scratch. The best programs are already debugged, handle many special cases, and correctly deal with important implementation details that are seldom considered in simple descriptions of the pure algorithms. Some can switch between robust and specialized methods when appropriate. Most programs offer a selection of methods. Generally, it is necessary to understand something of the methods used by the program in order to fit the technique to the problem.
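
As an illustration of the "canned routine" route, the sketch below assumes the network's error and gradient can be evaluated from a single flattened weight vector; the choice of algorithm is then just a parameter, so different canned methods can be compared without rewriting any training code. The function name is hypothetical; only standard SciPy calls are used.

    from scipy.optimize import minimize

    def compare_methods(error, grad, w0, methods=("CG", "BFGS", "L-BFGS-B")):
        """Try several library optimizers on the same flattened-weight problem."""
        results = {}
        for method in methods:
            res = minimize(error, w0, jac=grad, method=method,
                           options={"maxiter": 1000})
            results[method] = (res.fun, res.nit)   # final error, iterations used
        return results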

Other Notes

  • Conjugate gradient descent and Levenberg-Marquardt are the classical methods most often mentioned for neural network training. Both are discussed in detail in this context in [258].

  • Second order methods require the Hessian matrix or an approximation. Some possibilities are described in section 8.6.

  • In several places above, inversion of the Hessian or other large matrices is mentioned as a way of obtaining a solution. In practice, other methods that can handle rank-deficient and poorly conditioned cases are preferred (a small sketch appears after these notes). Techniques are discussed in numerical analysis texts.

  • In optimization, it is usually desirable to find a solution as accurately as possible and overfitting is not considered directly. That is, the objective function is considered to be a completely accurate measure of what is sought; if generalization is one of the goals, it is assumed to be reflected in the objective function (e.g., via penalty terms). Similar considerations are encountered in the development of robust algorithms and the analysis of the sensitivity of the solutions obtained.

  • It should be noted that if training is stopped when the error is less than a given tolerance, then it is usually the case that E ≠ 0 and ∂E/∂w ≠ 0 at the stopping point. This is contrary to the assumptions of some analyses. Some pruning algorithms, for example, assume ∂E/∂w = 0 at the solution, so that the first-order term of a Taylor expansion of the error can be neglected.

  • There is nothing in these techniques that limits them to multilayer network structures. They can be applied to most other (continuous) neural network models as well.

  • The optimization approach to training makes no claim of biological plausibility. Many of the methods are nonlocal, complex, and batch-oriented. The goal is simply to find a good set of weights for the problem at hand. Although back-propagation is not considered to be a particularly plausible model of learning in biological networks, it is at least driven by local computations.

  • Neural networks are sometimes used as optimizing systems themselves, for example, the proposed use of Hopfield networks for solving the traveling salesman problem. The application of neural networks to problems normally cast as optimization or signal processing problems is considered in detail by Cichocki and Unbehauen [82].

  • Shanno [341] discusses recent work in optimization methods in light of its utility to neural network training.
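
Returning to the note on matrix inversion above, a rank-tolerant alternative to forming the inverse Hessian explicitly is to solve the Newton-like system with a least-squares (SVD-based) routine, which still returns a sensible minimum-norm step when H is rank deficient or poorly conditioned. A minimal NumPy sketch, with H and g assumed given:

    import numpy as np

    def newton_like_step(H, g, rcond=1e-8):
        """Solve H * dw = -g without explicit inversion; tolerant of rank deficiency."""
        dw, *_ = np.linalg.lstsq(H, -g, rcond=rcond)
        return dw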