Chapter 10 - Classical Optimization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

10.2 Factors Affecting the Choice of a Method

Once an objective function has been chosen, many optimization algorithms may be applied. A few well-known methods are mentioned in the following sections. The list is not exhaustive by any means. We attempt to describe the most popular algorithms broadly, including the main assumptions behind them and their main advantages and disadvantages for neural network training. We do not attempt to give full implementation details or to describe special cases which need to be considered in real applications. Details of classical algorithms can be found in many texts, for example [138], [311], [302].

A minor point: optimization can be viewed as maximizing utility or as minimizing cost. To simplify the discussion, we consider it as minimization.

Ideally, all optimization routines would yield the same result, the global minimum of the error function. In practice, though, some methods tend to work better than others in particular situations. There is generally a trade-off between speed, robustness, and the probability of finding the global minimum. Specialized algorithms may be very fast in certain situations but not robust in others. Many methods embody assumptions about the problem that allow efficiencies to be obtained when the conditions hold, but may lead the algorithm astray otherwise. Algorithms that promise to find a global minimum tend to be much slower than less ambitious methods; most are relatively robust, however, because they make few assumptions about the problem.

Selection of an appropriate algorithm depends on many factors. Some include:

Differentiability. Continuous functions allow the use of efficient gradient methods. If the function is differentiable everywhere and the derivatives are easy to evaluate, conjugate gradient methods are often recommended if local minima are not an extreme problem.

If the function is not differentiable or derivatives are not available, then less efficient evaluation-only methods may be necessary. Gradient methods can still be used with gradient estimates obtained by finite differences. If the function is known to be smooth but derivatives are not available, the conjugate-directions method is often recommended. If the function is not smooth or has many local minima, then global search methods such as simulated annealing and genetic algorithms may be required.
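
As a rough sketch of the finite-difference approach (the objective f, the weight vector w, and the step size h below are illustrative placeholders), a central-difference gradient estimate in Python might look like this:

    import numpy as np

    def finite_difference_gradient(f, w, h=1.0e-5):
        # Estimate the gradient of f at w by central differences.
        # Each component costs two function evaluations, so the full
        # estimate needs 2*len(w) evaluations -- expensive for large networks.
        g = np.zeros_like(w, dtype=float)
        for i in range(len(w)):
            e = np.zeros_like(w, dtype=float)
            e[i] = h
            g[i] = (f(w + e) - f(w - e)) / (2.0 * h)
        return g

    # Check on a simple quadratic, f(w) = w . w, whose true gradient is 2w.
    f = lambda w: float(np.dot(w, w))
    w = np.array([1.0, -2.0, 0.5])
    print(finite_difference_gradient(f, w))   # roughly [ 2. -4.  1.]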

Classification versus regression. In classification problems, the exact numerical output is not important as long as the class is indicated unambiguously. In continuous function approximation problems, the numerical values are important and target values must be matched closely to obtain a small error.

The advantages of methods using second-order gradient information seem to show up mostly in the latter case. That is, they excel at homing in on the exact minimum once the general basin of attraction has been found, but they are not really much better at finding the basin than other, simpler methods when the function is very nonlinear.
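
To see why second-order information pays off near a minimum, consider the following small Python sketch (the quadratic error, with Hessian A and gradient Aw - b, is a made-up example). A single Newton step, w - H^(-1)g, lands exactly on the minimizer of a quadratic, whereas fixed-step gradient descent would creep toward it over many iterations.

    import numpy as np

    # A quadratic error E(w) = 0.5 w'Aw - b'w has gradient Aw - b
    # and constant Hessian A.
    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])        # Hessian
    b = np.array([1.0, -1.0])

    def gradient(w):
        return A @ w - b

    w = np.array([5.0, 5.0])          # arbitrary starting point
    w_newton = w - np.linalg.solve(A, gradient(w))   # one Newton step
    print(w_newton)                   # equals the exact minimizer
    print(np.linalg.solve(A, b))      # A^(-1) b, the same point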

Local minima. If there are relatively few really poor local minima, then random restarts of a local search algorithm may be sufficient. If there are many poor local minima, consider a global search method such as simulated annealing. Stochastic algorithms like simulated annealing promise convergence to the global minimum with probability 1, but can be very slow.
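
The random-restart strategy is easy to express. In the Python sketch below, local_search stands for any local optimizer (a placeholder, not a specific routine); the best of several runs started from random initial weights is kept.

    import numpy as np

    def random_restarts(f, local_search, dim, n_restarts=10, seed=0):
        # Run a local optimizer from several random starting points and keep
        # the best result. This gives no guarantee of the global minimum, but
        # is often adequate when poor local minima are rare.
        rng = np.random.default_rng(seed)
        best_w, best_err = None, float("inf")
        for _ in range(n_restarts):
            w0 = rng.normal(scale=0.5, size=dim)    # random initial weights
            w, err = local_search(f, w0)            # any local minimizer
            if err < best_err:
                best_w, best_err = w, err
        return best_w, best_err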

Problem size. Algorithms that scale poorly may be adequate for small problems but become impractical for large networks with large training sets. Second-order methods requiring exact evaluation of the Hessian are generally impractical for large networks. Storage and manipulation of large matrices may also be a problem even for methods that approximate the Hessian.
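
The storage cost is easy to quantify: a network with W weights has a W-by-W Hessian, so the number of entries grows as W squared. A back-of-the-envelope calculation (assuming 8-byte double-precision storage) makes the point:

    # Memory needed to store a dense Hessian, assuming 8-byte doubles.
    # A network with W weights has a W-by-W Hessian.
    for n_weights in (1_000, 10_000, 100_000):
        gigabytes = n_weights ** 2 * 8 / 1.0e9
        print(f"{n_weights:>7} weights -> {gigabytes:10.3f} GB")
    # Output: 0.008 GB, 0.800 GB, and 80.000 GB respectively.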

Robustness. Does the algorithm work well when preconditions are not satisfied exactly and when parameters are not optimally tuned? In general, robust algorithms make fewer assumptions and so tend to be slower than more optimistic algorithms whose assumptions happen to suit the given problem.

Meta-optimization. Back-propagation, for example, has a learning rate that must be selected; if momentum is used, it also must be given a value. Other algorithms have their own parameters that must be selected. Tuning these parameters may be a meta-optimization problem in itself and algorithms that are overly sensitive to parameter choices will be difficult to use for general problems. Meta-optimization may not be feasible for large problems where each training session takes days or weeks.
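
A crude but common form of meta-optimization is a grid search over the training parameters. In the sketch below, train is a placeholder for a routine that runs one complete training session and returns a validation error; every point on the grid costs a full run, which is what makes meta-optimization infeasible when a single session takes days.

    import itertools

    def grid_search(train, learning_rates, momenta):
        # Exhaustive meta-optimization over two training parameters.
        # 'train' is a placeholder that runs one full training session
        # and returns a validation error.
        best_params, best_err = None, float("inf")
        for lr, mom in itertools.product(learning_rates, momenta):
            err = train(learning_rate=lr, momentum=mom)
            if err < best_err:
                best_params, best_err = (lr, mom), err
        return best_params, best_err

    # e.g. grid_search(train, [0.001, 0.01, 0.1], [0.0, 0.5, 0.9])
    # costs nine complete training sessions.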

Is the data noisy? If the data is noisy and there are not enough examples to suppress noise by averaging, then the observed errors must be viewed as random estimates and precise minimization of the training error will not necessarily correspond to minimization of the true expected error.

This is a generalization problem. If the problem is addressed by adding regularization terms to the objective function, then precise minimization may have real benefits and second-order methods may have advantages. If the problem is not addressed at all, then back-propagation (and other simple gradient descent methods) may be better because of their relative inability to locate the precise minimum. If the problem is addressed by early stopping, then the search may never enter the precise-minimization phase where second-order methods excel.
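
A minimal sketch of early stopping, with train_one_epoch and validation_error standing in for the actual training and validation code (both are placeholders): training halts once the validation error has failed to improve for a fixed number of epochs, so the search never enters the precise-minimization phase.

    def train_with_early_stopping(train_one_epoch, validation_error,
                                  max_epochs=1000, patience=10):
        # Halt training when the validation error has not improved for
        # 'patience' consecutive epochs. Both routines are placeholders
        # for the actual training and validation code.
        best_err, best_epoch = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch()
            err = validation_error()
            if err < best_err:
                best_err, best_epoch = err, epoch   # also save the weights here
            elif epoch - best_epoch >= patience:
                break                               # no recent improvement; stop
        return best_err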

Clustering. Is the data smoothly distributed or clustered? If the data falls in well-separated clusters that correspond to target classes, learning is usually easier because fine distinctions are not required. As long as the decision boundary falls in the open space between clusters, its exact location is not critical, so more solutions are feasible. When the clusters are well separated, on-line methods may be faster than batch methods because of redundancy in the data; almost all the information is contained in the cluster centroids.
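
The difference between batch and on-line updating can be written compactly. In the sketch below, gradient is a placeholder for the back-propagation step on a single pattern; the batch version accumulates the gradient over the whole training set before changing the weights, while the on-line version updates after every pattern and can therefore exploit redundancy in clustered data.

    import numpy as np

    def batch_epoch(w, patterns, gradient, lr=0.1):
        # One batch update: accumulate the gradient over every pattern,
        # then change the weights once.
        g = sum(gradient(w, x, t) for x, t in patterns)
        return w - lr * g

    def online_epoch(w, patterns, gradient, lr=0.1, rng=None):
        # One on-line epoch: update the weights after each pattern,
        # visiting the patterns in random order.
        if rng is None:
            rng = np.random.default_rng(0)
        for i in rng.permutation(len(patterns)):
            x, t = patterns[i]
            w = w - lr * gradient(w, x, t)
        return w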

In general, optimization algorithms can be classified as deterministic or stochastic. Most deterministic optimization algorithms can be classed as evaluation-only methods or gradient-based methods. The advantage of evaluation-only methods is their simplicity; no gradient calculation is required. They are most useful where internal system details are not accessible, where the objective function is not differentiable, or in complex systems where derivatives are difficult to calculate correctly. Their main disadvantage is inefficiency. It is almost always worth using gradient information if it can be obtained because convergence is usually much faster; the gradient points in precisely the direction that increases the function the fastest, after all. Because the gradient is easily calculated (by the back-propagation step) in MLPs with differentiable node nonlinearities, most neural network training algorithms use it. Even more information is available in second-order derivatives so methods that use the Hessian matrix, or approximations to it, can be very efficient under certain conditions. They may be impractical for large problems, however, because of storage and computation requirements involved in dealing with large matrices.

A problem with deterministic algorithms is that they always converge to the same endpoint given the same starting point. The system will converge either to a local minimum or to the global minimum depending on the starting point. When local minima are rare and the basin of attraction of the global minimum is large, a few repetitions of the algorithm with different starting points may be enough to find good solutions. In other cases, more powerful methods are needed. The advantage of stochastic methods is that every state has a nonzero chance of occurrence so, if the procedure runs long enough, the global minimum will be visited eventually. Under certain conditions, many stochastic algorithms can promise a high likelihood of convergence to the global minimum. The problem is that this may take a very long time and guarantees are lost if the algorithm is terminated early.
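
The following Python sketch of simulated annealing (the cooling schedule and step size are illustrative, not tuned) shows the behavior described above: uphill moves are accepted with probability exp(-delta/T), so every state remains reachable, but the promise of finding the global minimum holds only in the limit of a slow schedule.

    import math
    import numpy as np

    def simulated_annealing(f, w0, n_steps=10000, t0=1.0, cooling=0.999,
                            step_size=0.1, seed=0):
        # Minimize f by simulated annealing with a geometric cooling schedule.
        # Uphill moves are accepted with probability exp(-delta/T), so the
        # search can escape local minima.
        rng = np.random.default_rng(seed)
        w, e, t = np.array(w0, dtype=float), f(w0), t0
        best_w, best_e = w.copy(), e
        for _ in range(n_steps):
            w_new = w + rng.normal(scale=step_size, size=w.shape)  # random move
            e_new = f(w_new)
            if e_new < e or rng.random() < math.exp(-(e_new - e) / t):
                w, e = w_new, e_new
                if e < best_e:
                    best_w, best_e = w.copy(), e
            t *= cooling                                           # cool down
        return best_w, best_e

    # Example: a one-dimensional function with several local minima.
    f = lambda w: float(w[0] ** 2 + 3.0 * np.sin(5.0 * w[0]))
    print(simulated_annealing(f, [4.0]))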