
Chapter 10 - Classical Optimization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

10.5 First-Order Gradient Methods

The main disadvantage of evaluation-only methods is their relative inefficiency. When gradient information is available, it is almost always worth using because it tells exactly which parameter changes will reduce the error most at the current point. In digital simulations at least, it is easy to calculate the gradient for MLPs with differentiable node nonlinearities, so most training methods are gradient based.

10.5.1 Gradient Descent

As noted, back-propagation is a variety of gradient descent. In gradient descent, new points are obtained by moving along the (negative) gradient direction. That is,

w(t+1) = w(t) - η g     (10.2)

where g = ∂E/∂w is the gradient, E(w) is the error function evaluated at w(t), and η is a step size or learning rate parameter. For a true approximation to gradient descent, η should be very small, η → 0 (figure 10.2a), but larger values are often used in practice to speed up convergence. Batch-mode back-propagation with a small learning rate is a specialization of gradient descent to MLPs where the back-propagation step is just a way of calculating the gradient by application of the derivative chain rule in the MLP structure. Back-propagation and its variations are discussed in chapters 5 and 9.
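
As a concrete illustration, here is a minimal Python sketch of update (10.2) applied to a simple quadratic error surface. The function name, the example Hessian, and the step size are assumptions made for this example, not taken from the text.

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.01, n_steps=1000):
    """Minimal sketch of the gradient descent update (10.2).

    grad_E : function returning the gradient g = dE/dw at w
    w0     : initial weight vector
    eta    : step size (learning rate)
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        g = grad_E(w)          # evaluate the gradient at the current point
        w = w - eta * g        # move along the negative gradient
    return w

# Example: quadratic error E(w) = 0.5 * w^T H w, whose gradient is H w.
H = np.array([[2.0, 0.0],
              [0.0, 10.0]])
w_final = gradient_descent(lambda w: H @ w, w0=[1.0, 1.0], eta=0.05)
```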

Gradient descent is not highly regarded in the optimization community mainly because of its slow rate of convergence. In theory, the asymptotic rate of convergence is linear. That is, the error is reduced by a constant factor at each step, |e(k+1)| ≤ C|e(k)| where C < 1 is a constant. One problem is that the gradient never points to the global minimum except in the case where the error contours are spherical, so many small steps are needed to arrive at the minimum. Another problem is that the gradient vanishes at a minimum so Δw approaches 0 and final convergence is very slow. Slow convergence is also a problem when the Hessian is poorly conditioned, that is, when the gradient changes rapidly in some directions but slowly in others. This is the case in so-called ravines of the error surface. When the step size is too large, the weight state may bounce from one side of the ravine to the other while making only slow progress along the ravine towards the true minimum (figure 10.2b), an effect sometimes called cross-stitching. (Section A.2 discusses the convergence rate of gradient descent in linear problems.) The possibility of convergence to local minima is another common criticism applicable to all gradient-based methods.

Figure 10.2: Gradient descent trajectories: (a) With a small step size (0.01), gradient descent follows a smooth trajectory, but progress may be very slow. (b) Cross-stitching: When the step size is too big (0.1) and the error surface has valleys, the trajectory may oscillate wildly from one side to the other while creeping slowly along the length of the valley to the minimum. Once it reaches the minimum, it may overshoot many times before settling.

The learning rate η is a parameter that must be selected. When it is too small, convergence will be very slow. When it is too large, the procedure may diverge (see figure 10.2). In practice, noninfinitesimal values of η are used in order to speed up convergence, and true gradient descent is only approximated. This is an unimportant detail in most cases. Stability requires that 0 < η < 2/λmax, where λmax is the largest eigenvalue of the Hessian matrix. The optimum value is η = 1/λmax and with this choice the convergence rate is governed by the slowest time constant

τ ≈ λmax/λmin

where λmin is the smallest nonzero eigenvalue (section A.2). When the Hessian is badly conditioned (i.e., nearly singular), the time constant can be large and convergence very slow. For nonlinear functions, the Hessian changes as the weight vector moves over the error surface. It is expensive to reevaluate the Hessian at each point, so either a small learning rate is chosen arbitrarily or η is adjusted heuristically.
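
The bounds above are easy to check numerically. The short sketch below uses an assumed example Hessian (not from the text) to compute the stability limit 2/λmax, the step size 1/λmax, and the corresponding slowest time constant.

```python
import numpy as np

# Assumed example Hessian of a quadratic error surface E(w) = 0.5 * w^T H w.
H = np.array([[2.0, 0.5],
              [0.5, 10.0]])

eigvals = np.linalg.eigvalsh(H)     # eigenvalues of the symmetric Hessian, ascending
lam_min, lam_max = eigvals[0], eigvals[-1]

eta_limit = 2.0 / lam_max           # stability requires 0 < eta < 2 / lambda_max
eta = 1.0 / lam_max                 # the choice discussed in the text
tau_slow = lam_max / lam_min        # slowest time constant, roughly the condition number

print(f"stable for eta < {eta_limit:.3f}; eta = {eta:.3f} gives tau ~ {tau_slow:.1f} steps")
```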

For best performance, different values of η are appropriate at different points on the error surface. Moderately large values are useful to reach the vicinity of a minimum quickly, but care has to be taken to avoid node saturation. Once a minimum has been located approximately, smaller step sizes are needed to avoid cross-stitching and allow the system to settle to a minimum. This is the situation where second-order methods usually outperform gradient descent. Effects of different learning rates on back-propagation are described in chapter 6. Variations of back-propagation that tune η dynamically are described in chapter 9.

Figure 10.3: Gradient descent trajectories with momentum. With momentum, opposite (side-to-side) changes tend to cancel while complementary changes (along the length of the valley) tend to sum. The overall effect is to stabilize the oscillations and accelerate convergence: (a) When the step size is small, momentum acts to accelerate convergence (step size 0.01 and momentum 0.99, cf. figure 10.2a); (b) small amounts of momentum also help to damp oscillations when the step size is too large (step size 0.1 and momentum 0.2, cf. figure 10.2b).

The use of momentum is common in back-propagation. This can be thought of as gradient descent with smoothing controlled by the momentum parameter, 0 < α < 1. Briefly, the weight update rule is

Δw(t) = -η g + α Δw(t-1),   w(t+1) = w(t) + Δw(t)     (10.3)

where, as before, g is the gradient of the error evaluated at w(t) and η is the step size or learning rate parameter. When the step size is too large, momentum helps to suppress cross-stitching because consecutive opposing changes tend to cancel while complementary changes sum. That is, for α → 1, the side-to-side oscillations across the valley effectively cancel while the components along the axis of the ravine add up. When the step size is too small, on the other hand, momentum helps accelerate learning by amplifying the effective learning rate. When the gradient is constant, for example, the effective learning rate is η' = η/(1 - α). Figure 10.3 illustrates these effects. Momentum is discussed in more detail in section 6.2.
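
A minimal sketch of the momentum update (10.3); the function name and default parameter values are illustrative.

```python
import numpy as np

def gradient_descent_momentum(grad_E, w0, eta=0.01, alpha=0.9, n_steps=1000):
    """Sketch of update (10.3): gradient descent plus a momentum term."""
    w = np.asarray(w0, dtype=float)
    delta = np.zeros_like(w)                 # previous weight change, delta_w(t-1)
    for _ in range(n_steps):
        g = grad_E(w)
        delta = -eta * g + alpha * delta     # delta_w(t) = -eta*g + alpha*delta_w(t-1)
        w = w + delta
    return w
```

With a constant gradient, delta settles to -η g/(1 - α), which is the effective learning rate η/(1 - α) mentioned above.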

10.5.2 Best-Step Steepest Descent (Cauchy's Method)

Simple continuous gradient descent is the rule "wherever you are, evaluate the gradient and take a step down it." When the step size is infinitesimal, this produces a curved path that is orthogonal to the contours of the error surface at all points (figure 10.2a). This is slightly different from optimal, or best-step, steepest descent. As the term is used in the optimization community, steepest descent seems to mean Cauchy's method, which uses the rule "wherever you are, evaluate the gradient and then move along that line until you find a minimum." In each major iteration, it evaluates the gradient once and then does a line search to find the minimum along that line (figure 10.4).

Figure 10.4: Cauchy's method uses the rule: wherever you are, evaluate the gradient and then move along that line until you find a minimum. In each major iteration, it evaluates the gradient once and then does a line search to find the minimum along that line.

Although this may be useful in problems where the gradient calculation is expensive, it doesn't seem to offer many benefits for (simulated MLP) neural networks in which the gradient calculation has basically the same cost as a function evaluation. The minimum on the line has no special significance and the calculations used in locating it precisely are basically wasted. Few recommend the method, but it is often used for comparison to show the benefits of more sophisticated methods. In neural network simulations, it is usually slower than simple gradient descent.
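
A rough sketch of Cauchy's method, assuming SciPy's scalar minimizer (scipy.optimize.minimize_scalar) as the line search; the routine and its tolerances are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize_scalar   # 1-D (Brent) line search

def cauchy_descent(E, grad_E, w0, n_iters=50):
    """Best-step steepest descent: one gradient evaluation per major iteration,
    followed by a line search along the negative gradient direction."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad_E(w)
        if np.linalg.norm(g) < 1e-10:        # gradient vanishes at a minimum
            break
        # Minimize the error along the line w - s * g over the step length s.
        step = minimize_scalar(lambda s: E(w - s * g)).x
        w = w - step * g
    return w
```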

10.5.3 Conjugate Gradient Descent

Conjugate gradient descent is one of the most often recommended optimization methods for differentiable functions with many variables. Although it uses only first-order gradient information, it is nearly as fast as full second-order methods. In addition, it is practical for large problems because it avoids the need to store and manipulate the Hessian matrix by assuming a locally quadratic function and using the property of conjugate directions.

As mentioned in section 10.4.3, a set of vectors d1, d2, . . ., dr, r ≤ N, are mutually conjugate with respect to a symmetric N × N matrix H if they are linearly independent and [311]

di^T H dj = 0 for i ≠ j.     (10.4)
Figure 10.5: The conjugate gradient method converges in just N iterations (2 in this case) for an N-dimensional quadratic error function, but each iteration involves a line search that may require many evaluations of the error function.

This is useful because an N-dimensional quadratic function determined by H can be minimized by performing independent one-dimensional searches along any set of N mutually conjugate directions. The directions are uncoupled in the sense that minimizing along direction di doesn't undo gains that were obtained by minimizing along direction dj. Figure 10.5 illustrates the (two) steps taken for a two-dimensional quadratic error surface.
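
The conjugacy property (10.4) and the N-step result are easy to verify on a small example. The snippet below, using an assumed 2 × 2 quadratic, builds two mutually conjugate directions and reaches the minimum in exactly two exact line searches, as in figure 10.5.

```python
import numpy as np

# Assumed example quadratic E(w) = 0.5 * w^T H w - b^T w.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

# H-orthogonalize the coordinate axes to obtain two mutually conjugate directions.
d1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
d2 = e2 - (d1 @ H @ e2) / (d1 @ H @ d1) * d1
assert abs(d1 @ H @ d2) < 1e-12              # property (10.4): d1^T H d2 = 0

# Two exact one-dimensional minimizations, one along each conjugate direction,
# land exactly on the minimum w* = H^{-1} b (cf. figure 10.5).
w = np.zeros(2)
for d in (d1, d2):
    s = d @ (b - H @ w) / (d @ H @ d)        # exact minimizing step along d
    w = w + s * d
print(np.allclose(w, np.linalg.solve(H, b)))  # True
```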

As in Powell's method, a set of conjugate search directions can be obtained without evaluating the Hessian matrix. In practice, the search direction dk for step k is a combination of the current gradient gk and the previous search direction dk-1

dk = -gk + γk dk-1     (10.5)

where the subscripts index time, not elements of the vectors. It has been noted that this could be thought of as back-propagation with the momentum parameter optimally adapted at each step. The parameter γ controls how much the next search direction is influenced by the previous search direction. There are a number of variations differing in how γ is chosen:

γk = (gk^T gk) / (gk-1^T gk-1)   (Fletcher-Reeves)     (10.6)

γk = (gk^T (gk - gk-1)) / (gk-1^T gk-1)   (Polak-Ribiere)     (10.7)

Although these are mathematically equivalent for quadratic functions, they give different results for general nonlinear functions, in which case the Polak-Ribiere form is preferred [302].

Minimization of a nonquadratic function will require more than N line searches but only N conjugate directions are available, so it is necessary to reinitialize the procedure periodically. There are various prescriptions for how and when to restart. The simplest is to reset the search direction to the gradient after N line searches.
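
Putting the pieces together, the sketch below implements nonlinear conjugate gradient descent with the Polak-Ribiere choice of γ (10.7), a SciPy line search, and the simple restart rule just described. It is an illustrative sketch under those assumptions, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_descent(E, grad_E, w0, max_line_searches=200, tol=1e-8):
    """Nonlinear conjugate gradient with Polak-Ribiere gamma and periodic restarts."""
    w = np.asarray(w0, dtype=float)
    N = w.size
    g = grad_E(w)
    d = -g                                         # first direction: steepest descent
    for k in range(max_line_searches):
        # Line search: minimize the error along the current direction d.
        step = minimize_scalar(lambda s: E(w + s * d)).x
        w = w + step * d
        g_new = grad_E(w)
        if np.linalg.norm(g_new) < tol:
            break
        if (k + 1) % N == 0:
            gamma = 0.0                            # restart: reset direction to -g
        else:
            gamma = g_new @ (g_new - g) / (g @ g)  # Polak-Ribiere form (10.7)
        d = -g_new + gamma * d                     # new search direction (10.5)
        g = g_new
    return w
```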

Without restarts, later directions become linearly dependent and convergence is almost always linear [138]. With restarts, convergence is superlinear in theory, but in practice it is nearly always linear [138] because of round-off errors, inexact line searches, the failure of the quadratic assumption, and so on. Still, it is usually much faster than simple gradient descent (which also converges linearly) because the scale factor may be much smaller. Conjugate gradient and back-propagation were compared empirically on neural network N/2/N encoder problems in [360]; conjugate gradient was about an order of magnitude faster and back-propagation rarely converged for the harder (large N) problems, but it was estimated that both have roughly equal median time complexities. That is, training time scaled with N in approximately the same way.

The efficiency of the line search routine is critical because this is where most of the computation time is spent. Often, there are several parameters that must be tuned to obtain good performance. On simple problems, good line search routines may require as few as 3 to 4 function or gradient evaluations, but a typical search may require 10 to 20 evaluations, depending on parameter choices, nonlinearity of the function, and the precision required. It should be noted that some popular line search routines use parabolic approximations which may offer fast convergence for quadratic problems, but which may be slow or divergent for neural network error surfaces with stair-step shapes.
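
For reference, a bare-bones golden-section line search makes the cost explicit: each iteration shrinks the bracket by a constant factor at the price of one function evaluation, so the requested precision determines the number of evaluations. The routine is illustrative, not one recommended by the text.

```python
import numpy as np

def golden_section_search(f, a, b, tol=1e-4):
    """Golden-section line search for the minimum of f on the bracket [a, b]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0      # 1/golden ratio, about 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    n_evals = 2
    while abs(b - a) > tol:
        if fc < fd:                          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
        n_evals += 1
    return 0.5 * (a + b), n_evals
```

Shrinking a unit bracket to a width of 1e-4 takes roughly 20 evaluations, consistent with the 10 to 20 figure quoted above; a looser tolerance, or parabolic interpolation on smooth, nearly quadratic functions, reduces the count.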

Scaled conjugate gradient descent [270] is a variation that eliminates the line search and all tunable parameters, but seems to have properties similar to the basic algorithm. It has been reported to be faster than the basic algorithm in some cases and slower in others. In [9], it was somewhat slower than other conjugate gradient methods.

Conjugate gradient descent has been compared with back-propagation in a number of studies [e.g., [26], [225], [27], [124], [197], [14], [19], [47], [69], [210], [75], [208], [360], [179], [296], [313], [383]]. A detailed discussion and source code are provided in [258]. Most studies report an order of magnitude improvement in the number of iterations to convergence, an improved chance of convergence on hard problems, and smaller final errors. The reader should take care to note, however, whether the iterations reported are the number of line searches or the total number of function (and/or gradient) evaluations. Where function evaluations or run-time are reported, conjugate gradient still appears to be faster, but the difference is not as great.

The results of performance comparisons with back-propagation seem to be task dependent [e.g., [377], [9]]. In some cases, conjugate gradient is much faster; in others, it seems to have no advantage, or is slower. In [239] conjugate gradient worked well on simple problems, but simply converged (quickly) to a poor local minimum on a more difficult problem. Alpsan et al. [9] suggest that the difference depends on whether the problem is function approximation or classification. It is suggested that conjugate gradient (and other advanced methods) outperform back-propagation on function approximation problems, where it is necessary to reduce the error to small values, and that back-propagation generally does better on classification problems, where training is stopped as soon as all patterns are correctly classified within a loose tolerance band. In [9], on a single network training problem, the time for conjugate gradient descent to correctly classify all patterns was similar to that of back-propagation without momentum and only half the speed of back-propagation with momentum. Generalization was poor.

An explanation for these results is that for very nonlinear functions, the quadratic approximation is not valid at a large scale so the conjugate gradient method has no advantage over simpler methods in initial stages of search (and its exact line searches may be superfluous). The quadratic approximation is usually valid, however, in a small region around an optimum, in which case conjugate gradient converges very quickly whereas gradient descent may actually slow down. For classification problems, the relaxed error criteria may be satisfied over a large region around a minimum so the search may end before the quadratic approximation becomes valid and the advantages of conjugate gradient come into play. For continuous regression problems, however, the region satisfying the error criteria is often much smaller so the search continues into a phase where conjugate gradient is by far superior.

Conjugate gradient descent was compared with (among other methods) back-propagation with an adaptive stepsize by Barnard and Holm [20]. The conjugate gradient method had lower errors and better generalization in early stages, but an adaptive algorithm achieved the same results after more iterations.