
Chapter 10 - Classical Optimization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

10.6 Second-Order Gradient Methods

Gradient methods using second derivatives (the Hessian matrix) can be very efficient under certain conditions. Where first-order methods use a linear local approximation of the error surface, second-order methods use a quadratic approximation. If the function really is quadratic, a solution can be found very quickly (in one step with Newton's method). The approximation is often good in the vicinity of a minimum, in which case final convergence can be very fast. On a larger scale, however, the approximation is often poor, so other methods may be better for initially finding the general neighborhood of a minimum. The Levenberg-Marquardt method effectively switches from gradient descent to Newton's method as a minimum is approached.

10.6.1 Newton's Method

For smooth functions, Newton's method is the theoretical standard by which other optimization methods are judged. Because it uses all of the first- and second-order derivative information in exact form, its local convergence properties are excellent. Unfortunately, it is often impractical because explicit calculation of the full Hessian matrix can be very expensive for large problems. Many 'second-order' methods therefore use approximations built up from first-order information. Some methods for calculating and approximating second derivatives in neural networks are reviewed by Buntine and Weigend [62].

Newton's method is based on a quadratic model of the error function

$$\hat{E}(\mathbf{w}) = E(\mathbf{w}_0) + \mathbf{g}^T(\mathbf{w}-\mathbf{w}_0) + \tfrac{1}{2}(\mathbf{w}-\mathbf{w}_0)^T \mathbf{H}\,(\mathbf{w}-\mathbf{w}_0) \qquad (10.8)$$

where $\mathbf{g} = \partial E/\partial\mathbf{w}$ is the gradient vector evaluated at $\mathbf{w}_0$ and $\mathbf{H}$ is the Hessian matrix of second derivatives with elements $H_{ij} = \partial^2 E/\partial w_i \partial w_j$; H is assumed to be positive definite. This is simply a Taylor series approximation truncated after the second-order terms. Taking the derivative of $\hat{E}(\mathbf{w})$ with respect to $\mathbf{w}$ and setting it equal to 0 gives

$$\frac{\partial \hat{E}}{\partial \mathbf{w}} = \mathbf{g} + \mathbf{H}(\mathbf{w}-\mathbf{w}_0) = 0 \qquad (10.9)$$

which has the solution $\mathbf{w}^{*} = \mathbf{w}_0 - \mathbf{H}^{-1}\mathbf{g}$. If H is positive definite and E(w) really is quadratic, the solution can be obtained in a single step (figure 10.6). Usually, of course, E is not exactly quadratic, so iteration is necessary

Figure 10.6: For quadratic error functions, Newton's method converges in a single step. For more general nonlinear functions, Newton's method converges very quickly where the Hessian matrix is positive definite (e.g., in the vicinity of a minimum) but may diverge elsewhere.
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\,\mathbf{H}_k^{-1}\mathbf{g}_k \qquad (10.10)$$

where η is a step-size parameter (normally η = 1).

The advantage of Newton's method is that it converges very quickly in the vicinity of a minimum. Convergence is quadratic for quadratic error functions; that is, in the limit $|\mathbf{e}_{k+1}| \le C|\mathbf{e}_k|^2$ where $\mathbf{e} = \mathbf{w} - \mathbf{w}^*$. Convergence requires that H be positive definite and $0 < \eta < 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of H. In any case, η should be small enough to stay in the region where the quadratic approximation is valid.
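To make the iteration (10.10) concrete, here is a minimal Python sketch (our own illustration, not from the text; the routine names and the toy quadratic test function are assumptions). It takes user-supplied gradient and Hessian routines and solves a linear system rather than inverting H explicitly.

```python
import numpy as np

def newton_iteration(w, grad_fn, hess_fn, eta=1.0, n_steps=20, tol=1e-10):
    """Damped Newton iteration w <- w - eta * H^{-1} g, cf. (10.10).

    grad_fn(w) returns the gradient vector g; hess_fn(w) returns the
    Hessian matrix H, assumed positive definite along the trajectory.
    """
    for _ in range(n_steps):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:       # gradient essentially zero: stop
            break
        H = hess_fn(w)
        step = np.linalg.solve(H, g)      # solve H * step = g instead of forming H^{-1}
        w = w - eta * step
    return w

# Toy quadratic E(w) = 0.5 * w^T A w - b^T w (our own example, not from the text).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_min = newton_iteration(np.zeros(2),
                         grad_fn=lambda w: A @ w - b,
                         hess_fn=lambda w: A)
```

On the quadratic example the very first step already lands on the minimum $A^{-1}b$, as figure 10.6 suggests; for a general network error function, `hess_fn` would have to supply the full Hessian, which is exactly the expensive part discussed below.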

A problem with the method is that it converges only where the Hessian is positive definite. (A symmetric matrix H is positive definite if $\mathbf{x}^T\mathbf{H}\mathbf{x} > 0$ for all $\mathbf{x} \ne 0$. If the result is never negative but equals 0 for some $\mathbf{x} \ne 0$, H is positive semidefinite. If the sign is positive or negative depending on x, H is indefinite. All eigenvalues of a positive-definite matrix are positive. Some eigenvalues of a positive-semidefinite matrix are zero. An indefinite matrix has eigenvalues of both signs.)
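These cases can be checked numerically by inspecting the signs of the eigenvalues of the symmetric matrix; the small helper below is our own illustration, not part of the text.

```python
import numpy as np

def classify_symmetric(H, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(H)              # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"           # x^T H x > 0 for all x != 0
    if np.all(eig >= -tol):
        return "positive semidefinite"       # x^T H x = 0 for some x != 0
    if np.all(eig < -tol):
        return "negative definite"
    return "indefinite"                      # eigenvalues of both signs

print(classify_symmetric(np.array([[2.0, 0.0], [0.0, -1.0]])))   # "indefinite"
```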

H could easily be indefinite for general nonlinear functions, especially far from a minimum, and it is common for it to be badly conditioned (nearly singular) in neural networks, especially if the number of training samples is smaller than the number of weights. Section 8.6 summarizes some properties of the Hessian in typical neural networks. When H has one or more negative eigenvalues, E has negative curvature in some directions, and the quadratic model then suggests that arbitrarily large steps would reduce the error by arbitrarily large amounts. Such steps would likely move the system out of the region where the model is valid and, if very large, could lead to saturation of the sigmoid node functions. H might then be rank deficient if saturation makes enough second derivatives small relative to machine precision [28].

Modifications of the pure algorithm are necessary to prevent divergence in such cases. Ideas include limitations on the maximum step length, the use of line searches along the Newton direction when the step fails to decrease the error, and the use of matrix decompositions to solve (10.9).

Another problem, purely practical, is the need to evaluate, store, and invert the Hessian matrix at each iteration. In practice, dozens of iterations may be needed. Storage of the matrix requires O(N²) space and could become a problem on small computers and workstations as N exceeds 1000–2000. Exact evaluation of the matrix at a single point requires computation equivalent to approximately O(N²) epochs in a network with N weights. A simpler method might be able to make great gains with those N² function evaluations, possibly even solving the problem before Newton's method can take a single step. Finally, solution of (10.9) requires approximately O(N³) operations whether the matrix is actually inverted or not. (These are simple operations like addition and multiplication, however, rather than complex operations like evaluation of the network error function.) The N² evaluations will probably take more time for most neural networks when the training set is large because evaluation of each pattern takes O(N) time and it can be argued that the number of training patterns should be at least O(N) to ensure good generalization, in which case exact evaluation of the matrix could scale like O(N⁴).
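As a rough worked example of these scalings (our own numbers, for illustration only): with N = 1,000 weights the Hessian has N² = 10⁶ entries (about 8 MB in double precision), exact evaluation costs on the order of 10⁶ epoch-equivalents, solving (10.9) takes on the order of N³ = 10⁹ elementary operations, and with O(N) training patterns each requiring O(N) work per evaluation, the exact Hessian evaluation grows like N⁴ = 10¹² operations.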

The problem is illustrated in [38], where a modern nonlinear least squares optimization program (NL2SOL, ACM Algorithm 573) using the exact Hessian generally achieved smaller errors than back-propagation, but training times were an order of magnitude longer (39 hours for 50 iterations versus 2.2 hours for 500,000 back-propagation iterations); the paper also describes a hybrid quasi-Newton alternative.

10.6.2 Gauss-Newton

Calculation of the exact Hessian matrix can be expensive, so approximations are often used. The Gauss-Newton and Levenberg-Marquardt techniques take advantage of special structure in the Hessian for least squares problems and use the outer-product approximation of section 8.6.3. As noted in section 8.6.3, the Hessian can be written as

$$\mathbf{H} = \mathbf{P} + \mathbf{Q} \qquad (10.11)$$

where $\mathbf{P} = \langle \mathbf{g}\mathbf{g}^T \rangle$ is the average outer product of first-order gradient terms while $\mathbf{Q}$, with elements $q_{ij}$, contains the second-order derivative terms. Because P is the sum of outer products of the gradient vector, it is real, symmetric, and thus nonnegative definite. Gauss-Newton is based on the assumption that the first-order terms dominate the second-order terms near a solution. Of course, this assumption is not always valid, but the approximation seems to work reasonably well, especially when the larger algorithm does not depend too heavily on its accuracy.

The Gauss-Newton weight update $\Delta\mathbf{w}$ is a solution of

$$\mathbf{P}\,\Delta\mathbf{w} = -\mathbf{g} \qquad (10.12)$$

This could be solved by inverting P, although other methods have practical advantages. Convergence is eventually quadratic once Q vanishes, leaving P as an accurate approximation of H. This avoids the need for a costly exact evaluation of H, but it still requires storage of the matrix and solution of a matrix equation.
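For a sum-of-squares error with residual vector e(w) and Jacobian J (with $J_{pi} = \partial e_p/\partial w_i$), the outer-product matrix is $P = J^T J$ and the gradient is $g = J^T e$, so a single Gauss-Newton step might be sketched as follows (an illustration in standard least squares notation, not code from the text; a production implementation would factor J rather than form P explicitly):

```python
import numpy as np

def gauss_newton_step(w, residual_fn, jacobian_fn):
    """One Gauss-Newton step: solve P * dw = -g with P = J^T J, g = J^T e.

    residual_fn(w) returns the residual vector e (one entry per pattern);
    jacobian_fn(w) returns the Jacobian J with J[p, i] = d e_p / d w_i.
    """
    e = residual_fn(w)
    J = jacobian_fn(w)
    P = J.T @ J                   # outer-product approximation of the Hessian
    g = J.T @ e                   # gradient of 0.5 * ||e||^2
    dw = np.linalg.solve(P, -g)   # Gauss-Newton direction, cf. (10.12)
    return w + dw
```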

Figure 10.7: The Levenberg-Marquardt method is a compromise between Newton's method, which converges very quickly in the vicinity of a minimum but may diverge elsewhere, and gradient descent, which converges everywhere, albeit slowly. In general, the trajectory starts out like gradient descent but gradually approaches the Newton direction.

10.6.3 The Levenberg-Marquardt Method

A problem with Newton's method is that it converges only where H is positive definite, but H could easily be indefinite for a general nonlinear function, especially far from a minimum. The Levenberg-Marquardt method (figure 10.7) is a compromise between Newton's method, which converges quickly near a minimum but may diverge elsewhere, and gradient descent, which converges everywhere, though slowly. The search direction is a linear combination of the steepest descent direction $-\mathbf{g}$ and the Newton direction $-\mathbf{H}^{-1}\mathbf{g}$

$$\Delta\mathbf{w} = -(\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{g} \qquad (10.13)$$

Parameter λ controls the compromise. This can be viewed as forcing H + λI to be positive definite by adding a scaled identity matrix. The minimum value of λ needed to achieve this depends on the eigenvalues of H. The algorithm starts with λ large and adjusts it dynamically so that every step decreases the error. Generally λ is held near the smallest value that causes the error to decrease. In the early stages, when λ is large, the system effectively does gradient descent. In later stages, λ approaches 0, effectively switching to Newton's method for final convergence.

This avoids the divergence problem of Newton's method and the need for a line search, but still calls for storage and inversion of an N × N matrix. In the preceding description, H could be the true Hessian, but in practice the outer-product approximation of section 8.6.3 is usually used so Levenberg-Marquardt is commonly considered to be a first-order method applicable only to least squares problems.
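A minimal sketch of the λ adaptation described above, using the outer-product approximation, appears below (our own simplified illustration; the factor-of-10 update schedule is a common heuristic rather than something prescribed here):

```python
import numpy as np

def levenberg_marquardt(w, residual_fn, jacobian_fn,
                        lam=1e-2, n_steps=100, tol=1e-10):
    """Levenberg-Marquardt sketch: solve (P + lam*I) dw = -g, cf. (10.13),
    adapting lam so that every accepted step decreases the error."""
    e = residual_fn(w)
    err = 0.5 * e @ e
    for _ in range(n_steps):
        J = jacobian_fn(w)
        g = J.T @ e
        if np.linalg.norm(g) < tol:
            break
        P = J.T @ J
        while True:
            dw = np.linalg.solve(P + lam * np.eye(len(w)), -g)
            e_new = residual_fn(w + dw)
            err_new = 0.5 * e_new @ e_new
            if err_new < err:                 # accept: move toward Newton behaviour
                lam = max(lam / 10.0, 1e-12)
                w, e, err = w + dw, e_new, err_new
                break
            lam *= 10.0                       # reject: move toward gradient descent
    return w
```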

The Levenberg-Marquardt method is compared to back-propagation and conjugate gradient descent by Hagan and Menhaj [148]. A variation combined with adaptive stepsize heuristics is described by Kollias and Anastassiou [218]. A detailed discussion and source code are provided in [258]. Results seem to be good on moderately sized problems.

10.6.4 Quasi-Newton Methods

Quasi-Newton methods, sometimes called variable metric methods, build an approximation of the Hessian matrix (or its inverse) iteratively using only first-order gradient information. This removes the need to explicitly calculate and invert the Hessian, but not the need to store the approximation. The two major variations are the Davidon-Fletcher-Powell (DFP) and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods. They differ in details, but BFGS is generally recommended.

From [138], the BFGS update for the approximation $\mathbf{B}_k$ at step k + 1 is

$$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k \mathbf{s}_k \mathbf{s}_k^T \mathbf{B}_k}{\mathbf{s}_k^T \mathbf{B}_k \mathbf{s}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \qquad (10.14)$$

where $\mathbf{s}_k$ is the step taken and $\mathbf{g}_k$ is the gradient vector at iteration k, and $\mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$.

When the search direction $\mathbf{p}_k$ is computed by solving

$$\mathbf{B}_k \mathbf{p}_k = -\mathbf{g}_k \qquad (10.15)$$

and $\mathbf{s}_k = \alpha_k \mathbf{p}_k$ is the step taken after doing a line search along $\mathbf{p}_k$, this simplifies to

$$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{g}_k \mathbf{g}_k^T}{\mathbf{g}_k^T \mathbf{p}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \qquad (10.16)$$

The initial approximation $\mathbf{B}_0$ is usually the identity matrix, so the first step is an iteration of best-step steepest descent.

It is apparent that the formula preserves symmetry. For stability, it is critical that it also preserve positive-definiteness. Some care must be taken in the implementation because loss of positive-definiteness is possible due to limited-precision line searches and numerical round-off errors. Exact line searches are not necessary for convergence, however, and the method often converges faster (in terms of the number of function evaluations) using inexact line searches.
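A compact sketch of the procedure implied by (10.14)–(10.16) might look like the following (our own illustration; a production implementation would update a factorization of $\mathbf{B}_k$ and use a line search enforcing the Wolfe conditions so that $\mathbf{y}_k^T\mathbf{s}_k > 0$):

```python
import numpy as np

def bfgs(w, f, grad_fn, n_steps=100, tol=1e-8):
    """Quasi-Newton (BFGS) minimization maintaining a Hessian approximation B.

    f(w) returns the error and grad_fn(w) its gradient.  Uses a crude
    backtracking line search for illustration only.
    """
    n = len(w)
    B = np.eye(n)                        # B_0 = I: first step is steepest descent
    g = grad_fn(w)
    for _ in range(n_steps):
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)       # search direction, cf. (10.15)
        alpha, fw = 1.0, f(w)
        while f(w + alpha * p) > fw and alpha > 1e-12:
            alpha *= 0.5                 # crude backtracking line search
        s = alpha * p                    # step actually taken
        g_new = grad_fn(w + s)
        y = g_new - g
        if y @ s > 1e-12:                # curvature condition keeps B positive definite
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)  # (10.14)
        w, g = w + s, g_new
    return w
```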

Storage requirements may make the method impractical for large networks, but it may be worth considering for small to moderate size networks. There do not seem to be major performance advantages compared to conjugate gradient methods, however. In [19], BFGS and conjugate gradient with restarts gave similar average errors on three neural network test problems and had similar convergence speed. On a neural network function approximation problem in [103], BFGS was able to achieve lower error than back-propagation, conjugate gradient, and simplex search; only Levenberg-Marquardt achieved a lower error. In [146], BFGS was the fastest and most reliable method in a comparison with conjugate gradient and an adaptive stepsize version of back-propagation on a small problem. Quasi-Newton training is specifically addressed in [2].

Although quasi-Newton methods have fast local convergence, this does not guarantee convergence to a good minimum. In [377], the DFP version converged to good solutions in only a third of 10,000 trials. In [9], DFP and BFGS gave basically the same results on a single real-world MLP training problem. Both located good minima in only half of the trials. In the other half they converged to local minima or suffered from numerical instability. Both were slower than standard back-propagation (with or without momentum) to learn the training set (classify correctly within error tolerances) and generalization was poor.

Limited Memory BFGS One-step limited memory BFGS [138] is a variation that avoids the need to store a matrix. A quasi-Newton update formula is used, but H⁻¹ is taken to be the identity matrix, so only O(N) storage is needed. In implementation, it is similar to conjugate gradients.

The search direction at step k is

$$\mathbf{d}_k = -\mathbf{g}_k + A_k \mathbf{p}_k + B_k \mathbf{y}_k \qquad (10.17)$$

where $A_k$ and $B_k$ are scalar weighting factors

$$A_k = -\left(1 + \frac{\mathbf{y}_k^T \mathbf{y}_k}{\mathbf{p}_k^T \mathbf{y}_k}\right)\frac{\mathbf{p}_k^T \mathbf{g}_k}{\mathbf{p}_k^T \mathbf{y}_k} + \frac{\mathbf{y}_k^T \mathbf{g}_k}{\mathbf{p}_k^T \mathbf{y}_k} \qquad (10.18)$$

$$B_k = \frac{\mathbf{p}_k^T \mathbf{g}_k}{\mathbf{p}_k^T \mathbf{y}_k} \qquad (10.19)$$

and $\mathbf{g}_k$ is the gradient, $\mathbf{p}_k = \mathbf{w}_k - \mathbf{w}_{k-1}$, and $\mathbf{y}_k = \mathbf{g}_k - \mathbf{g}_{k-1}$.
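In code, the direction costs only a handful of inner products and O(N) storage per step; the following hedged sketch (our own variable names, not code from the text) adds a steepest-descent fallback for the first iteration:

```python
import numpy as np

def memoryless_bfgs_direction(g, p, y):
    """One-step (memoryless) BFGS search direction, cf. (10.17)-(10.19).

    g: current gradient; p = w_k - w_{k-1}: previous step;
    y = g_k - g_{k-1}: change in gradient.  Falls back to steepest
    descent on the first iteration or when p^T y is too small.
    """
    py = p @ y
    if abs(py) < 1e-12:
        return -g                                             # steepest-descent fallback
    A = -(1.0 + (y @ y) / py) * (p @ g) / py + (y @ g) / py   # (10.18)
    B = (p @ g) / py                                          # (10.19)
    return -g + A * p + B * y                                 # (10.17)
```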

An advantage of the method over conjugate gradients is its tolerance of inexact line searches, which allows significant computational savings. In an MLP training test for a real-world problem, computation time was reduced by a factor of eight relative to conjugate gradient with exact line searches even though four times as many inexact line searches were required [29] (as reported by Alpson et al. [9]). Formula (10.17) is used in [27], and a variation is described in [28].