The Hessian, H, of the error with respect to the weights is the matrix of second derivatives with elements Hij = ∂²E/(∂wi ∂wj).
Knowledge about the Hessian is useful for a number of reasons:
The convergence of many optimization algorithms is governed by characteristics of the Hessian matrix. In second-order optimization methods, the matrix is used explicitly to calculate search directions. In other cases, it may have an implicit role. Slow convergence of gradient descent, for example, can often be explained as an effect of an ill-conditioned Hessian (see appendix A.2). The eigenvalues of H also determine how large the learning rate can be before learning becomes unstable. When H is known, an optimal rate can be chosen; in other cases, a more conservative choice must be made.
Some pruning algorithms use Hessian information to decide which weights to remove. "Optimal brain damage" uses a diagonal approximation while "Optimal brain surgeon" uses the full Hessian.
The inverse Hessian can be used to calculate confidence intervals for the network outputs [44], [399].
Hessian information can be used to calculate regularization parameters [44, section 10.4].
Hessian information can be used for fast retraining of an existing network when additional training data becomes available [41].
At a local minimum of the error function, the Hessian will be positive definite. This provides a way to determine if learning has stopped because the network reached a true minimum or because it "ran out of gas" on a flat spot.
Bishop [44, section 4.10] and Buntine and Weigend [62] provide summaries of a number of methods for calculation and approximation of the Hessian in neural networks. (The preceding list is partly drawn from [44, section 4.10].)
Sometimes the Hessian is required only as a means to obtain the product Hv for some vector v. Pearlmutter [297] describes a fast method for finding this product which does not require evaluation of the Hessian.
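As a rough illustration of the idea that Hv can be obtained for the cost of a few gradient evaluations without ever forming H, the sketch below approximates Hv by a central difference of two gradients. This is the simple finite-difference alternative, not Pearlmutter's exact procedure, and the grad_fn callable and quadratic test problem are assumptions made only for the example.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H v by a central difference of the gradient:
    Hv ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    Costs two gradient evaluations; H itself is never formed or stored."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

# Check on a quadratic error E(w) = 0.5 * w^T A w, whose Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_fn = lambda w: A @ w          # gradient of the quadratic
w = np.array([0.5, -1.0])
v = np.array([1.0, 0.0])
print(hessian_vector_product(grad_fn, w, v))   # ~ A @ v = [3., 1.]
```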
An ill-conditioned matrix has a large ratio |λmax/λmin|, where λmax is the largest eigenvalue and λmin is the smallest (nonzero) eigenvalue. In the Hessian of the error with respect to the weights, this reflects the fact that the gradient changes slowly along one direction (determined by the eigenvector associated with λmin) and rapidly along another direction (determined by the eigenvector associated with λmax). The effect this has on gradient descent is discussed in section 10.5.1. Briefly, a small learning rate must be used to avoid instability along the quickly changing direction, but then progress along the slowly changing direction will be slow and convergence times will be long. Appendix A.2 discusses the convergence rate of gradient descent in linear problems.
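This effect can be seen on a simple quadratic error surface. The sketch below is a hypothetical two-dimensional example assuming batch gradient descent on E(w) = ½wᵀHw, for which the update is w ← w − ηHw and stability requires η < 2/λmax.

```python
import numpy as np

# Ill-conditioned quadratic error E(w) = 0.5 * w^T H w, eigenvalues 10 and 0.1.
H = np.diag([10.0, 0.1])
lam_max, lam_min = 10.0, 0.1
print("condition number:", lam_max / lam_min)   # 100

def gradient_descent(eta, steps=200, w0=(1.0, 1.0)):
    w = np.array(w0)
    for _ in range(steps):
        w = w - eta * (H @ w)       # gradient of the quadratic is H w
    return w

# Stable only for eta < 2 / lam_max = 0.2: along each eigendirection the
# error coordinate is multiplied by (1 - eta*lambda) at every step.
print(gradient_descent(eta=0.19))   # ~[0, 0.02]: stable, but creeping along the lambda_min direction
print(gradient_descent(eta=0.21))   # ~[2e8, 0.01]: diverges along the lambda_max direction
```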
It appears that the Hessian is very often ill-conditioned in neural network training problems [37], [331], [92], [93], and it is common for many eigenvalues to be near zero. Sigmoid saturation may cause an effective loss of rank. Intuitively, small eigenvalues can be related to flat regions of the E(w) surface where the error changes very slowly in most directions. It is worth noting that the outer-product approximation (see the following) will be rank-deficient when the number of weights exceeds the number of training patterns; this is not necessarily recommended, but it is not uncommon in neural networks.
An implication of ill-conditioning is that the expected efficiencies of higher order optimization methods may not be realized [331]. Ill-conditioning could lead to numerical instability or very large steps that take the system out of the region where the local approximation is valid. This is a well-known problem with Newton's method, for example, which is addressed by standard methods. The fix used in the Levenberg-Marquardt method is to choose the search direction based on a combination of the gradient and Newton directions.
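As a minimal sketch of that idea (not the full Levenberg-Marquardt algorithm, which also adapts the damping parameter as training proceeds), the step below solves (H + λI)Δw = −g; the matrix H, gradient g, and damping value lam are illustrative stand-ins.

```python
import numpy as np

def lm_step(H, g, lam):
    """Levenberg-Marquardt style search direction: solve (H + lam*I) dw = -g.
    Large lam -> small step roughly along the negative gradient;
    small lam -> (quasi-)Newton step."""
    return np.linalg.solve(H + lam * np.eye(len(g)), -g)

# A nearly singular Hessian makes the raw Newton step huge; damping tames it.
H = np.array([[4.0, 0.0],
              [0.0, 1e-6]])
g = np.array([1.0, 1.0])
print(lm_step(H, g, lam=0.0))    # ~[-0.25, -1e6]: enormous step along the flat direction
print(lm_step(H, g, lam=0.1))    # ~[-0.24, -10.0]: damped, far more reasonable
```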
Many of the techniques suggested for accelerating back-propagation are simple algorithmic modifications that optimize certain steps of the procedure without addressing the fundamental problem of ill-conditioning. A more basic way to decrease training time is to modify the problem so that the Hessian is better conditioned. The effectiveness of some common techniques can be explained in terms of their effect on the Hessian:
normalization of inputs to zero-mean values with similar variances (a code sketch follows this list),
use of bias nodes (which remove the mean internally),
use of tanh nonlinearities rather than logistic sigmoids (because the outputs of tanh nodes tend to be zero-mean when their inputs are zero-mean whereas the outputs of sigmoid nodes always have a positive mean), and
preprocessing (e.g., principal components analysis) to remove input redundancy.
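For the first item, a minimal sketch of the usual standardization step is given below; it is a generic illustration, not drawn from the text, and assumes the training inputs are the rows of an array X.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Shift each input variable to zero mean and rescale to unit variance, column-wise.
    Large scale differences between inputs otherwise show up as widely
    spread eigenvalues in the Hessian."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# The same training-set statistics should be applied to any later (test) data.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3)) * [1.0, 50.0, 0.01] + [0.0, 5.0, -3.0]
X_norm, mu, sigma = standardize(X_train)
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))  # ~[0 0 0], ~[1 1 1]
```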
Bishop [41], [39], and [44, section 4.10.5] describes an O(W²) method for calculating the exact Hessian of a feedforward network. The procedure is somewhat involved for a general feedforward network, so the simplified version [44, section 4.10.6] for single-hidden-layer networks is summarized here. For simplicity, the terms are given for a single pattern p; the complete expression is obtained by summing contributions from each pattern. Let indexes i and i' denote input nodes, j and j' denote hidden nodes, and k and k' denote output nodes. Each nondiagonal term of the Hessian involves two weights.
If both weights are in the second layer,
where zj is the output of the jth node, δkk' is the Kronecker delta, and Hkk' is defined at the end of the section.
If both weights are in the first layer,
where xi is the ith input, aj is the weight sum into the jth hidden node, f() is the node nonlinearity function, and δk = ∂ Ep/∂ak is the delta term calculated by back-propagation.
If one weight is in each layer,
The term Hkk' = ∂²Ep/(∂ak ∂ak') describes how errors in different output nodes interact to affect the overall error. For the sum-of-squares error function and linear output units, Hkk' = δkk'.
Second-order optimization methods often require the Hessian matrix or an approximation to it. For mean-squared-error cost functions, the outer-product approximation is a common choice.

Outer-Product Approximation For the mean-square error function, the Hessian can often be approximated reasonably well by the average of outer products of the gradient vectors. This approximation is used by the Gauss-Newton optimization method (section 10.6.2).
The cost function E is the mean squared error between the desired output d and the actual output y, which is a function of the weights wk,

E = ½ <(d − y)²>.

The < > brackets denote the average over the training set. For simplicity, assume y is a scalar. A single weight index (e.g., wj) is used here since it is not necessary to identify the weight by the nodes it connects. The gradient g has components

gj = ∂E/∂wj = −<(d − y) ∂y/∂wj>

and the Hessian H has components

hij = ∂²E/(∂wi ∂wj) = <(∂y/∂wi)(∂y/∂wj)> − <(d − y) ∂²y/(∂wi ∂wj)>.

This can be written

H = P + Q,

where P = <(∂y/∂w)(∂y/∂w)T> is the average outer product of the first-order derivative terms, while Q = −<(d − y) ∂²y/(∂w ∂wT)> contains the second-order terms. Because P is the average of outer products of the first-derivative vector, it is real, symmetric, and thus nonnegative-definite (wTPw ≥ 0 for all ||w|| ≠ 0). The number of training samples must be greater than the number of weights for it to have full rank.
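As an illustration, the sketch below forms P for a small, hypothetical single-hidden-layer network by averaging outer products of the per-pattern derivative vectors ∂y/∂w. The derivatives are obtained here by finite differences for brevity, although back-propagation would normally supply them; the names net_output, output_gradient, and outer_product_hessian are inventions for the example.

```python
import numpy as np

def net_output(w, x, n_hidden=3):
    """Tiny single-hidden-layer net with scalar output: y = v^T tanh(W x)."""
    n_in = len(x)
    W = w[:n_hidden * n_in].reshape(n_hidden, n_in)
    v = w[n_hidden * n_in:]
    return v @ np.tanh(W @ x)

def output_gradient(w, x, eps=1e-6):
    """dy/dw by central differences (back-propagation would normally supply this)."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (net_output(w + dw, x) - net_output(w - dw, x)) / (2 * eps)
    return g

def outer_product_hessian(w, X):
    """P = < (dy/dw)(dy/dw)^T >: the outer-product (Gauss-Newton) approximation."""
    P = np.zeros((len(w), len(w)))
    for x in X:
        g = output_gradient(w, x)
        P += np.outer(g, g)
    return P / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))       # 20 training patterns, 2 inputs
w = rng.standard_normal(3 * 2 + 3)     # 3 hidden units: 6 first-layer + 3 second-layer weights
P = outer_product_hessian(w, X)
print(np.all(np.linalg.eigvalsh(P) >= -1e-8))   # nonnegative-definite, as claimed above
```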
The outer-product approximation is based on the assumption that first-order terms dominate higher-order terms near a minimum. This is reasonable when the residual errors (d − y) are zero-mean, independent, identically distributed values uncorrelated with the second derivatives ∂²y/(∂wi ∂wj). If this is true then, when the number of training points is large, Q should average to zero with a small variance and so can be ignored.
Of course, this assumption is not always valid. If there are too few training points, the variance of Q may be large. Also, if the network is too simple to fit the data, the errors may not be small or may be correlated with the second derivatives of y. (It can be argued that the errors and second derivatives will be correlated in many cases because an overly smooth network function will tend to "shave off the peaks" and "fill in the dips" of the target function. The function tends to peak where the target peaks and dip where it dips, but not quite enough, so at a peak of y the second derivatives are necessarily negative while the local errors d-y tend to be positive.) A second caution is that the approximation may be valid only near points in the training set used to calculate the approximation. In practice, however, the approximation seems to work reasonably well especially when the larger algorithm does not rely too heavily on its accuracy.
The Diagonal Approximation A further approximation is to assume that the Hessian is diagonal. This provides at least some of the Hessian information while avoiding the O(W²) storage and calculation costs demanded by the full approximation. In reality, H would be diagonal only if all weights affected the error independently, that is, only if there are no significant interactions between different weights. Most networks, however, have strong weight interactions, so the Hessian usually has nonnegligible off-diagonal elements. The approximation therefore is not expected to be especially accurate; its main advantage is computational convenience. Effects of the approximation are discussed by Becker and LeCun [37]. A diagonal approximation is used in the "optimal brain damage" pruning method (section 13.2.3).
The second derivatives hkk can be calculated by a modified back-propagation rule in about the same amount of time as a single back-propagation epoch. From [91], the diagonal elements are

hkk = ∂²Ep/∂wij² = (∂²Ep/∂ai²) xj²

where ai is the weighted sum into node i and xj = f(aj) is the output of node j. Here the weights are indexed both by k and by the indexes i and j of the nodes they link. At the output nodes, the second derivatives are

∂²Ep/∂ai² = f'(ai)² − (di − yi) f''(ai)

(for all units i in the output layer). The second derivatives for internal nodes are obtained by a modified back-propagation rule

∂²Ep/∂aj² ≈ f'(aj)² Σi wij² ∂²Ep/∂ai² + f''(aj) Σi wij ∂Ep/∂ai

where the sums run over the nodes i that receive input from node j. The f''(ai) terms are sometimes ignored in the last two equations. This corresponds to using the diagonal of the outer-product approximation and gives guaranteed positive estimates of the second derivatives [91].
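Reusing the hypothetical output_gradient helper from the previous sketch (or any routine that returns the per-pattern derivatives ∂y/∂w), the version with the f'' terms dropped can be accumulated directly as the diagonal of the outer-product approximation, using only O(W) storage.

```python
import numpy as np

def diagonal_outer_product(w, X, output_gradient):
    """Diagonal of the outer-product approximation, accumulated in O(W) storage:
    h_kk ~ < (dy/dw_k)^2 >.  Each term is a square, so the estimates are
    guaranteed nonnegative (the f'' contributions are ignored)."""
    h = np.zeros_like(w)
    for x in X:
        g = output_gradient(w, x)   # per-pattern dy/dw, e.g. from back-propagation
        h += g * g
    return h / len(X)

# With the helpers from the previous sketch:
#   h_diag = diagonal_outer_product(w, X, output_gradient)
#   np.allclose(h_diag, np.diag(outer_product_hessian(w, X)))  # -> True
```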
Finite-Difference Approximation A finite-difference approximation of H is [44, pg. 154]

∂²E/(∂wji ∂wlk) ≈ [ ∂E/∂wji(wlk + ε) − ∂E/∂wji(wlk − ε) ] / (2ε).
Weight wlk is perturbed first by +ε and then by −ε. The first-order derivatives for all weights wji are calculated in each case and the second derivative is approximated by the difference between the two first-order derivatives, divided by 2ε. There are W weights to perturb and each gradient calculation takes O(W) time, so the approximation takes O(W²) time. The approximation errors are of size O(ε²); small ε values are desirable for accuracy, but larger values are desirable to avoid numerical problems.
Because of its simplicity, the finite-difference approximation is useful during debugging to verify the correctness of other evaluation methods.
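A minimal sketch of such a check is given below; grad_fn is an assumed routine returning the gradient of E, and a quadratic test problem is used only because its Hessian is known exactly.

```python
import numpy as np

def finite_difference_hessian(grad_fn, w, eps=1e-4):
    """H[j, k] ~ (dE/dw_j at w_k + eps  -  dE/dw_j at w_k - eps) / (2*eps).
    O(W) gradient evaluations of O(W) cost each -> O(W^2) time overall.
    Useful as a debugging check on other Hessian evaluation methods."""
    n = len(w)
    H = np.zeros((n, n))
    for k in range(n):
        dw = np.zeros(n)
        dw[k] = eps
        H[:, k] = (grad_fn(w + dw) - grad_fn(w - dw)) / (2 * eps)
    return 0.5 * (H + H.T)   # symmetrize to remove small asymmetries from roundoff

# Check against a known Hessian: E(w) = 0.5 * w^T A w has Hessian A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_fn = lambda w: A @ w
print(finite_difference_hessian(grad_fn, np.array([0.3, -0.7])).round(4))  # ~A
```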