
Appendix A - Linear Regression

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

A.2 Gradient Descent

For large input dimensions (or small computers), storage and accurate inversion of R can be a problem, so iterative procedures are often used. In simple gradient descent, R is replaced by I (in fact, R = I only for zero-mean, unit-variance, uncorrelated inputs) and the weight vector moves directly down the local gradient

w(t+1) = w(t) - \eta \frac{\partial E}{\partial w}    (A.12)

The step size η controls how much w changes in each iteration. Although this will eventually converge to the optimal solution, the time required may be very long because the gradient, and thus the size of each step, approaches zero near the minimum.
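
To make the update concrete, the following is a minimal numpy sketch of gradient descent on a made-up two-dimensional quadratic error surface E = Emin + ½(w − w*)ᵀR(w − w*); the values of R, w*, and η are arbitrary illustrative choices, not taken from the text.

import numpy as np

# Made-up example problem: quadratic error surface
# E(w) = E_min + 0.5 * (w - w_star)^T R (w - w_star), so grad E = R (w - w_star).
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # illustrative input correlation matrix
w_star = np.array([1.0, -2.0])      # illustrative optimal weights

def grad_E(w):
    return R @ (w - w_star)

eta = 0.1                           # step size (must satisfy 0 < eta < 2/lambda_max)
w = np.zeros(2)                     # initial weight vector
for t in range(500):
    w = w - eta * grad_E(w)         # equation A.12: step down the local gradient

print("final w:", w)                # close to w_star; progress slows near the minimum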

Stability requires 0 < η < 2/λmax, where λmax is the largest eigenvalue of R (see below). When η > 2/λmax, the system will diverge. Because λmax is usually unknown, a smaller-than-optimal step size must be used, which may further slow convergence.
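
The bound can be checked numerically. The sketch below (same made-up R and w* as above, repeated so it runs on its own) computes 2/λmax with numpy and runs gradient descent with a step size just inside and just outside the bound.

import numpy as np

# Same made-up R and w_star as in the previous sketch, repeated so this runs alone.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, -2.0])

lam_max = np.linalg.eigvalsh(R).max()          # largest eigenvalue of R
print("stability limit 2/lambda_max =", 2.0 / lam_max)

for eta in (1.9 / lam_max, 2.1 / lam_max):     # just inside / just outside the bound
    w = np.zeros(2)
    for t in range(100):
        w = w - eta * R @ (w - w_star)
    print("eta = %.3f  ->  ||w - w*|| = %.3e" % (eta, np.linalg.norm(w - w_star)))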

Convergence Rate of Gradient Descent

Assuming R is full rank with distinct eigenvalues, it can be decomposed as R = V D V^T, where V is the matrix whose columns are the eigenvectors of R and D is the diagonal matrix whose entries are the corresponding eigenvalues. Because the eigenvectors are orthonormal, V is orthogonal, V^T = V^{-1}. After the change of coordinates z = V^T (w - w*), equation A.9 can be written

E = E_{min} + \frac{1}{2} (w - w^*)^T V D V^T (w - w^*)    (A.13)
  = E_{min} + \frac{1}{2} z^T D z    (A.14)

Because D is a diagonal matrix, E is now the sum of N uncoupled components

E = E_{min} + \frac{1}{2} \sum_{k=1}^{N} \lambda_k z_k^2    (A.15)

where zk is the projection of w - w* on the kth eigenvector and λk is the corresponding eigenvalue. The gradient of E with respect to z becomes

\frac{\partial E}{\partial z_k} = \lambda_k z_k    (A.16)

which gives the gradient descent update rule

z(t+1) = z(t) - \eta D z(t) = (I - \eta D) z(t)    (A.17)

The components are decoupled, so we have N independent processes

z_k(t+1) = (1 - \eta \lambda_k) z_k(t)    (A.18)

and after m time steps

z_k(t+m) = (1 - \eta \lambda_k)^m z_k(t)

For zk(t + m) to approach 0 as m becomes large, it is necessary that |1 - ηλk| < 1, which requires λk > 0 and 0 < η < 2/λk for all k, that is,

0 < \eta < \frac{2}{\lambda_{max}}    (A.19)

with fastest convergence occurring at η = 1/λmax. Le Cun, Simard, and Pearlmutter [94] describe an iterative method for estimating λmax in a neural network; see section 6.1.7.
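
The derivation can be verified numerically. The sketch below (again with made-up R and w*) diagonalizes R with numpy.linalg.eigh, runs gradient descent in the original w coordinates, and checks that each transformed component z_k indeed decays by the factor (1 − ηλk)^m after m steps.

import numpy as np

# Made-up R and w_star, as in the earlier sketches.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, -2.0])

lam, V = np.linalg.eigh(R)          # R = V D V^T, columns of V are orthonormal eigenvectors
eta = 0.3                           # any value in (0, 2/lambda_max)
w = np.zeros(2)
z0 = V.T @ (w - w_star)             # transformed coordinates, z = V^T (w - w*)

for m in range(1, 6):
    w = w - eta * R @ (w - w_star)              # gradient descent in the original coordinates
    z_actual = V.T @ (w - w_star)
    z_pred = (1.0 - eta * lam) ** m * z0        # per-mode decay (1 - eta*lambda_k)^m
    print(m, np.allclose(z_actual, z_pred))     # prints True for every m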

From equation A.18, we have

z_k(t+1) - z_k(t) = -\eta \lambda_k z_k(t)

The continuous-time approximation

\frac{dz_k}{dt} = -\eta \lambda_k z_k(t)

has solution

z_k(t) = z_k(0) \, e^{-\eta \lambda_k t}    (A.20)

which shows the exponential nature of the convergence. The overall convergence is limited by the rate of convergence of the slowest component. Using the optimum η = 1/λmax gives

z_k(t) = z_k(0) \, e^{-\lambda_k t / \lambda_{max}}    (A.21)

That is, the overall convergence is governed by the slowest time constant

\tau_{max} = \frac{\lambda_{max}}{\lambda_{min}}

where λmin is the smallest nonzero eigenvalue.
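
The sketch below compares the discrete decay (1 − ηλk)^t of equation A.18 with the continuous-time approximation of equation A.20 and prints the resulting time constants λmax/λk; the eigenvalues are the same illustrative values as before.

import numpy as np

# Eigenvalues of the made-up R used above (approximate values), for illustration.
lam = np.array([2.207, 0.793])
lam_max = lam.max()
eta = 1.0 / lam_max                 # the step size used in equation A.21

t = np.arange(20)
for lam_k in lam:
    discrete = (1.0 - eta * lam_k) ** t        # equation A.18 iterated t times
    continuous = np.exp(-eta * lam_k * t)      # solution A.20 of the continuous approximation
    tau = lam_max / lam_k                      # time constant lambda_max / lambda_k
    print("lambda_k = %.3f   tau = %.2f   max |discrete - continuous| = %.3f"
          % (lam_k, tau, np.max(np.abs(discrete - continuous))))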

There may be a loophole here, however. Using equation A.15, the overall error can then be expressed as

E(t) = E_{min} + \frac{1}{2} \sum_{k=1}^{N} \lambda_k z_k(0)^2 \, e^{-2 \eta \lambda_k t}    (A.22)

If we are satisfied when the error is small but nonzero, say an MSE of 0.001, then components with small zk(0) contribute little to the total error and it is not necessary to wait for them to converge fully. Although small λk values cause slow convergence of the kth component, they also scale down that component's contribution to the total error.
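
The sketch below illustrates this loophole on made-up numbers: with one fast mode and one very slow mode of small λ, the total excess error of equation A.22 falls below a chosen tolerance long before the slow component itself has converged.

import numpy as np

# One fast mode and one very slow mode with a small eigenvalue (made-up values).
lam = np.array([2.0, 0.02])
z0 = np.array([1.0, 1.0])           # initial distance from w* along each eigenvector
eta = 1.0 / lam.max()
tol = 1e-3                          # "small but nonzero" error we are willing to accept

t = np.arange(2001)
z_t = z0[:, None] * (1.0 - eta * lam[:, None]) ** t     # each mode decays geometrically
E_t = 0.5 * np.sum(lam[:, None] * z_t ** 2, axis=0)     # excess error, as in equation A.22

first = np.argmax(E_t < tol)                            # first step with error under tolerance
print("error < tol after step", first)
print("slow component still has |z| =", abs(z_t[1, first]))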

An approximate expression for the distribution of eigenvalues in the case of random, uncorrelated inputs is given by Le Cun, Kanter, and Solla [92], [93] and leads to estimates of the learning time. The following points are made: Nonzero-mean inputs, correlations between inputs, and nonuniform input variances can all lead to increased spread between the minimum and maximum eigenvalues. For nonzero-mean inputs, there is an eigenvalue proportional to N, so λmax is much larger than in the zero-mean case and convergence is correspondingly slower. Subtracting the mean from the inputs suppresses the large eigenvalue and leads to faster training. Nonuniform input variances can also lead to an increased spread in the eigenvalues, which can be suppressed by rescaling the inputs. This result provides justification for centering and normalizing input variables.
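
A small numerical illustration of these points (random made-up data, not from the reference): nonzero means and unequal variances inflate the eigenvalue spread λmax/λmin of the input correlation matrix, and centering and rescaling the inputs removes most of it.

import numpy as np

rng = np.random.default_rng(0)
N, P = 10, 5000                                     # input dimension, number of samples
x = rng.normal(loc=3.0, scale=1.0, size=(P, N))     # nonzero-mean inputs (made-up data)
x[:, 0] *= 5.0                                      # one input with a much larger variance

def eig_spread(data):
    R = data.T @ data / len(data)                   # input correlation matrix E[x x^T]
    lam = np.linalg.eigvalsh(R)
    return lam.max() / lam[lam > 1e-12].min()       # lambda_max over smallest nonzero eigenvalue

x_centered = x - x.mean(axis=0)
x_scaled = x_centered / x_centered.std(axis=0)

print("raw inputs:         ", eig_spread(x))
print("centered:           ", eig_spread(x_centered))
print("centered and scaled:", eig_spread(x_scaled))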

For a multilayer network, the hidden-layer outputs can be viewed as inputs to the following layer. Sigmoid outputs are always positive and therefore have a nonzero mean, whereas symmetric (odd) functions such as tanh are at least capable of producing zero-mean outputs; this helps explain the empirical observation that tanh nonlinearities often train faster than sigmoid nodes. The result also supports the suggestion of scaling the learning rate for each node by 1/M, where M is the number of inputs to the node (see section 6.1.9).
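
The same effect can be seen by treating hidden-unit outputs as inputs to the next layer. In the sketch below (random made-up pre-activations), the correlation matrix of sigmoid outputs, which are all positive, has a much larger eigenvalue spread than that of tanh outputs, which are roughly zero mean.

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(5000, 8))                  # random pre-activations (made-up data)

sigmoid_out = 1.0 / (1.0 + np.exp(-a))          # always positive, so nonzero mean
tanh_out = np.tanh(a)                           # approximately zero mean

def eig_spread(h):
    R = h.T @ h / len(h)                        # correlation matrix of the next layer's "inputs"
    lam = np.linalg.eigvalsh(R)
    return lam.max() / lam.min()

print("sigmoid hidden outputs:", eig_spread(sigmoid_out))
print("tanh hidden outputs:   ", eig_spread(tanh_out))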