
Appendix A - Linear Regression

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

A.2 Gradient Descent

For large input dimensions (or small computers), storage and accurate inversion of R can be a problem, so iterative procedures are often used. In simple gradient descent, R is replaced by I (in fact, R = I only for zero-mean, unit-variance, uncorrelated inputs) and the weight vector moves directly down the local gradient

w(t+1) = w(t) - \eta \frac{\partial E}{\partial w}    (A.12)

The step size η controls how much w changes in each iteration. Although this will eventually converge to the optimal solution, the time required may be very long because the gradient, and thus the size of each step, approaches zero near the minimum.
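
To make the update concrete, the following is a minimal numpy sketch of gradient descent on a made-up two-dimensional quadratic error surface E = Emin + ½(w − w*)ᵀR(w − w*); the values of R, w*, and η are arbitrary illustrative choices, not taken from the text.

import numpy as np

# Made-up example problem: quadratic error surface
# E(w) = E_min + 0.5 * (w - w_star)^T R (w - w_star), so grad E = R (w - w_star).
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # illustrative input correlation matrix
w_star = np.array([1.0, -2.0])      # illustrative optimal weights

def grad_E(w):
    return R @ (w - w_star)

eta = 0.1                           # step size (must satisfy 0 < eta < 2/lambda_max)
w = np.zeros(2)                     # initial weight vector
for t in range(500):
    w = w - eta * grad_E(w)         # equation A.12: step down the local gradient

print("final w:", w)                # close to w_star; progress slows near the minimum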

Stability requires 0 < η < 2/λmax, where λmax is the largest eigenvalue of R (see below). When η > 2/λmax, the system will diverge. Because λmax is usually unknown, a smaller-than-optimal step size must be used, which may further slow convergence.
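
The bound can be checked numerically. The sketch below (same made-up R and w* as above, repeated so it runs on its own) computes 2/λmax with numpy and runs gradient descent with a step size just inside and just outside the bound.

import numpy as np

# Same made-up R and w_star as in the previous sketch, repeated so this runs alone.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, -2.0])

lam_max = np.linalg.eigvalsh(R).max()          # largest eigenvalue of R
print("stability limit 2/lambda_max =", 2.0 / lam_max)

for eta in (1.9 / lam_max, 2.1 / lam_max):     # just inside / just outside the bound
    w = np.zeros(2)
    for t in range(100):
        w = w - eta * R @ (w - w_star)
    print("eta = %.3f  ->  ||w - w*|| = %.3e" % (eta, np.linalg.norm(w - w_star)))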

Convergence Rate of Gradient Descent

Assuming R is full rank with distinct eigenvalues, it can be decomposed as R = V D V^T, where V is the matrix whose columns are the eigenvectors of R and D is the diagonal matrix whose entries are the corresponding eigenvalues. Because the eigenvectors are orthonormal, V is orthogonal, V^T = V^{-1}. After the change of coordinates z = V^T (w - w*), equation A.9 can be written

E = E_{min} + \frac{1}{2} (w - w^*)^T V D V^T (w - w^*)    (A.13)
  = E_{min} + \frac{1}{2} z^T D z    (A.14)

Because D is a diagonal matrix, E is now the sum of N uncoupled components

E = E_{min} + \frac{1}{2} \sum_{k=1}^{N} \lambda_k z_k^2    (A.15)

where zk is the projection of w - w* on the kth eigenvector and λk is the corresponding eigenvalue. The gradient of E with respect to z becomes

\frac{\partial E}{\partial z_k} = \lambda_k z_k    (A.16)

which gives the gradient descent update rule

z(t+1) = z(t) - \eta D z(t) = (I - \eta D) z(t)    (A.17)

The components are decoupled, so we have N independent processes

z_k(t+1) = (1 - \eta \lambda_k) z_k(t)    (A.18)

and after m time steps

z_k(t+m) = (1 - \eta \lambda_k)^m z_k(t)

For zk(t + m) to approach 0 as m becomes large, it is necessary that |1 - ηλk| < 1, which requires λk > 0 and 0 < η < 2/λk for all k, that is,

0 < \eta < \frac{2}{\lambda_{max}}    (A.19)

with fastest convergence occurring at η = 1/λmax. Le Cun, Simard, and Pearlmutter [94] describe an iterative method for estimating λmax in a neural network; see section 6.1.7.
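
The derivation can be verified numerically. The sketch below (again with made-up R and w*) diagonalizes R with numpy.linalg.eigh, runs gradient descent in the original w coordinates, and checks that each transformed component z_k indeed decays by the factor (1 − ηλk)^m after m steps.

import numpy as np

# Made-up R and w_star, as in the earlier sketches.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, -2.0])

lam, V = np.linalg.eigh(R)          # R = V D V^T, columns of V are orthonormal eigenvectors
eta = 0.3                           # any value in (0, 2/lambda_max)
w = np.zeros(2)
z0 = V.T @ (w - w_star)             # transformed coordinates, z = V^T (w - w*)

for m in range(1, 6):
    w = w - eta * R @ (w - w_star)              # gradient descent in the original coordinates
    z_actual = V.T @ (w - w_star)
    z_pred = (1.0 - eta * lam) ** m * z0        # per-mode decay (1 - eta*lambda_k)^m
    print(m, np.allclose(z_actual, z_pred))     # prints True for every m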

From equation A.18, we have

z_k(t+1) - z_k(t) = -\eta \lambda_k z_k(t)

The continuous-time approximation

\frac{dz_k}{dt} = -\eta \lambda_k z_k(t)

has solution

z_k(t) = z_k(0) \, e^{-\eta \lambda_k t}    (A.20)

which shows the exponential nature of the convergence. The overall convergence is limited by the rate of convergence of the slowest component. Using the optimum η = 1/λmax gives

z_k(t) = z_k(0) \, e^{-\lambda_k t / \lambda_{max}}    (A.21)

That is, the overall convergence is governed by the slowest time constant

\tau_{max} = \frac{\lambda_{max}}{\lambda_{min}}

where λmin is the smallest nonzero eigenvalue.
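
The sketch below compares the discrete decay (1 − ηλk)^t of equation A.18 with the continuous-time approximation of equation A.20 and prints the resulting time constants λmax/λk; the eigenvalues are the same illustrative values as before.

import numpy as np

# Eigenvalues of the made-up R used above (approximate values), for illustration.
lam = np.array([2.207, 0.793])
lam_max = lam.max()
eta = 1.0 / lam_max                 # the step size used in equation A.21

t = np.arange(20)
for lam_k in lam:
    discrete = (1.0 - eta * lam_k) ** t        # equation A.18 iterated t times
    continuous = np.exp(-eta * lam_k * t)      # solution A.20 of the continuous approximation
    tau = lam_max / lam_k                      # time constant lambda_max / lambda_k
    print("lambda_k = %.3f   tau = %.2f   max |discrete - continuous| = %.3f"
          % (lam_k, tau, np.max(np.abs(discrete - continuous))))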

There may be a loophole here, however. Using equation A.15, the overall error can then be expressed as

E(t) = E_{min} + \frac{1}{2} \sum_{k=1}^{N} \lambda_k z_k(0)^2 \, e^{-2 \eta \lambda_k t}    (A.22)

If we are satisfied when the error is small but nonzero, say an MSE of 0.001, then components with small zk(0) contribute little to the total error and it is not necessary to wait for them to converge fully. Although small λk values cause slow convergence of the kth component, they also scale down that component's contribution to the total error.
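
The sketch below illustrates this loophole on made-up numbers: with one fast mode and one very slow mode of small λ, the total excess error of equation A.22 falls below a chosen tolerance long before the slow component itself has converged.

import numpy as np

# One fast mode and one very slow mode with a small eigenvalue (made-up values).
lam = np.array([2.0, 0.02])
z0 = np.array([1.0, 1.0])           # initial distance from w* along each eigenvector
eta = 1.0 / lam.max()
tol = 1e-3                          # "small but nonzero" error we are willing to accept

t = np.arange(2001)
z_t = z0[:, None] * (1.0 - eta * lam[:, None]) ** t     # each mode decays geometrically
E_t = 0.5 * np.sum(lam[:, None] * z_t ** 2, axis=0)     # excess error, as in equation A.22

first = np.argmax(E_t < tol)                            # first step with error under tolerance
print("error < tol after step", first)
print("slow component still has |z| =", abs(z_t[1, first]))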

An approximate expression for the distribution of eigenvalues in the case of random, uncorrelated inputs is given by Le Cun, Kanter, and Solla [92], [93] and leads to estimates of the learning time. The following points are made: Nonzero-mean inputs, correlations between inputs, and nonuniform input variances can all lead to increased spread between the minimum and maximum eigenvalues. For nonzero-mean inputs, there is an eigenvalue proportional to N, so λmax is much larger than in the zero-mean case and convergence is correspondingly slower. Subtracting the mean from the inputs suppresses the large eigenvalue and leads to faster training. Nonuniform input variances can also lead to an increased spread in the eigenvalues, which can be suppressed by rescaling the inputs. This result provides justification for centering and normalizing input variables.
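
A small numerical illustration of these points (random made-up data, not from the reference): nonzero means and unequal variances inflate the eigenvalue spread λmax/λmin of the input correlation matrix, and centering and rescaling the inputs removes most of it.

import numpy as np

rng = np.random.default_rng(0)
N, P = 10, 5000                                     # input dimension, number of samples
x = rng.normal(loc=3.0, scale=1.0, size=(P, N))     # nonzero-mean inputs (made-up data)
x[:, 0] *= 5.0                                      # one input with a much larger variance

def eig_spread(data):
    R = data.T @ data / len(data)                   # input correlation matrix E[x x^T]
    lam = np.linalg.eigvalsh(R)
    return lam.max() / lam[lam > 1e-12].min()       # lambda_max over smallest nonzero eigenvalue

x_centered = x - x.mean(axis=0)
x_scaled = x_centered / x_centered.std(axis=0)

print("raw inputs:         ", eig_spread(x))
print("centered:           ", eig_spread(x_centered))
print("centered and scaled:", eig_spread(x_scaled))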

For a multilayer network, the hidden-layer outputs can be viewed as inputs to the following layer. Sigmoid outputs are always positive and therefore have a nonzero mean, whereas symmetric (odd) functions such as tanh are at least capable of producing zero-mean outputs; this helps explain the empirical observation that tanh nonlinearities often train faster than sigmoid nodes. The result also supports the suggestion of scaling the learning rate for each node by 1/M, where M is the number of inputs to the node (see section 6.1.9).
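
The same effect can be seen by treating hidden-unit outputs as inputs to the next layer. In the sketch below (random made-up pre-activations), the correlation matrix of sigmoid outputs, which are all positive, has a much larger eigenvalue spread than that of tanh outputs, which are roughly zero mean.

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(5000, 8))                  # random pre-activations (made-up data)

sigmoid_out = 1.0 / (1.0 + np.exp(-a))          # always positive, so nonzero mean
tanh_out = np.tanh(a)                           # approximately zero mean

def eig_spread(h):
    R = h.T @ h / len(h)                        # correlation matrix of the next layer's "inputs"
    lam = np.linalg.eigvalsh(R)
    return lam.max() / lam.min()

print("sigmoid hidden outputs:", eig_spread(sigmoid_out))
print("tanh hidden outputs:   ", eig_spread(tanh_out))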