Chapter 8 - The Error Surface

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology


8.7 Gain Scaling

The typical node function can be written

    y_i = f(β_i a_i),    a_i = Σ_j w_ij x_j,

where f(·) is the node nonlinearity and β_i is a gain parameter which controls the steepness of the function at 0. For sigmoid nonlinearities,

    f(βa) = 1/(1 + e^(-βa)),

and at a = 0 the slope is β/4. Normally β = 1. Larger values increase the slope at 0 and narrow the width of the semilinear transition region. As β → ∞, the response approaches a step function.
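The β/4 slope is easy to verify numerically. A minimal sketch (function names are ours, for illustration only) estimates the slope of the gain-scaled sigmoid at the origin by central differences:

```python
import math

def sigmoid(a, beta=1.0):
    """Logistic node nonlinearity with gain beta: f(beta*a) = 1/(1 + exp(-beta*a))."""
    return 1.0 / (1.0 + math.exp(-beta * a))

def slope_at_zero(beta, eps=1e-6):
    """Central-difference estimate of the slope of sigmoid(., beta) at a = 0."""
    return (sigmoid(eps, beta) - sigmoid(-eps, beta)) / (2 * eps)

# The slope at the origin is beta/4: beta = 1 gives 0.25, beta = 8 gives 2.0.
print(slope_at_zero(1.0))   # ~0.25
print(slope_at_zero(8.0))   # ~2.0
```

This follows from f'(βa) = β f(βa)(1 − f(βa)), which equals β · (1/2)(1/2) = β/4 at a = 0.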

A number of studies, [196], [369] for example, have shown that every network with nonunity gains can be transformed into an equivalent network with unity gains by appropriate scaling of the weights (table 8.1). Further, if learning rates are also scaled appropriately, both networks will follow equivalent trajectories during training and produce equivalent outputs at the end of training.
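The weight-scaling equivalence can be checked directly. The sketch below (illustrative only; the function names are ours, not from [196] or [369]) compares a node with an explicit gain β against a unity-gain node whose incoming weights have been multiplied by β:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def node_output(weights, inputs, beta):
    """Node with explicit gain: f(beta * sum_j w_j x_j)."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(beta * a)

# Per table 8.1, scaling every incoming weight by beta and dropping the
# gain gives the same output (arbitrary example values, chosen here).
w, x, beta = [0.3, -0.7, 1.2], [1.0, 0.5, -2.0], 2.5
y_gain = node_output(w, x, beta)
y_scaled = node_output([beta * wi for wi in w], x, 1.0)
print(abs(y_gain - y_scaled))  # ~0.0
```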

Gain Control for Faster Learning  In many cases, the motivation for gain scaling is to accelerate the training process. Izui and Pentland [190] show that convergence time scales like 1/β without momentum and like 1/β² with momentum.

Table 8.1: Relationship of Node Gain, Learning Rate, and Weight Magnitude (from [369]).

                     with gain β     without gain
    Node function    φ(βx)           φ(x)
    Gain             β               1
    Learning rate    η               β²η
    Weights          w               βw

Lee and Bien [237] include parameters for the slope, magnitude, and vertical offset of the sigmoid function

    y_j = K/(1 + e^(-β a_j)) - L.

Here the gain is a fixed nonunity value. In empirical tests [9], the changes had weak effects. For 0.4 ≤ β ≤ 1.2, learning speed and generalization increased with β, but for β > 1.2, learning became unstable "suddenly and severely" with few trials converging.

Several studies [354], [366], [312], [84] attempt to optimize the gain during training, most using gradient descent on the error. Most claim increased convergence speed and fewer problems with convergence to poor local minima. As noted, gain changes are equivalent to learning rate changes in a network without gains, so optimization of gains has effects like an adaptive learning rate method. A gain change δβ is equivalent to a learning rate change from β²η to (β + δβ)²η and a weight change from βw to (β + δβ)w [196], [369].
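A minimal sketch of gradient descent on the gain, assuming a single logistic node and squared error (the single-node setting, names, and constants here are ours for illustration; the cited studies apply the same chain rule per node in a full network):

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

# For y = f(beta * a) and E = (y - t)^2 / 2, the chain rule gives
# dE/dbeta = (y - t) * f'(beta*a) * a, with f' = f * (1 - f).
def train_gain(a, t, beta=1.0, lr=2.0, steps=300):
    for _ in range(steps):
        y = f(beta * a)
        grad = (y - t) * y * (1.0 - y) * a
        beta -= lr * grad
    return beta

beta = train_gain(a=1.0, t=0.8)
print(f(beta * 1.0))  # approaches the target 0.8
```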

Gain Control to Prevent Sigmoid Saturation  Many weight initialization heuristics involve choosing an appropriate range for the initial random weights (see chapter 7). The equivalence between scaling the weights by a constant factor and introducing a gain term in the sigmoid function means that similar results can be obtained by gain scaling. In [240], [241] initial gains are chosen to avoid sigmoid saturation and its detrimental effects on learning time.

In [411], the gain is adjusted during training to prevent sigmoid saturation. If, during training, the errors are large but the back-propagated deltas are small then all the node gains are halved and the iteration repeated.
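For a single node, the heuristic might be sketched as follows (the thresholds 0.1 and 1e-3 are arbitrary values chosen here for illustration, not from [411]):

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

# Back-propagated delta for squared error at a node y = f(beta * a):
# delta = (t - y) * beta * f'(beta*a), with f' = f * (1 - f).
def delta(beta, a, t):
    y = f(beta * a)
    return (t - y) * beta * y * (1.0 - y)

beta, a, target = 8.0, 3.0, 0.0   # start deep in the saturated region
# Large error but tiny delta signals saturation: halve the gain and retry.
while abs(target - f(beta * a)) > 0.1 and abs(delta(beta, a, target)) < 1e-3:
    beta /= 2.0
print(beta)  # gain after the saturation test stops firing
```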

Gain Control for Improved Generalization  Gain scaling has been suggested as a way to improve generalization. In most cases, the idea is to start with small gains that increase gradually during training. This is said to be related to "continuation" or "homotopy" methods in numerical analysis. The intent is to force the system to fit large-scale features of the target function first by making it harder to fit small-scale details. The small initial gains make the network compute a smoother function than it otherwise would with the same weights and larger gains. Later, once large-scale features are learned, the gain is increased to let the system fit smaller features. The hope is that forcing the system to start with a smooth fit and then gradually increasing its flexibility will increase the chance of convergence to the global minimum.
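The smoothing effect is easy to see numerically: with the same weights, a small gain gives a nearly linear node response while a large gain gives a nearly step-like one. The geometric ramp below is one possible schedule of our own choosing, not a schedule from the cited work:

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def gain_schedule(epoch, beta0=0.25, growth=1.1, beta_max=4.0):
    """Illustrative geometric gain ramp, capped at beta_max."""
    return min(beta0 * growth ** epoch, beta_max)

# Same inputs, same (unit) weight: small gain -> smooth, near-linear
# response; large gain -> sharp, near-step response.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
smooth = [f(0.25 * x) for x in xs]   # early training
sharp  = [f(4.0  * x) for x in xs]   # late training
print(smooth)
print(sharp)
```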

Kruschke [228], [229], [230] describes a pruning procedure based on gain-competition (section 13.4.2). Sperduti and Starita [354] describe a similar pruning method in conjunction with the use of gain scaling for faster training.

Gain Scaling to Train Networks of Hard-Limiters  In electronic circuit implementations, it is often desirable to use hard-limiting step functions for the node nonlinearity because they can be implemented with a simple switch. One way to train such networks is to gradually shift from a sigmoid to a step function during learning. (Training must be done in off-line simulations if the hardware can't implement the sigmoid.) A linear combination of the two functions

    g(a) = λ f(a) + (1 - λ) h(a)

is used in [373]. Here a is the weighted-sum input to the node, f(a) is the sigmoid function, h(a) is a step function, and λ changes linearly from 1 to 0. A possible problem with this approach is that g(a) is still nondifferentiable at a = 0. Selection of the adjustment schedule for λ is another problem. Yu et al. [412] adjust the sigmoid gain instead, setting β = 0.5e^(-SSE), where SSE is the sum-of-squares error. Initially, when the error is large, the gain is small; later the gain increases as the error decreases. Corwin, Logar, and Oldham [86] and Yu, Loh, and Miller [412] also use gain adjustment to train networks of hard-limiters.
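The sigmoid-to-step blend can be sketched as follows (a minimal illustration of the linear combination described above; the epoch-based schedule signature is our own framing):

```python
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))     # sigmoid

def h(a):
    return 1.0 if a >= 0 else 0.0         # hard-limiting step

def g(a, lam):
    """Blend: lam = 1 gives the pure sigmoid, lam = 0 the pure step."""
    return lam * f(a) + (1.0 - lam) * h(a)

def lam_schedule(epoch, total_epochs):
    """lambda moved linearly from 1 to 0 over training."""
    return max(0.0, 1.0 - epoch / total_epochs)

print(g(0.5, 1.0))   # pure sigmoid at a = 0.5
print(g(0.5, 0.0))   # pure step: 1.0
```

Note that for any λ < 1, g inherits the step's jump at a = 0, which is the nondifferentiability problem mentioned above.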

