Many of the methods listed here are adaptive learning rate schemes. As noted in section 6.1, the often recommended learning rate of η = 0.1 is a somewhat arbitrary value that may be completely inappropriate for a given problem. For one thing, the magnitude of the gradient depends on how the targets are scaled; for example, the average error will tend to be higher in a network with linear output nodes and targets in a (-1000,1000) range than in a network with sigmoid output nodes and targets in (0,1). Also, when sum-of-squares error is used rather than mean squared error, the size of the error, and thus the best learning rate, may depend on the size of the training set. The effective learning rate is amplified by redundancies such as near duplication of training patterns and correlation between different elements of the same pattern, and by internal redundancies such as correlations between hidden unit activities. The latter depend in part on the size and configuration of the network but change as the network learns, so different learning rates may be appropriate in different parts of the network and the best values may change as learning progresses.
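The dependence of sum-of-squares error on training set size can be seen with a small numerical check. The sketch below uses made-up predictions and targets; duplicating every pattern doubles the sum-of-squares error (and hence the gradient magnitude) while leaving the mean squared error unchanged, which is why a learning rate tuned for one set size may be wrong for another.

```python
# Illustrative only: predictions and targets here are invented.
def sse(preds, targets):
    """Sum-of-squares error: grows with the number of patterns."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))

def mse(preds, targets):
    """Mean squared error: independent of the number of patterns."""
    return sse(preds, targets) / len(targets)

preds   = [0.2, 0.8, 0.5]
targets = [0.0, 1.0, 1.0]

# Duplicate every training pattern.
preds2, targets2 = preds * 2, targets * 2

print(sse(preds, targets), sse(preds2, targets2))  # SSE doubles
print(mse(preds, targets), mse(preds2, targets2))  # MSE is unchanged
```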
Given the difficulty of choosing a good learning rate a priori, it makes sense to start with a "safe" (i.e., small) value and adjust it depending on system behavior. Some methods adjust a single global learning rate while others assign separate learning rates to each unit or each weight. Methods vary, but the general idea is to increase the step size when the error is decreasing consistently and decrease it when significant error increases occur (small increases may be tolerated).
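This general idea can be sketched as a simple global rule on a one-dimensional quadratic error surface. The growth and shrink factors (1.1 and 0.5) and the 1% tolerance below are illustrative choices, not values from the text.

```python
# Sketch of a global adaptive learning rate rule on E(w) = 0.5*(w - 3)^2.
def E(w):
    return 0.5 * (w - 3.0) ** 2

def dE(w):
    return w - 3.0

w, eta = -5.0, 0.001           # start with a "safe" (small) step size
prev_err = E(w)
for _ in range(200):
    w_trial = w - eta * dE(w)  # trial gradient descent step
    err = E(w_trial)
    if err <= prev_err * 1.01: # small increases are tolerated
        w, prev_err = w_trial, err
        eta *= 1.1             # error decreasing: grow the step size
    else:
        eta *= 0.5             # significant increase: shrink and retry

print(round(w, 4))  # close to the minimum at w = 3
```

The retry-with-smaller-step branch is what keeps the rule stable: an overly bold step is rejected rather than allowed to carry the weights away from the minimum.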
In general, some care is needed to avoid instability. The best step size depends on the problem and on local characteristics of the E(w) surface (Chapter 8). Values that work well for some problems and some regions of the error space may not work well for others. It has been noted that neural networks often have error surfaces with many flat areas separated by steep cliffs. This is especially true for classification problems with small numbers of samples. As in driving a car, different speeds are reasonable in different conditions. A large step size is desirable to accelerate progress across the smooth, flat regions of the error surface, while a small step size is necessary to avoid loss of control at the cliffs. If the step size is not reduced quickly when the system enters a sensitive region, the result could be a huge weight change that throws the network into a completely different region, basically at random. Besides causing problems such as paralysis due to saturation of the sigmoid nonlinearities, this has the undesirable effect of essentially discarding previous learning and starting over somewhere else.
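One common safeguard against such runaway steps at cliffs, not discussed in the text itself, is gradient norm clipping: cap the length of any single weight update regardless of how large the gradient is. The threshold below is an illustrative choice.

```python
import math

def clip_step(grad, eta, max_step=0.1):
    """Return the update eta*grad, rescaled so its Euclidean norm
    never exceeds max_step (an illustrative threshold)."""
    step = [eta * g for g in grad]
    norm = math.sqrt(sum(s * s for s in step))
    if norm > max_step:
        step = [s * max_step / norm for s in step]
    return step

# A huge gradient, as might arise at a steep cliff in E(w):
big_grad = [300.0, -400.0]            # gradient norm 500
step = clip_step(big_grad, eta=0.1)   # raw step would have norm 50
print(step)                           # clipped step has norm 0.1
```

With the clip in place, a cliff produces a short step in the downhill direction instead of a huge, essentially random relocation of the weights.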