
Chapter 10 - Classical Optimization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

10.7 Stochastic Evaluation-Only Methods

A problem with deterministic gradient-based methods is that they can get trapped in local minima. In some cases, a few repetitions of the algorithm with different starting points may be enough to find the global minimum, but in other cases, for example, when there are many poor local minima or the basin of attraction of the global minimum is small, more powerful methods may be needed.

The advantage of stochastic methods is that every state has a nonzero chance of occurrence, so if the procedure runs long enough, the global minimum is likely to be found eventually. (This in itself is not special; pure random search will also find the global minimum if given enough time, but no one recommends it.) Under certain conditions, methods like simulated annealing and genetic algorithms can guarantee convergence (in probability) to the global minimum. This may take a very long time, however, and the guarantee is lost if the search is terminated early.

As evaluation-only methods, both simulated annealing and the genetic algorithm have the advantage of being relatively easy to implement, in the sense that the algorithms are uncomplicated and there are no complex matrix manipulations. They need very little problem-specific information. They do not require a gradient vector, so they may be used on discontinuous functions or functions described empirically rather than analytically. Under certain conditions, they will tolerate a noisy evaluation function.

10.7.1 Simulated Annealing

Simulated annealing [211], [264] is a general optimization technique based on a physical analogy. A physical system will generally settle to the lowest accessible energy state, but random thermal agitation will sometimes excite it to higher energy states. This is undesirable if we want the system to settle to the lowest possible energy, but occasionally it helps by giving the system enough energy to overcome a barrier separating it from even lower states.

The example of a liquid freezing into a solid is often used for illustration. A liquid is a disordered system with high energy. As energy is withdrawn, the liquid cools and begins to freeze. If the liquid is cooled very quickly, it tends to freeze into a disordered mass of tiny crystals with many imperfections and dislocations between grain boundaries. These imperfections are sites of internal stress that have high energy. If the system is cooled very slowly on the other hand, it tends to form large well-ordered crystals with few imperfections and low internal energy.

By cooling the system slowly, we allow many opportunities for random thermal agitation to rearrange atoms into more stable low-energy configurations. Because these states are more stable, they tend to survive longer. Although the probability of entering a particular stable state by random chance may be low, the probability of leaving is even lower, so the system as a whole tends to a more ordered, lower-energy state.

According to the Boltzmann statistics for a system in thermal equilibrium at temperature T, the probability of a state s with energy E_s is

$$P(s) = \frac{1}{Z}\, e^{-E_s / kT} \qquad (10.20)$$

where Z is a normalization constant and k is Boltzmann's constant. At very high temperatures, the exponent is small and all states are almost equally likely. At intermediate temperatures, there is a strong bias toward lower energy states, but higher energy states still have significant probability. At very low temperatures, low energy states are much more likely than higher energy states.
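
To make the temperature dependence concrete, here is a short Python snippet (our illustration, with k = 1 and arbitrary energy levels, neither of which comes from the text) that evaluates equation 10.20 at high, intermediate, and low temperatures:

```python
import numpy as np

def boltzmann_probs(energies, T, k=1.0):
    """Equation 10.20: P(s) = exp(-E_s / kT) / Z, with Z normalizing the sum to 1."""
    w = np.exp(-np.asarray(energies) / (k * T))
    return w / w.sum()

energies = [0.0, 1.0, 2.0]        # arbitrary illustrative energy levels
for T in (100.0, 1.0, 0.01):      # high, intermediate, low temperature
    print(f"T={T:6.2f}  P={boltzmann_probs(energies, T)}")
# T=100: nearly uniform; T=1: biased toward E=0 but higher-energy states still
# have significant probability; T=0.01: essentially all mass on the lowest state.
```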

For optimization, an analogy is made between the system energy and the error function; parameter vectors with low errors correspond to system states with low energy. New candidate vectors are generated by modifying the current vector by some random process, for example, by adding noise to the current search point; different noise distributions give different convergence properties. If the new point has a lower error (energy), it is accepted as the new search point. If the new point increases the error (energy) by an amount ΔE, it is accepted with probability

$$P = e^{-\Delta E / T}.$$

That is, there is a chance P that the higher-error vector will be accepted as the new search point and a chance (1 - P) that it will be rejected, in which case new candidates will be generated from the original point. At high temperatures, the exponent is near zero and almost all proposed changes are accepted. At low temperatures, the probability decreases very quickly as ΔE increases; changes that increase the error have very low probability of being accepted and the trajectory approaches gradient descent.
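
A minimal sketch of this generate-and-accept loop in Python, under our own illustrative assumptions (Gaussian perturbations, the Boltzmann constant absorbed into T, and an arbitrary toy error function; cooling schedules are discussed below):

```python
import math
import random

def anneal(error, x0, schedule, steps=10_000, step_size=0.1):
    """Simulated annealing sketch: perturb the current point with Gaussian
    noise, then accept or reject using the probability rule above."""
    x, e = x0, error(x0)
    for t in range(1, steps + 1):
        T = max(schedule(t), 1e-12)           # guard against T underflowing to 0
        x_new = x + random.gauss(0.0, step_size)
        delta = error(x_new) - e
        # Always accept improvements; accept an increase with prob exp(-delta/T).
        if delta <= 0 or random.random() < math.exp(-delta / T):
            x, e = x_new, e + delta
    return x, e

# Toy multimodal error surface; its global minimum is near x = -0.3.
error = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
x_min, e_min = anneal(error, x0=3.0, schedule=lambda t: 10.0 * 0.95 ** t)
```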

The advantage of stochastic algorithms like simulated annealing is that every state has a nonzero chance of occurrence so if the process continues long enough, every state, including the global minimum, will be visited eventually. A purely random search would immediately hop to another state after visiting the global minimum, but simulated annealing has a bias to lower energy states so it is likely to remain near or revisit the global minimum often.

The promise of the method is that if the system is cooled slowly enough, it will eventually converge to the global minimum (with probability 1). The catch is that this may take a very, very long time, and all guarantees are lost if the system is cooled too quickly (quenched) or stopped too soon. It has been shown that a temperature reduction schedule inversely proportional to the logarithm of time will guarantee convergence (in probability) to a global minimum [135]

$$T(t) = \frac{T_0}{\log(1 + t)} \qquad (10.21)$$

Because the logarithm increases slowly with t, this can take a very long time. In practice, this schedule takes too long and it is often more efficient to repeat the algorithm a number of times using a faster schedule (sacrificing claims of provable convergence in the process).
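
For concreteness, here is a sketch of the two kinds of schedule side by side (natural logarithm assumed; T0 and alpha are illustrative values, not from the text):

```python
import math

def log_schedule(t, T0=10.0):
    # Equation 10.21: T(t) = T0 / log(1 + t).  Provably convergent but glacial:
    # cooling from T0 = 10 down to T = 0.1 takes t = e^100 - 1 steps.
    return T0 / math.log(1 + t)

def geometric_schedule(t, T0=10.0, alpha=0.95):
    # A common practical alternative: multiply T by a constant factor each step.
    # Reaches T = 0.1 in about 90 steps, but the convergence guarantee is lost.
    return T0 * alpha ** t
```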

10.7.2 Genetic Algorithms

The genetic algorithm [173], [141] is a general optimization method based on an analogy to the evolution of species in nature. (Chapter 11 describes the algorithm in more detail; the following remarks outline the main points.) The idea is that individuals in a large population have varying traits which affect their reproductive success in the given environment. Successful individuals live long enough to mate and pass their traits to the next generation. Offspring inherit traits from successful parents so they also have a good chance of being successful. Over many generations, the population adapts to its environment; disadvantageous traits become rare and the average fitness tends to increase over time.

There are a number of evolutionary algorithms based on similar ideas. The genetic algorithm includes effects of mating: combination of traits from two successful parents to yield offspring that are similar to, but slightly different from, either parent. There is an element of randomness so it has stochastic properties similar to simulated annealing, but it acts on an entire population rather than a single individual.

For optimization purposes, individuals are represented by a bit-string that encodes the parameters of a particular solution and their fitness depends on the objective function. Offspring are generated from two parents by combining their bit-strings in various ways.
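
As an illustrative sketch of these operations (the 16-bit encoding, one-point crossover, and bit-flip mutation rate are our assumptions; chapter 11 describes the operators actually used):

```python
import random

BITS = 16   # bits per encoded parameter (illustrative choice)

def encode(x, lo=-1.0, hi=1.0):
    """Quantize a real parameter in [lo, hi] into a BITS-wide bit-string."""
    level = round((x - lo) / (hi - lo) * (2**BITS - 1))
    return format(level, f"0{BITS}b")

def decode(bits, lo=-1.0, hi=1.0):
    """Map a bit-string back to a real parameter in [lo, hi]."""
    return lo + int(bits, 2) / (2**BITS - 1) * (hi - lo)

def crossover(a, b):
    """One-point crossover: splice the two parent strings at a random cut."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, rate=0.01):
    """Flip each bit independently with a small probability."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if random.random() < rate else b for b in bits)
```

A full genome would concatenate one such string per network weight; fitness then comes from decoding the genome and evaluating the objective function.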

The principal advantage of the method is that it needs very little problem-specific information: just a function to evaluate parameter sets and assign fitness scores. In particular, it does not require gradient information and so may be used on discontinuous functions or functions that are described empirically rather than analytically. The mating selection and crossover operations are already somewhat random so, assuming appropriate parameters, the algorithm will tolerate moderate amounts of noise in the evaluation function. The crossover operation is nonlinear so the algorithm is not necessarily a hill-climbing method and is not particularly bothered by local maxima (minima). It is also easily parallelizable.

The principal disadvantage of the method is the amount of processing needed to evaluate a large population of candidates and converge to a solution over many generations. There are claims of convergence to a global maximum of the fitness function, but with small populations and aggressive culling of less successful solutions, convergence to a local maximum is possible. (This is similar to the case of cooling too quickly in simulated annealing.) Also, there are many parameters to be selected (population size, mutation rate, crossover method, fitness scaling, etc.) and it is not obvious how these affect the convergence properties. At this point, it is unclear whether the algorithm is better than other methods such as simulated annealing.

A difficulty with the method for neural network optimization is the problem of incompatible genomes mentioned in section 11.3.1. The problem is that two successful individuals do not always yield a successful offspring when they mate; their bit-strings might represent points on two different local maxima and the combination might fall in a valley between them, for example. Peculiarities of the typical neural network structure aggravate this problem. Variations have been proposed to avoid the problem but they complicate the algorithm and detract from its advantage of simplicity.