Evaluation-only methods search for minima by comparing values of the objective function at different points. In particular, they do not use gradient information. The basic idea is to evaluate the function at points around the current search point, looking for lower values. Algorithms differ in how test points are generated.
Simplicity is the main advantage of evaluation-only methods. Inefficiency is their main disadvantage; for smooth functions it usually pays to use gradient information if it is available. For MLPs with continuous node nonlinearities, the gradient is easily calculated in the back-propagation step. In computer simulations, the incremental cost (in time and complexity) is approximately the same as the cost of a forward evaluation.
Evaluation-only methods are most useful where internal system details are inaccessible (as in some integrated circuit implementations), where the transfer function of the system is not differentiable, or in complex systems where derivatives are difficult to calculate correctly. In spite of their inefficiency, some evaluation-only methods do have global convergence properties lacking in gradient-based methods.
One of the simplest search methods is to take small steps along each coordinate direction separately, varying one parameter at a time and checking whether the error decreases. If a step in one direction increases the error, then a step in the opposite direction should decrease it (if the step size is small enough). After N steps, each of the N coordinate directions will have been tested. If none of the steps decreases the error, the step size may be too big relative to the curvature of the error surface and should be reduced.
Although this is simple to implement, it is very inefficient in high-dimensional spaces. For a neural network, each step would require evaluation of all the training patterns yet change only a single weight. There is also a small chance of getting stuck at the bottom of an obliquely oriented 'ravine' in the error surface: the error decreases along the axis of the ravine but increases along each of the coordinate directions, so no more steps will be taken. Usually, however, the point is slightly off-center, so the step size will merely shrink to a very small value, resulting in slow convergence.
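To make the procedure concrete, here is a minimal Python sketch of the one-parameter-at-a-time search; the quadratic "ravine" test function, initial step size, and halving schedule are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def coordinate_search(f, x0, step=0.5, min_step=1e-6, max_iter=10000):
    """Vary one parameter at a time, keeping any step that lowers the error."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):                 # test each coordinate direction
            for sign in (+1.0, -1.0):           # forward step, then reverse step
                trial = x.copy()
                trial[i] += sign * step
                ft = f(trial)
                if ft < fx:                     # keep the step if the error drops
                    x, fx = trial, ft
                    improved = True
                    break
        if not improved:                        # no direction helped: shrink the step
            step *= 0.5
            if step < min_step:
                break
    return x, fx

# Illustrative use on an obliquely oriented quadratic "ravine" (assumed test function).
error = lambda w: (w[0] + w[1])**2 + 0.01 * (w[0] - w[1])**2
print(coordinate_search(error, [2.0, -1.0]))
```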
The Hooke-Jeeves pattern search method [311] greatly accelerates convergence in this situation by remembering previous steps and attempting new steps in the same direction. An exploratory move consists of a step in each of the N coordinate directions, ending up at a new base point after N steps. A pattern move consists of a step along the line from the previous base point to the new one; this direction may be oblique, in general. The result of the pattern move becomes a temporary base point for a new exploratory move. If that exploratory move results in a lower error than the previous base point, its end point becomes the new base point. If it does not decrease the error, the temporary base point is discarded and an exploratory move is done around the current base point. If this exploratory search also fails, the step size is reduced. The search is halted when the step size becomes sufficiently small.
There are many heuristic variations, but simplicity is the main advantage of the method; once that is lost, more sophisticated methods are preferable.
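The following Python sketch shows the exploratory-move/pattern-move cycle under the same assumptions as the previous sketch; the helper name explore, the step schedule, and the test function are illustrative choices, and practical implementations add the heuristic refinements just mentioned.

```python
import numpy as np

def explore(f, x, fx, step):
    """Exploratory move: perturb each coordinate in turn, keeping improvements."""
    x = x.copy()
    for i in range(len(x)):
        for sign in (+1.0, -1.0):
            trial = x.copy()
            trial[i] += sign * step
            ft = f(trial)
            if ft < fx:
                x, fx = trial, ft
                break
    return x, fx

def hooke_jeeves(f, x0, step=0.5, min_step=1e-6):
    base = np.asarray(x0, dtype=float)
    f_base = f(base)
    while step > min_step:
        x, fx = explore(f, base, f_base, step)      # explore around the base point
        if fx < f_base:
            while fx < f_base:
                # Pattern move: step along the line from the old base point to the
                # new one, then explore around that temporary point.
                temp = x + (x - base)
                base, f_base = x, fx
                x, fx = explore(f, temp, f(temp), step)
        else:
            step *= 0.5                             # exploration failed: reduce step
    return base, f_base

error = lambda w: (w[0] + w[1])**2 + 0.01 * (w[0] - w[1])**2
print(hooke_jeeves(error, [2.0, -1.0]))
```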
Another direct search method, the simplex search, is based on the idea of using a population of points to determine the local shape of the error surface. The basic steps are:
1. Create a simplex (a regular convex polytope) in N dimensions and evaluate the function at each of the N + 1 vertices.
2. Identify the vertex with the highest error.
3. Reflect that vertex across the centroid of the other vertices. This tends to be a step downhill but not, in general, parallel to the gradient.
4. Evaluate the function at the new point. If the error is lower, go to step 2. Otherwise, reject the new point and try reflecting less far across the centroid (i.e., move the reflected point in toward the centroid).
5. Go to step 2.
There are additional rules to handle some special cases. Eventually, the points straddle a local minimum and the vertices converge to a single point as the size of the simplex is reduced. One problem with the basic algorithm is that the simplex can become very small if it has to "squeeze through" a tight area, after which convergence is very slow. The Nelder-Mead variation introduces rules for expanding and contracting the simplex to accelerate convergence and close in on the final minimum.
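The sketch below follows steps 1 through 5 literally, with one assumed special-case rule (pulling the worst vertex toward the centroid when even a shortened reflection fails); it deliberately omits the Nelder-Mead expansion and contraction rules.

```python
import numpy as np

def basic_simplex(f, x0, size=1.0, tol=1e-10, max_iter=5000):
    """Reflection-only simplex search following steps 1-5 above."""
    n = len(x0)
    # Step 1: an initial simplex of N + 1 vertices around x0.
    verts = np.vstack([x0] + [x0 + size * np.eye(n)[i] for i in range(n)])
    vals = np.array([f(v) for v in verts])
    for _ in range(max_iter):
        worst = np.argmax(vals)                                 # step 2
        centroid = (verts.sum(axis=0) - verts[worst]) / n
        alpha = 1.0
        while alpha > 1e-3:
            new = centroid + alpha * (centroid - verts[worst])  # step 3: reflect
            f_new = f(new)
            if f_new < vals[worst]:                             # step 4: accept
                verts[worst], vals[worst] = new, f_new
                break
            alpha *= 0.5                                        # reflect less far
        else:
            # Assumed special-case rule: pull the worst vertex toward the centroid.
            verts[worst] = 0.5 * (verts[worst] + centroid)
            vals[worst] = f(verts[worst])
        if np.ptp(vals) < tol:                                  # vertices have converged
            break
    best = np.argmin(vals)
    return verts[best], vals[best]

error = lambda w: (w[0] + w[1])**2 + 0.01 * (w[0] - w[1])**2
print(basic_simplex(error, np.array([2.0, -1.0])))
```

In practice one would usually call a library routine instead, for example SciPy's minimize(error, x0, method='Nelder-Mead'), which implements the full Nelder-Mead expansion and contraction logic.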
The simplex method is relatively simple to implement. It is said to be reasonably efficient and tolerant of noise in the objective function. (Powell's method may be more efficient for many problems, however, although it is not as easy to implement.) Like other evaluation-only methods, it can be used on nondifferentiable functions. It may not be suitable for large neural networks, though, because it requires storage of approximately N² values (N + 1 vertices, each specified by a vector of N weights).
In a neural network training example [103], simplex search was slow at first, but eventually reduced the error to a lower value than back-propagation. Quasi-Newton and Levenberg-Marquardt methods achieved even lower errors, however.
Powell's conjugate direction method, sometimes called the direction set method, uses information from previous steps in order to choose the next search direction. A quadratic error function E = xᵀHx is assumed, where x is the vector of parameters to be optimized.
A set of vectors d₁, d₂, …, dᵣ, r ≤ N, are mutually conjugate with respect to a symmetric N × N matrix H if they are linearly independent and [311]

dᵢᵀHdⱼ = 0 for all i ≠ j.
Conjugate directions are useful because an N-dimensional quadratic function determined by H can be minimized by performing independent one-dimensional searches along any N mutually conjugate directions. The directions are uncoupled in the sense that minimizing along direction dᵢ does not undo any gains that were obtained by minimizing along previous directions. The reader is referred to an optimization text, for example, Reklaitis, Ravindran, and Ragsdell [311], for further details.
It would seem to be necessary to evaluate H, the Hessian matrix of second derivatives, in order to find a set of conjugate directions, but in fact it is not. Powell's method exploits the parallel subspace property of quadratic functions, defined as follows, to find a set of conjugate directions without evaluating the Hessian. Given a quadratic function xᵀHx, a direction d, and two points x₁ and x₂, if one does a line search from x₁ along direction d to obtain a minimum y₁ and another line search from x₂ along direction d to obtain y₂, then the direction (y₂ − y₁) is H-conjugate to d [311] (see figure 10.1).
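The parallel subspace property is easy to verify numerically. In the sketch below, the matrix H, the direction d, and the points x₁ and x₂ are arbitrary illustrative choices; the closed-form line search comes from minimizing (x + td)ᵀH(x + td) over t.

```python
import numpy as np

# Numerical check of the parallel subspace property for E(x) = x^T H x.
# H, d, x1, and x2 are arbitrary illustrative choices, not from the text.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])          # symmetric positive definite
d = np.array([1.0, 0.0, 0.0])
x1 = np.array([1.0, 2.0, -1.0])
x2 = np.array([-3.0, 0.5, 4.0])

def line_min(x, d):
    """Exact line search: minimize E(x + t d) = (x + t d)^T H (x + t d) over t."""
    t = -(d @ H @ x) / (d @ H @ d)
    return x + t * d

y1 = line_min(x1, d)
y2 = line_min(x2, d)

# (y2 - y1) should be H-conjugate to d, i.e. (y2 - y1)^T H d = 0.
print((y2 - y1) @ H @ d)   # ~0 up to floating-point roundoff
```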
In N dimensions, Powell's method does N line searches to identify each conjugate direction, using only function evaluations. If the error function actually is quadratic, it will find the minimum after N² (exact) line searches. Of course, more searches will usually be needed for general nonlinear functions and inexact calculations, but convergence is still quadratic. It is said to be as reliable as, and usually much more efficient than, other direct search methods.
For neural networks in which the gradient is easily obtained, the conjugate gradient method is preferred since it is much more efficient (O(N) line searches for a quadratic function). In either case, for general nonlinear functions, the quadratic approximation is only reasonable near a minimum and other methods may be better to reach the general neighborhood in the first place.
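As a rough illustration of the cost difference, the following sketch (assuming SciPy and a toy two-parameter error function, neither of which comes from the text) runs the evaluation-only Powell method and the gradient-based conjugate gradient method from a library and compares their function-evaluation counts.

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective (illustrative; in practice this would be the network error
# as a function of the weight vector).
def f(w):
    return (w[0] + w[1])**2 + 0.01 * (w[0] - w[1])**2

def grad(w):
    return np.array([2*(w[0] + w[1]) + 0.02*(w[0] - w[1]),
                     2*(w[0] + w[1]) - 0.02*(w[0] - w[1])])

w0 = np.array([2.0, -1.0])

# Powell: evaluation-only, builds conjugate directions from line searches.
res_powell = minimize(f, w0, method='Powell')

# Conjugate gradient: uses the gradient, typically far fewer evaluations.
res_cg = minimize(f, w0, jac=grad, method='CG')

print(res_powell.nfev, res_cg.nfev)   # function-evaluation counts
```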