The genetic algorithm is a general-purpose optimization algorithm with applications beyond neural network design. It can be applied to network design in a number of ways, from simply determining a few weights in a predetermined network to choosing the entire architecture: the number of layers and nodes, the connections, the weight values, and the node functions. Because it does not need gradient information, it can be used on networks with binary units and/or quantized weights.
Several things should be considered in applying the algorithm to neural networks. One of the more important factors is the representation: how the problem parameters are encoded in the bit string. Neural networks often have many weights, so the parameter string may be long. Because crossover breaks the string, it is desirable to put related parameters near each other in the bit string as much as possible. Network weights tend to be strongly interrelated, however, with the values of weights in one layer depending on the values of many weights in other layers. This means that the useful schemata often have high order and are easily broken by mutation and crossover. This interdependence leads some researchers to eliminate the crossover operation; this is not typical, however.
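As an illustration of one possible encoding (the 8-bit fixed-point format and the weight range below are arbitrary choices, not from any particular implementation), each weight can be quantized to an integer level whose bits are concatenated into the parameter string; related weights, such as those feeding the same hidden unit, would then be placed in adjacent slots:

```python
# Illustrative sketch of one way to encode weights in a bit string; the
# 8-bit fixed-point format and the weight range are arbitrary assumptions.

BITS = 8                    # bits per weight (assumption)
W_MIN, W_MAX = -4.0, 4.0    # representable weight range (assumption)

def encode(weights):
    """Quantize each weight to an integer level and concatenate the bits."""
    bits = []
    for w in weights:
        level = round((w - W_MIN) / (W_MAX - W_MIN) * (2 ** BITS - 1))
        level = max(0, min(2 ** BITS - 1, level))
        bits.extend((level >> i) & 1 for i in reversed(range(BITS)))
    return bits

def decode(bits):
    """Inverse of encode, up to quantization error."""
    weights = []
    for k in range(0, len(bits), BITS):
        level = 0
        for b in bits[k:k + BITS]:
            level = (level << 1) | b
        weights.append(W_MIN + level * (W_MAX - W_MIN) / (2 ** BITS - 1))
    return weights
```

With 8 bits per weight, even a modest network of a few hundred weights yields a string of a few thousand bits, which illustrates why string length and parameter placement matter.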
The following paragraphs describe some applications of the algorithm to neural network design.
Training and Evaluation Some implementations include a small amount of training in the fitness evaluation function. The idea is that one set of weights, A, might be worse than another set, B, but A might be much better than B after a small amount of training because of differences in the local terrain of the error space. In this case, A is closer to the solution than B even though its initial performance is worse. Without training, the algorithm learns only the fitness of the initial point in the space. With training, each unit attempts to find the best spot reachable from its initial position; its fitness reflects the quality of a small area around the initial point, so the algorithm searches more of the parameter space.
As in nature, the fitness of each unit is evaluated based on its performance after adaptation, but reproduction transmits the original parameters rather than the adapted parameters. The performance of a unit (in a sense) reflects the nearness of good solutions, and reproduction will tend to produce more offspring near good units, so the algorithm will tend to converge to good solutions.
Only a small amount of training is allowed, typically 5 to 10% of what would be required to train a random net to completion. If all units were trained to convergence, they could converge to the same or equivalent solutions and all units would have the same reproductive fitness regardless of the quality of their genes. It has also been found useful to allow each unit a different number of training cycles. Over a number of generations, this effectively measures the sensitivity of the genes to learning and rewards units which improve quickly with a small amount of additional learning.
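A minimal sketch of this style of evaluation, not any particular published implementation: fitness is measured after a few gradient steps, but reproduction copies the original genome rather than the trained weights. The quadratic error surface, learning rate, and population parameters are all made-up choices; on this toy surface brief training rescales every loss by the same factor, whereas on a rugged error surface it would change the ranking of the units, which is the point of the technique.

```python
import random

random.seed(0)
TARGET = [1.0, -2.0]  # optimum of the toy error surface (invented)

def loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))

def train(w, steps, lr=0.1):
    """A few gradient-descent steps on a copy of the genome."""
    w = list(w)
    for _ in range(steps):
        w = [wi - lr * 2 * (wi - ti) for wi, ti in zip(w, TARGET)]
    return w

def fitness(genome, steps=5):
    return -loss(train(genome, steps))  # evaluate *after* brief training

pop = [[random.uniform(-3, 3) for _ in range(2)] for _ in range(20)]
for _ in range(30):
    pop.sort(key=fitness, reverse=True)      # rank by post-training fitness
    parent = random.choice(pop[:5])          # parents drawn from the best ranks
    child = [wi + random.gauss(0, 0.3) for wi in parent]  # mutate ORIGINAL genes
    pop[-1] = child                          # offspring replaces the worst unit
best = min(pop, key=loss)
```

Note that `train` never writes back into `pop`; the adapted weights affect only the fitness ranking, mirroring the distinction between what is learned and what is inherited.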
Subpopulations In nature, isolation of subpopulations is one factor that contributes to the development of new species. The use of several subpopulations with limited mixing allows aggressive optimization within each subpopulation while preserving diversity in the total population. This is said to be more effective in preserving diversity than simply increasing the population size. One-at-a-time reproduction is used with fitness based on rank. The offspring do not replace their parents; they replace the worst units, so good solutions are certain to survive.
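A toy sketch of these rules (one-at-a-time, rank-based reproduction; offspring replace the worst unit; occasional migration between subpopulations). The scalar fitness function, population sizes, mutation scale, and migration schedule are all arbitrary choices for illustration:

```python
import random

random.seed(1)

def fitness(x):
    return -(x - 2.0) ** 2             # toy objective, maximized at x = 2

def step(pop):
    """One-at-a-time, rank-based reproduction: offspring replaces the worst."""
    pop.sort(key=fitness, reverse=True)
    parent = pop[random.randrange(len(pop) // 2)]  # bias toward higher ranks
    child = parent + random.gauss(0, 0.2)
    pop[-1] = child                    # worst unit replaced; best always survives

islands = [[random.uniform(-5, 5) for _ in range(10)] for _ in range(4)]
for gen in range(200):
    for pop in islands:
        step(pop)
    if gen % 50 == 49:                 # limited mixing between subpopulations
        a, b = random.sample(range(len(islands)), 2)
        islands[b][-1] = max(islands[a], key=fitness)  # migrant replaces worst
best = max((x for pop in islands for x in pop), key=fitness)
```

Because only the worst unit is ever replaced, each subpopulation's best solution is never lost, and the infrequent migration keeps the islands from collapsing onto a single solution too quickly.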
GA Pruning of Neural Networks The genetic algorithm can also be used to prune neural networks. Typically, the parameter string contains a bit for each weight in the original network; the bit is 1 if the connection is retained and 0 if it is pruned. The result of pruning is networks that are smaller, learn faster, and may generalize better.
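A sketch of this mask representation with an invented fitness function that trades an error proxy against network size; the weight values and size penalty are made up, and a simple bit-flip hill climb stands in for the full genetic algorithm (a real fitness function would measure network error on data):

```python
import random

random.seed(2)
weights = [0.9, -0.05, 1.7, 0.02, -1.2, 0.6]  # weights of a "trained" net (made up)

def apply_mask(weights, mask):
    """Bit 1 keeps a connection, bit 0 prunes it (sets the weight to zero)."""
    return [w if b else 0.0 for w, b in zip(weights, mask)]

def fitness(mask, penalty=0.05):
    """Toy fitness: pruning a weight costs its magnitude; keeping one costs `penalty`."""
    pruned = apply_mask(weights, mask)
    error = sum(abs(w - p) for w, p in zip(weights, pruned))
    return -(error + penalty * sum(mask))

full = [1] * len(weights)      # the unpruned network
best = full
for _ in range(200):
    cand = list(best)
    cand[random.randrange(len(cand))] ^= 1   # flip one connection bit
    if fitness(cand) > fitness(best):
        best = cand
```

Under this fitness, masks that drop near-zero weights score better than the full network, so the search is pushed toward smaller networks with little added error.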
Genetic Programming A related algorithm is genetic programming. Koza and Rice describe an algorithm to determine both the weights and the connection architecture of a neural net. The network response function is encoded in a tree-structured LISP expression (figure 11.4) and the crossover operation exchanges subtrees between two parents. The major use of this representation has been outside of neural networks. It has been used for evolutionary adaptation of functions or programs whose statements have syntactic structure; it is useful because the subtree crossover operation preserves syntactic correctness.
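A sketch of subtree crossover on expression trees represented here as nested tuples rather than LISP expressions (the trees and operator names are invented); swapping whole subtrees between parents always yields syntactically valid offspring, which is the main attraction of the representation:

```python
import random

random.seed(3)

def subtrees(t, path=()):
    """Yield every (path, subtree) pair; a path is a tuple of child indices."""
    yield path, t
    if isinstance(t, tuple):
        for i, child in enumerate(t[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(t, path, new):
    """Return a copy of t with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return t[:i] + (replace(t[i], path[1:], new),) + t[i + 1:]

def crossover(a, b):
    """Swap randomly chosen subtrees between two parent trees."""
    pa, sa = random.choice(list(subtrees(a)))
    pb, sb = random.choice(list(subtrees(b)))
    return replace(a, pa, sb), replace(b, pb, sa)

def valid(t):
    """A tree is a leaf name or a (operator, left, right) triple."""
    return isinstance(t, str) or (isinstance(t, tuple) and len(t) == 3
                                  and isinstance(t[0], str)
                                  and valid(t[1]) and valid(t[2]))

p1 = ('+', ('*', 'x', 'y'), 'z')
p2 = ('-', 'u', ('+', 'v', 'w'))
c1, c2 = crossover(p1, p2)
```

However the crossover points fall, the offspring remain well-formed trees, in contrast to bit-string crossover, which can split an encoded parameter in the middle.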
A difficulty with application of the standard genetic algorithm to neural network optimization is the problem of incompatible genomes. In general, two successful individuals do not always yield a successful offspring when they mate; their bit-strings might represent points on two different local maxima and the combination might fall in a valley between them, for example. In a neural network with H hidden units, there are H! equivalent solutions obtainable by shuffling the order of the units in the hidden layer. There are another 2^H equivalent solutions obtainable by changing the signs of all weights into and out of any combination of hidden units (which leaves their function unchanged). Thus, two networks might compute identical input-output functions using different internal representations, in which case the network obtained by mixing their weights would be very different from either and probably a poor solution. Neural networks also tend to be underconstrained; there are often many different input-output functions that satisfy the objective function equally well, so equally successful networks may not even compute a similar input-output function.
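These hidden-unit symmetries can be checked numerically; the sketch below uses an arbitrarily chosen 2-input, 2-hidden-unit, 1-output tanh network with no biases, for brevity:

```python
import math

def net(x, W_in, w_out):
    """Tiny one-hidden-layer tanh network (no biases, for brevity)."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in W_in]
    return sum(v * h for v, h in zip(w_out, hidden))

x = [0.3, -0.7]
W_in = [[1.0, -2.0], [0.5, 0.8]]   # input-to-hidden weights (arbitrary)
w_out = [1.5, -0.4]                # hidden-to-output weights (arbitrary)
base = net(x, W_in, w_out)

# Permuting the hidden units leaves the output unchanged (H! symmetries).
perm = net(x, [W_in[1], W_in[0]], [w_out[1], w_out[0]])

# Flipping the signs of all weights into and out of one tanh unit also
# leaves the output unchanged, since tanh(-a) = -tanh(a) (2^H symmetries).
flip = net(x, [[-w for w in W_in[0]], W_in[1]], [-w_out[0], w_out[1]])
```

Each symmetric variant computes exactly the same function, yet its bit-string encoding can differ in nearly every position, which is precisely why crossover between two such networks is likely to produce a poor offspring.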
The effect is that little progress is made until a significant fraction of the population computes similar functions with similar internal representations. Once a cluster of compatible networks develops, they have a higher probability of mating successfully and may grow to dominate the population at the expense of possibly better isolated solutions. The system then does local hill-climbing and is unlikely to explore very different solutions. This produces a strong tendency to converge to local maxima, since the particular solution that comes to dominate initially is a nearly random selection. In theory, this does not have to happen, but it is likely if parameters are chosen for fast convergence (e.g., small populations and aggressive culling). Variations have been proposed to avoid these problems, but they complicate the algorithm.