Chapter 8 - The Error Surface

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

8.3 Weight-Space Symmetries

Consider a network with two or more nodes in a hidden layer. The network output is unchanged when all the weights into and out of two hidden nodes, i and j, are swapped; node i computes what node j used to and vice versa, so the effect on the rest of the net is unchanged. Equivalently, the node indexes could simply be swapped, or the locations in the layer exchanged. There are H! permutations of the positions of H nodes in a hidden layer, so this gives H! different weight vectors that produce equivalent input-output network functions. An immediate consequence is that the error surface will not have a single global minimum (unless the minimum is at the zero weight vector, which these symmetries map to itself); there will be many points with equally small errors.
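The permutation symmetry is easy to verify numerically. The following sketch (the layer sizes, random weights, and the use of numpy are arbitrary choices for illustration) builds a small single-hidden-layer tanh network and checks that reordering the hidden nodes, together with the weights into and out of them, leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small single-hidden-layer tanh network: x -> tanh(x @ W1 + b1) @ W2 + b2
n_in, H, n_out = 3, 4, 2
W1 = rng.standard_normal((n_in, H))   # weights into the hidden layer (one column per hidden node)
b1 = rng.standard_normal(H)           # hidden biases
W2 = rng.standard_normal((H, n_out))  # weights out of the hidden layer (one row per hidden node)
b2 = rng.standard_normal(n_out)

def forward(x, W1, b1, W2, b2):
    return np.tanh(x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((5, n_in))

# Permute the hidden nodes: reorder the columns of W1, the entries of b1,
# and the rows of W2 with the same permutation.
perm = rng.permutation(H)
y_original = forward(x, W1, b1, W2, b2)
y_permuted = forward(x, W1[:, perm], b1[perm], W2[perm, :], b2)

print(np.allclose(y_original, y_permuted))  # True: the network function is unchanged
```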

Another symmetry results because the tanh function is odd, f(-x) = -f(x). An equivalent response can be obtained by changing the sign of all the weights into and out of a hidden node: changing the sign of the input weights simply changes the sign of the node output, and the effect on the following layer can be compensated by also changing the sign of the output weights. Similar symmetries exist for networks of sigmoid nodes because sigmoid(x) = 1 - sigmoid(-x), provided the biases of nodes in the following layer are adjusted when the signs are flipped.
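A similar check, under the same illustrative setup as above, confirms the sign-flip symmetry: for a tanh node, negating all weights into and out of it changes nothing; for a sigmoid node, the outgoing weights are negated and the old outgoing weights are added to the following layer's biases.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, H, n_out = 3, 4, 2
W1 = rng.standard_normal((n_in, H))
b1 = rng.standard_normal(H)
W2 = rng.standard_normal((H, n_out))
b2 = rng.standard_normal(n_out)
x = rng.standard_normal((5, n_in))

# tanh case: negate all weights into and out of hidden node j (bias included).
j = 2
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[:, j] *= -1          # input weights of node j
b1f[j] *= -1             # its bias
W2f[j, :] *= -1          # its output weights
y  = np.tanh(x @ W1  + b1 ) @ W2  + b2
yf = np.tanh(x @ W1f + b1f) @ W2f + b2
print(np.allclose(y, yf))        # True

# sigmoid case: sigmoid(x) = 1 - sigmoid(-x), so flipping node j's input weights
# turns its output s into 1 - s; compensate by negating its output weights and
# adding the old output weights to the following layer's biases.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
b2s = b2 + W2[j, :]              # adjusted following-layer biases
ys  = sigmoid(x @ W1  + b1 ) @ W2  + b2
ysf = sigmoid(x @ W1f + b1f) @ W2f + b2s
print(np.allclose(ys, ysf))      # True
```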

Any combination of the hidden nodes may have their signs flipped, so there are 2^H possibilities.

Together, these give M = 2^H H! different weight vectors that produce identical input-output functions. For every weight vector that produces a particular input-output function, there are at least 2^H H! - 1 "twins" that produce equivalent responses. H does not have to be large for this to be a huge number; for example, for H = 10, M ≈ 3.7 billion.
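The H = 10 figure can be checked directly:

2^{10} \cdot 10! = 1024 \cdot 3{,}628{,}800 = 3{,}715{,}891{,}200 \approx 3.7 \times 10^9.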

For networks with more than one hidden layer, the number of symmetries is a product of similar terms for each layer [78]:

M = \prod_l 2^{H_l} H_l!    (8.1)

where l indexes the hidden layers and H_l is the number of nodes in layer l.
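In code, eq. (8.1) is just a product over the hidden layers; the function name below is invented for illustration.

```python
import math

def num_symmetries(hidden_sizes):
    """Number of weight vectors giving the same input-output map, eq. (8.1):
    the product over hidden layers of 2**H_l * factorial(H_l)."""
    M = 1
    for H in hidden_sizes:
        M *= 2**H * math.factorial(H)
    return M

print(num_symmetries([10]))     # 3715891200, the single-layer H = 10 example
print(num_symmetries([5, 5]))   # (2**5 * 5!)**2 = 14745600
```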

Hecht-Nielsen [162] asked whether these symmetries exhaust the possibilities. Sussmann [362] showed that, aside from these symmetries, the weights of a feedforward single-hidden-layer network with tanh nodes are uniquely determined by the input-output map, provided that the network is irreducible (i.e., that no nodes can be removed without affecting the output). The results have been extended to reducible networks with more general node nonlinearities [232].

Hecht-Nielsen [163] showed that these symmetries give the weight space a structure of cone- or wedge-shaped regions that differ only by symmetry. The cones are otherwise identical, so each contains weight vectors for every input-output function the network can implement. In principle, a training system could restrict its search to a single cone and still cover all possible input-output functions. Because M can be very large, this could reduce the size of the search space by a huge amount. Unfortunately, the remaining space is still huge. There might be some benefit for nonlocal methods such as the genetic algorithm, since this would limit redundancy in the search. (An empirical test using a simulated annealing method on the 2-input XOR problem showed a reduction in search time of about 1/2 [198].) For local search (e.g., gradient) techniques, however, there is no good reason to stay inside a single cone because, after all, the cones are identical. It might also seem counterproductive because introducing the cone boundary as a hard constraint could give rise to additional poor local minima at the boundary. The cone boundaries are natural divisions arising from the symmetry, however, so pure gradient descent naturally stays in its starting cone [164], and there is no need for special measures to restrict the weight vector to a single cone.
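One way to make the cone picture concrete is to map each weight vector to a canonical representative of its symmetry class. The normalization sketched below (flip signs so every hidden bias is nonnegative, then sort the hidden nodes by bias) is only an illustrative choice, not Hecht-Nielsen's construction, and it breaks down when biases are zero or tied; but it shows how the 2^H H! equivalent weight vectors collapse to a single point.

```python
import numpy as np

def canonicalize(W1, b1, W2):
    """Map the weights of a tanh hidden layer to one representative of its
    symmetry class (illustrative normalization only, not the construction in [163])."""
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    # Sign-flip symmetry: make every hidden bias nonnegative.
    flip = np.where(b1 < 0, -1.0, 1.0)
    W1 *= flip                 # scales each hidden node's input weights (columns of W1)
    b1 *= flip
    W2 *= flip[:, None]        # and its output weights (rows of W2)
    # Permutation symmetry: order the hidden nodes by bias (ties would need a tiebreak).
    order = np.argsort(b1)
    return W1[:, order], b1[order], W2[order, :]
```

Two weight vectors related by the symmetries above then map to the same representative (barring ties), which makes the redundancy easy to check numerically.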