# Chapter 7 - Weight-Initialization Techniques

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II

## 7.1 Random Initialization

The normal initialization procedure is to set weights to "small" random values. The randomness is intended to break symmetry while small weights are chosen to avoid immediate saturation.

Symmetry breaking is needed to make nodes compute different functions. If all nodes had identical weight vectors, then all nodes in a layer would respond identically and the layer would function as if it contained just one node. Likewise, each node would receive identical error information during back-propagation so weight changes would be identical and the weights would never have a chance to become different.

Small weights are needed to avoid immediate saturation because large weights could amplify a moderate input to produce an extremely large weighted sum at the inputs of the next layer. This would push the nodes well into the flat regions of their nonlinearities and learning would be very slow because of the small derivatives [241]. The weights should not be too small, however, because learning speed would then be limited by the small δ values back-propagated through the weights. Another factor is that the origin is a saddle point on many error surfaces.

Typically, weights are randomly selected from a range such as $(-A/N, +A/N)$, where N is the number of inputs (fan-in) to the node and A is a constant between 2 and 3. Division by the fan-in compensates for the increase in the variance of the weighted-input sum with the number of inputs; without it, the sum could sometimes be large for large N and the node would saturate often. The range $(-2.4/N, +2.4/N)$, where N is the number of node inputs, is another commonly cited choice [236].

Suppose the weights are selected from a range $[-W_0, +W_0]$. Many studies, for example [121], [235], [240], [393], [110], [241], [321], [368], have observed that an intermediate range of $W_0$ values works best. Extreme values either do not converge or converge to poor solutions. Very small initial weights make it hard to escape from the $\mathbf{w} = 0$ weight vector, which is a poor local minimum or saddle point in many problems. With large $W_0$ values, on the other hand, many nodes saturate, derivatives are small, and the net does not converge to a good solution in a reasonable amount of time. Within the range of values that work, the exact value is usually not critical. Thimm and Fiesler [368] suggest that it is better to choose a value that is too small rather than one that is too big, because performance deteriorates very quickly once the upper threshold is crossed.
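As a concrete sketch, fan-in-scaled uniform initialization might look like the following in NumPy. The function name is mine and the default A = 2.4 is just one value in the 2-to-3 range discussed above:

```python
import numpy as np

def init_layer(fan_in, fan_out, A=2.4, rng=None):
    """Fan-in-scaled uniform initialization in (-A/fan_in, +A/fan_in),
    with A between 2 and 3 as discussed above (function name is mine)."""
    rng = np.random.default_rng() if rng is None else rng
    w0 = A / fan_in
    return rng.uniform(-w0, w0, size=(fan_out, fan_in))

# 10 nodes, each with fan-in 100; all weights lie in (-0.024, +0.024)
W = init_layer(fan_in=100, fan_out=10, rng=np.random.default_rng(0))
```

Larger fan-ins get proportionally smaller ranges, keeping the weighted sums away from the flat regions of the nonlinearity.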

### 7.1.1 Calculation of the Initial Weight Range

One basis for selecting an initial weight distribution is to assume the inputs have some statistical distribution and select the initial weight distribution so that the probability of saturating the node nonlinearity is small.

Assume that the weights are independent of the inputs

$$E[w_i x_i] = E[w_i]\,E[x_i] \tag{7.1}$$

and that the weights are zero-mean, independent, and identically distributed

$$E[w_i] = 0 \tag{7.2}$$

$$E[w_i w_j] = \sigma_w^2\,\delta_{ij} \tag{7.3}$$

where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.

The weighted sum into a node with N inputs is

$$u = \sum_{i=1}^{N} w_i x_i \tag{7.4}$$

and by independence of w and x the expected value is

$$E[u] = \sum_{i=1}^{N} E[w_i]\,E[x_i] = 0 \tag{7.5}$$

Because the expected value is 0, the variance is then

$$E[u^2] = \sum_{i=1}^{N}\sum_{j=1}^{N} E[w_i w_j]\,E[x_i x_j] = \sigma_w^2 \sum_{i=1}^{N} E[x_i^2]$$

Note that this does not require that the inputs be independent. Independence of $w_i$ and $w_j$ suppresses the effect of correlations between $x_i$ and $x_j$ on $E[u^2]$. If the inputs are zero-mean and identically distributed, so $E[x_i^2] = \sigma_x^2$, then

$$E[u^2] = N\,\sigma_w^2\,\sigma_x^2 \tag{7.6}$$

and

$$\sigma_u = \sigma_w\,\sigma_x\,\sqrt{N} \tag{7.7}$$

For tanh nonlinearities with output $y = \tanh(u/2)$, the input u needed to produce an output y is

$$u = \ln\frac{1+y}{1-y} \tag{7.8}$$

Let us say the node is saturated for $|y| > 0.9$, so $u_{sat} = \ln 19 \approx 2.94$. For sigmoid nodes, the constant is the same if saturation is taken to occur at $\mathrm{sigmoid}(u_{sat}) = 0.95$.

We want the probability that $|u| > u_{sat}$ to be small. This can be achieved by selecting the initial weight distribution so that $u_{sat}$ is several times $\sigma_u$. With, for example,

$$u_{sat} = 2\,\sigma_u \tag{7.9}$$

the probability that a given node will be saturated is about 5%. This assumes a Gaussian distribution for u, which is reasonable when N is large because of the central-limit theorem.
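The 5% figure can be checked numerically. The sketch below assumes bipolar inputs and Gaussian weights scaled so that $u_{sat} = 2\sigma_u$; for a Gaussian u, the two-tailed probability beyond $2\sigma_u$ is about 4.6%:

```python
import numpy as np

# Monte Carlo check of the ~5% saturation estimate, assuming bipolar
# inputs and Gaussian weights chosen so that u_sat = 2 * sigma_u.
rng = np.random.default_rng(0)
N = 100                                  # fan-in
u_sat = np.log(19.0)                     # saturation point (eq. 7.8)
sigma_w = u_sat / (2.0 * np.sqrt(N))     # sigma_x = 1 for bipolar inputs

x = rng.choice([-1.0, 1.0], size=(10000, N))     # bipolar input patterns
w = rng.normal(0.0, sigma_w, size=(10000, N))    # one weight draw per trial
u = np.sum(w * x, axis=1)                        # weighted sums
frac = float(np.mean(np.abs(u) > u_sat))         # fraction saturated, ~0.046
```

With these inputs u is exactly Gaussian, so the empirical fraction converges to $P(|Z| > 2) \approx 0.046$.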

**Uniform Weights.** For weights initialized from a uniform distribution over the interval $[-W_0, +W_0]$,

$$\sigma_w^2 = \frac{W_0^2}{3} \tag{7.10}$$

$$E[u^2] = \frac{N\,W_0^2\,\sigma_x^2}{3} \tag{7.11}$$

$$\sigma_u = W_0\,\sigma_x\,\sqrt{N/3} \tag{7.12}$$

and the choice $u_{sat} = 2\sigma_u$ gives

$$W_0 = \frac{\sqrt{3}\,u_{sat}}{2\,\sigma_x\,\sqrt{N}} \tag{7.13}$$

For bipolar inputs $x_i \in \{-1, +1\}$ with equal probability for either value, $\sigma_x = 1$. Then, for $u_{sat} = \ln 19$,

$$W_0 = \frac{\sqrt{3}\,\ln 19}{2\sqrt{N}} \approx \frac{2.55}{\sqrt{N}} \tag{7.14}$$

For bipolar inputs with probability $P[x=1] = p$, the second moment is still $E[x_i^2] = 1$, and

$$W_0 = \frac{\sqrt{3}\,u_{sat}}{2\sqrt{N}} \tag{7.15}$$

For binary inputs $x_i \in \{0, 1\}$ with probability $P[x=1] = p$, $E[x_i^2] = p$ and

$$W_0 = \frac{\sqrt{3}\,u_{sat}}{2\sqrt{Np}} \tag{7.16}$$

and for $p = 1/2$

$$W_0 = \frac{\sqrt{3}\,u_{sat}}{\sqrt{2N}} \approx \frac{3.61}{\sqrt{N}} \tag{7.17}$$

For uniform inputs in the range $[-a, +a]$, $\sigma_x^2 = a^2/3$ and

$$W_0 = \frac{3\,u_{sat}}{2a\sqrt{N}} \approx \frac{4.42}{a\sqrt{N}} \tag{7.18}$$

For Gaussian $N(0, \sigma_x)$ inputs,

$$W_0 = \frac{\sqrt{3}\,u_{sat}}{2\,\sigma_x\,\sqrt{N}} \approx \frac{2.55}{\sigma_x\sqrt{N}} \tag{7.19}$$

**Gaussian Weights.** Similarly, for weights initialized from a Gaussian $N(0, \sigma_w)$ distribution,

$$\sigma_u = \sigma_w\,\sigma_x\,\sqrt{N} \tag{7.20}$$

and the choice $u_{sat} = 2\sigma_u$ gives

$$u_{sat} = 2\,\sigma_w\,\sigma_x\,\sqrt{N} \tag{7.21}$$

$$\sigma_w = \frac{u_{sat}}{2\,\sigma_x\,\sqrt{N}} \tag{7.22}$$

For $u_{sat} = \ln 19$,

$$\sigma_w = \frac{\ln 19}{2\,\sigma_x\,\sqrt{N}} \approx \frac{1.47}{\sigma_x\sqrt{N}} \tag{7.23}$$

For bipolar inputs $x_i \in \{-1, +1\}$ with equal probability for either value, $\sigma_x = 1$ and

$$\sigma_w \approx \frac{1.47}{\sqrt{N}} \tag{7.24}$$

For bipolar inputs with probability $P[x=1] = p$, the second moment is still $E[x_i^2] = 1$, and

$$\sigma_w = \frac{u_{sat}}{2\sqrt{N}} \tag{7.25}$$

For binary inputs $x_i \in \{0, 1\}$ with probability $P[x=1] = p$, $E[x_i^2] = p$ and

$$\sigma_w = \frac{u_{sat}}{2\sqrt{Np}} \tag{7.26}$$

and for $p = 1/2$

$$\sigma_w = \frac{u_{sat}}{\sqrt{2N}} \approx \frac{2.08}{\sqrt{N}} \tag{7.27}$$

For uniform inputs in the range $[-a, +a]$, $\sigma_x = a/\sqrt{3}$ and

$$\sigma_w = \frac{\sqrt{3}\,u_{sat}}{2a\sqrt{N}} \approx \frac{2.55}{a\sqrt{N}} \tag{7.28}$$

For multilayer networks, the inputs in the derivation above could actually be the outputs of a preceding layer. Because they have different statistical properties from the overall system inputs, each layer of weights will have a different ideal initialization range according to this approach. For large fan-ins, the weighted sum u into a node usually approaches a Gaussian distribution. If the initial weights into the node are chosen to avoid saturation, the distribution of the node outputs will also be approximately Gaussian but with a standard deviation multiplied by the slope s of the nonlinearity, σout=sσu. Similar adjustments are appropriate for nodes receiving weights from several different layers, as might happen in networks containing "leap-frog" weights.

It should be noted that the derivation depends on assumptions about the input distribution, which may not apply in a particular problem. In many problems, inputs will be clustered and the large fan-in assumption may not be valid for small networks. An alternative to relying on possibly invalid assumptions is to calculate an appropriate range numerically using the same basic procedure. The calculations are relatively simple and fast in most cases.
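One hypothetical way to carry out this numerical calculation, assuming NumPy and Gaussian weights: bisect on $\sigma_w$ until the empirical saturation probability on the actual training inputs reaches a target. The helper name and the 5% target are illustrative, not from the text:

```python
import numpy as np

def calibrate_sigma_w(X, u_sat=np.log(19.0), p_target=0.05,
                      trials=500, rng=None):
    """Numerically pick a Gaussian weight std for the actual inputs X
    (helper name and 5% target are illustrative). Bisects on sigma_w
    until the empirical probability of |u| > u_sat is about p_target."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_patterns, N = X.shape

    def p_sat(sigma):
        w = rng.normal(0.0, sigma, size=(trials, N))
        u = X @ w.T                     # weighted sums, (n_patterns, trials)
        return np.mean(np.abs(u) > u_sat)

    lo, hi = 1e-6, 10.0                 # saturation prob. grows with sigma
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if p_sat(mid) < p_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For bipolar inputs this should land near ln(19)/(2*sqrt(N)) ~ 0.15,
# matching the analytic result; with real data no such formula is needed.
X = np.random.default_rng(1).choice([-1.0, 1.0], size=(200, 100))
sigma_w = calibrate_sigma_w(X)
```

Because the procedure uses the actual input patterns, it remains valid when the inputs are clustered or the fan-in is too small for the Gaussian approximation.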

Table 7.1: Weight Initialization Parameters (with $u_{sat} = \ln 19$; N is the fan-in).

| Input distribution | $E[x^2]$ | Uniform weights: $W_0$ | Gaussian weights: $\sigma_w$ |
|---|---|---|---|
| Bipolar $\{-1, 1\}$ | $1$ | $\sqrt{3}\,u_{sat}/(2\sqrt{N}) \approx 2.55/\sqrt{N}$ | $u_{sat}/(2\sqrt{N}) \approx 1.47/\sqrt{N}$ |
| Binary $\{0, 1\}$, $P[x=1]=p$ | $p$ | $\sqrt{3}\,u_{sat}/(2\sqrt{Np})$ | $u_{sat}/(2\sqrt{Np})$ |
| for $p = 1/2$ | $1/2$ | $\approx 3.61/\sqrt{N}$ | $\approx 2.08/\sqrt{N}$ |
| Uniform $(-a, a)$ | $a^2/3$ | $3\,u_{sat}/(2a\sqrt{N}) \approx 4.42/(a\sqrt{N})$ | $\sqrt{3}\,u_{sat}/(2a\sqrt{N}) \approx 2.55/(a\sqrt{N})$ |
| Gaussian $N(0, \sigma_x)$ | $\sigma_x^2$ | $\sqrt{3}\,u_{sat}/(2\sigma_x\sqrt{N}) \approx 2.55/(\sigma_x\sqrt{N})$ | $u_{sat}/(2\sigma_x\sqrt{N}) \approx 1.47/(\sigma_x\sqrt{N})$ |
The objective of this initialization method is to minimize the probability that nodes will be saturated in the early stages of training. A potential problem, pointed out by Wessels and Barnard [393] (see section 7.1.4), is that this makes every node sensitive to all the training patterns, so the decision boundaries of many nodes may move large distances before settling to a stable state. In some cases, hyperplanes may move completely out of the region occupied by the training data and produce stray nodes that contribute little useful information to the rest of the net. They suggest that occasional saturation helps to "pin the hyperplanes to the data." Because a node's derivative is small where it is saturated, each node will be most sensitive to patterns near its hyperplane and relatively insensitive to more distant patterns. Initially, at least, each node would be loosely specialized by sensitization to a different fraction of the data.

### 7.1.2 Initialization to Maximize BP Deltas

The derivation of section 7.1.1 provides criteria for selecting a weight initialization range for the input-to-hidden weights. The initialization range of the hidden-to-output weights can be selected to maximize the expected magnitude of the back-propagated deltas at the hidden nodes [321]. The expected magnitude of the back-propagated error is an increasing function of the weight range for small weight ranges. (If all the weights were zero, the back-propagated error would be zero.) But for large weight ranges, the output nodes saturate often, so the back-propagated deltas are small. Rojas [321] reports that $W_0$ values between 0.5 and 1.5 give similar results in empirical tests.

### 7.1.3 Initialization of Bias Weights

It was noted in section 3.1 that the distance d of a node's hyperplane from the origin is controlled by the bias weight

$$d = \frac{|w_{bias}|}{\|\mathbf{w}\|} \tag{7.29}$$

where w is the weight vector excluding the bias weight. When weights are initialized randomly, d will sometimes be large and the hyperplane may be far from the region containing the inputs. A remedy is to choose $|w_{bias}| < \|\mathbf{w}\|$ so the initial hyperplane always intersects the unit hypercube around the origin. This idea is mentioned in Palubinskas [294], among other places.
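A minimal sketch of this remedy, assuming NumPy (the helper name is mine): draw the bias as a random fraction of the weight-vector norm, so the hyperplane distance is always less than 1.

```python
import numpy as np

def init_bias(w, rng=None):
    """Draw a bias with |w_bias| < ||w||, so the initial hyperplane
    distance d = |w_bias| / ||w|| (eq. 7.29) is less than 1."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-1.0, 1.0) * np.linalg.norm(w)

w = np.array([0.3, -0.1, 0.2])               # non-bias weights of one node
b = init_bias(w, rng=np.random.default_rng(0))
d = abs(b) / np.linalg.norm(w)               # hyperplane distance, < 1
```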

### 7.1.4 Constrained Random Initialization I

Some heuristics for setting the initial weights in order to decrease the chance of the network becoming trapped in a local minimum are discussed by Wessels and Barnard [393]. The following types of irregularities are defined:

Type 1. Stray hidden nodes whose decision boundaries have drifted out of the region of the input space sampled by the examples. These nodes have nearly constant activation for all training inputs and contribute little useful information.

Type 2. Hidden nodes duplicating function due to failure of symmetry breaking.

Type 3. Hidden node configurations that result in all nodes being inactive in some regions of the input space, making the network insensitive to inputs there.

Type 1 and 2 errors are common with random initialization. As an alternative, initial weights can be chosen systematically so that the following occur [393]:

• The decision boundary of every hidden node crosses the region covered by the samples (to avoid type 1 errors).

• The decision boundaries have a wide range of different orientations (to avoid type 2 errors).

• The transition region of each hidden node covers about 20% of the input space (to avoid type 2 errors). When weights are initialized with small values, the nodes tend to compute nearly linear functions. Because the sum of two linear functions is also linear, the net might use two nodes to do what could be done by one. Initializing with weights large enough to make the node functions somewhat nonlinear helps to avoid this.

• Every part of the sampled region has at least one active hidden node (to avoid type 3 errors).

The initial hidden-to-output weights are set to the same small value, for example, 0.25. Random hidden-to-output weights are not required because symmetry is broken by the way the input-to-hidden weights are initialized. It might even be counterproductive if it masks the activities carefully set up in initializing the hidden node weights. Because the error back-propagated to a hidden node is proportional to its output weights, setting the weights equal makes each hidden node equally responsive to all the outputs. Very small values cause slow learning while values larger than 1 tend to cause saturation because large δ values are propagated back from the output nodes. The 0.25 value was based on empirical tests. Performance was said to be relatively insensitive to the exact value as long as it was not large enough to cause saturation (in which case performance dropped off drastically).

### 7.1.5 Constrained Random Initialization II

A similar method is described by Nguyen and Widrow [283]. Weight vectors are chosen with random directions, magnitudes are adjusted so each node is linear over a fraction of the input space with some overlap of linear regions between nodes with similar directions, and thresholds are set so the hyperplanes have random distances from the origin within the region occupied by the input data.

The following recipe gives similar results. Let w represent the weight vector excluding the bias and let θ denote the bias weight. The weighted sum into a node is $u = \mathbf{w}^T\mathbf{x} + \theta$.

1. First, set the weights so each vector has a random direction. A Gaussian or other spherically symmetric distribution should be used because this makes all directions equally likely; a uniform distribution tends to favor directions pointing to corners of the hypercube.

2. Adjust the magnitude of w so the linear region covers a fraction of the input space. The best width for the linear region depends on the number of hidden nodes; with fewer hidden nodes, the linear region must be wider so that every point in the input space is covered by the linear region of some node.

The linear region of a sigmoid-like node roughly covers the range from $-u_{sat}$ to $+u_{sat}$. For tanh nodes, $u_{sat} = \ln 19 \approx 2.94$. If the inputs lie in the interior of the unit hypersphere, the maximum weighted sum occurs when x is parallel to w, giving

$$|u| \le \|\mathbf{w}\| \tag{7.30}$$

In input space, the linear region is then a slab of width $2u_{sat}/\|\mathbf{w}\|$ centered on the hyperplane. To make the linear region approximately 0.4 wide (1/5 of the diameter of the input space), $\|\mathbf{w}\|$ should be about 5 times $u_{sat}$. Normalizing w to the magnitude

$$\|\mathbf{w}\| = 5\,u_{sat} \approx 14.7 \tag{7.31}$$

gives this result.

3. Set the threshold so the distance of the hyperplane from the origin has a random distribution between 0 and 1 (again assuming inputs lie in the unit hypersphere). The distance of the hyperplane from the origin is $d = \theta/\|\mathbf{w}\|$, so choose

$$\theta = \tau\,\|\mathbf{w}\| \tag{7.32}$$

where τ is a random number between 0 and 1.
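Assuming NumPy and the unit-hypersphere input scaling above, the three steps might be sketched as follows (function name mine):

```python
import numpy as np

def recipe_init(n_hidden, n_in, u_sat=np.log(19.0), rng=None):
    """Sketch of the three-step recipe above (function name is mine):
    1. random directions via a spherically symmetric Gaussian draw,
    2. magnitudes ||w|| = 5*u_sat, so each linear region spans about
       1/5 of the unit-hypersphere input space,
    3. thresholds theta = tau * ||w||, giving random hyperplane
       distances d = theta/||w|| in [0, 1] (eq. 7.32)."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=(n_hidden, n_in))                          # step 1
    w *= (5.0 * u_sat) / np.linalg.norm(w, axis=1, keepdims=True)  # step 2
    theta = rng.uniform(0.0, 1.0, size=n_hidden) * 5.0 * u_sat     # step 3
    return w, theta

w, theta = recipe_init(n_hidden=8, n_in=3, rng=np.random.default_rng(0))
```

A Gaussian draw in step 1 keeps all directions equally likely, as the text notes; normalizing a uniform draw instead would bias hyperplane normals toward the corners of the hypercube.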

### 7.1.6 Remarks

• Many random weight initialization methods attempt to specify an appropriate range of initial weights. The equivalence between scaling weights by a constant factor and introducing a gain term in the sigmoid function means that similar results can be obtained by gain-scaling (section 8.7).

• There is some suggestion that on-line learning can tolerate saturation problems caused by large initial weights better than batch learning [241].

• In [368], no significant difference was found between uniform, normal, and unbalanced uniform distributions for initializing higher-order perceptrons. Empirical tests favored the method of section 7.1.4, but other methods gave similar results.

• The effects of initial weights on convergence time are examined by Kolen and Pollack [216], [217]. Plots of convergence time vs. initial weights (displayed as two-dimensional slices through the weight space) show fractal structure. Convergence regions are separated from nonconvergence regions by complex borders and certain mappings cannot be learned from initial weights in certain regions of the weight space.