Books24x7 Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks

16.6 Information Minimization

A heuristic for improving generalization based on the idea of information minimization is described by Kamimura, Takagi, and Nakanishi [205]. The uncertainty of a sigmoidal node is taken to be maximum when its activation is 0.5. A pseudo-entropy of the network for a particular set of patterns is defined as

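$$H = -\sum_{k=1}^{K} \sum_{i=1}^{M} \left[ v_i^k \log v_i^k + \left(1 - v_i^k\right) \log\left(1 - v_i^k\right) \right] \tag{16.5}$$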

where K is the number of input patterns, M is the number of hidden units, and v^k_i is the activation of unit i for pattern k. The information in the network is given as

The entropy is used as a penalty function to minimize the information contained in the network so the augmented error function is

where E_o is the standard sum of squared errors. Minimizing E' adds the term

to the weight adjustment rule, giving

Here, δ^k_i = ∂E_k/∂a_i is the back-propagation delta term calculated in section 5.2. The use of H as a penalty term makes this an example of a regularization method. It also has effects similar to weight decay because (i) the entropy of a sigmoidal node is maximum when its output is 0.5, (ii) the output is 0.5 when the net input is 0, and (iii) the net input is 0 when the incoming weights are 0. That is, shrinking the weights toward zero tends to maximize H and therefore to minimize -H.
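The weight-decay analogy can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the pseudo-entropy takes the usual binary-entropy form summed over patterns and hidden units, that the hidden units are logistic sigmoids, and that the penalty -H is minimized on its own, with the error term E_o left out so the effect of the penalty is isolated. The function names, array shapes, and step size are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (assumed hidden-unit nonlinearity)."""
    return 1.0 / (1.0 + np.exp(-a))

def pseudo_entropy(v, eps=1e-12):
    """Assumed binary-entropy form of H, summed over K patterns and M hidden units.

    v: array of shape (K, M) holding the hidden activations v_i^k.
    """
    v = np.clip(v, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(v * np.log(v) + (1.0 - v) * np.log(1.0 - v))

def neg_entropy_grad(W, b, X):
    """Gradient of the penalty -H with respect to hidden weights W and biases b.

    For a logistic unit, dH/dv = -log(v / (1 - v)) = -a and dv/da = v (1 - v),
    so d(-H)/da = a * v * (1 - v).
    """
    a = X @ W.T + b              # net inputs, shape (K, M)
    v = sigmoid(a)
    g = a * v * (1.0 - v)        # d(-H)/da per pattern and unit
    return g.T @ X, g.sum(axis=0)

rng = np.random.default_rng(0)
K, N, M = 20, 3, 4               # patterns, inputs, hidden units (arbitrary)
X = rng.normal(size=(K, N))
W = rng.normal(size=(M, N))
b = rng.normal(size=M)

H_before = pseudo_entropy(sigmoid(X @ W.T + b))
norm_before = np.sqrt(np.sum(W**2) + np.sum(b**2))

eta = 0.01                       # small step size (hypothetical)
for _ in range(200):
    gW, gb = neg_entropy_grad(W, b, X)
    W -= eta * gW
    b -= eta * gb

H_after = pseudo_entropy(sigmoid(X @ W.T + b))
norm_after = np.sqrt(np.sum(W**2) + np.sum(b**2))
H_max = K * M * np.log(2.0)      # every activation exactly 0.5

print(f"H: {H_before:.3f} -> {H_after:.3f} (max {H_max:.3f})")
print(f"weight norm: {norm_before:.3f} -> {norm_after:.3f}")
```

Under these assumptions, the gradient of -H with respect to a unit's net input is a v (1 - v), which has the same sign as the net input a. Each descent step therefore pulls the net inputs toward zero, so the entropy rises toward K M log 2 while the overall weight norm shrinks, much as it would under weight decay.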

