A heuristic for improving generalization based on the idea of information minimization is described by Kamimura, Takagi, and Nakanishi [205]. The uncertainty of a sigmoidal node is taken to be maximum when its activation is 0.5. A pseudo-entropy of the network for a particular set of patterns is defined as
where K is the number of input patterns, M is the number of hidden units, and vki is the activation of unit i for pattern k. The information in the network is given as
The entropy is used as a penalty function to minimize the information contained in the network so the augmented error function is
where Eo is the standard sum of squared errors. Minimizing E' adds the term
to the weight adjustment rule, giving
Here, δki = ∂Ek/∂ai is the back-propagation delta term calculated in section 5.2. Because H enters the error function as a penalty term, this is an example of a regularization method. The penalty also has effects similar to weight decay because (i) the entropy of a sigmoidal node is maximum when its output is 0.5, (ii) the output is 0.5 when the input is 0, and (iii) the input is 0 when the input weights are 0; that is, driving the weights toward zero tends to minimize -H.
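The weight-decay argument can be checked numerically with a minimal sketch. It rests on assumptions that are not taken from the text: the pseudo-entropy is assumed to be the sum of binary entropies of the hidden activations, H = -Σk Σi [vki log vki + (1 - vki) log(1 - vki)], and the augmented error is assumed to be E' = Eo - λH, with λ a hypothetical penalty weight. The function names and the toy network sizes are likewise illustrative, and the form used in [205] may differ in normalization.

```python
import numpy as np


def sigmoid(a):
    """Logistic activation: v = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def pseudo_entropy(v, eps=1e-12):
    """Assumed form: binary entropy summed over patterns k and hidden units i."""
    v = np.clip(v, eps, 1.0 - eps)  # avoid log(0) for saturated units
    return -np.sum(v * np.log(v) + (1.0 - v) * np.log(1.0 - v))


def neg_entropy_grad_wrt_input(a):
    """d(-H)/da for v = sigmoid(a).

    dH/dv = log((1 - v) / v) = -a and dv/da = v (1 - v),
    so d(-H)/da = a * v * (1 - v).
    """
    v = sigmoid(a)
    return a * v * (1.0 - v)


def augmented_error(e0, v, lam):
    """Assumed penalty form: E' = Eo - lam * H (lam is hypothetical)."""
    return e0 - lam * pseudo_entropy(v)


# Toy setup: K patterns, N inputs, M hidden units (sizes are arbitrary).
rng = np.random.default_rng(0)
K, N, M = 5, 4, 3
X = rng.normal(size=(K, N))
W = rng.normal(size=(N, M))

h_random = pseudo_entropy(sigmoid(X @ W))               # nonzero weights
h_zero = pseudo_entropy(sigmoid(X @ np.zeros((N, M))))  # all weights zero
h_max = K * M * np.log(2.0)                             # every unit at 0.5

print("H with random weights:", h_random)
print("H with zero weights:  ", h_zero)
print("Upper bound K*M*log 2:", h_max)
```

Under the assumed form, the gradient of -H with respect to a unit's net input a is a·v(1 - v), which has the same sign as a. Gradient descent on -H therefore pushes every net input toward zero, which is the weight-decay-like behavior described above, and zero weights attain the maximum entropy K·M·log 2.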