16.6 Information Minimization
A heuristic for improving generalization based on the idea
of information minimization is described by Kamimura, Takagi, and Nakanishi [205]. The uncertainty of a sigmoidal node is taken to be
maximum when its activation is 0.5. A pseudo-entropy of the network for a
particular set of patterns is defined as
$$H = -\sum_{k=1}^{K} \sum_{i=1}^{M} \left[ v_{ki} \log v_{ki} + (1 - v_{ki}) \log(1 - v_{ki}) \right] \qquad (16.5)$$
where $K$ is the number of input patterns, $M$ is the number of hidden units, $v_{ki}$ is the activation of hidden unit $i$ for pattern $k$, and the logarithms are natural. The information in the network is given as the decrease in entropy from its maximum,

$$I = H_{\max} - H,$$

where $H_{\max} = KM \log 2$ is the value of $H$ when every activation equals 0.5.
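For concreteness, here is a minimal NumPy sketch of these two quantities. The function names, the `eps` guard against $\log 0$, and the array layout (rows = patterns, columns = hidden units) are our assumptions, not the authors' notation.

```python
import numpy as np

def pseudo_entropy(V, eps=1e-12):
    """Pseudo-entropy H of eq. 16.5 for a K x M array V of
    hidden-unit activations (rows = patterns, cols = units).
    Natural logs; eps keeps log() away from 0 and 1."""
    V = np.clip(V, eps, 1.0 - eps)
    return -np.sum(V * np.log(V) + (1.0 - V) * np.log(1.0 - V))

def information(V):
    """Information I = H_max - H, where H_max = K*M*log(2)
    is the entropy when every activation equals 0.5."""
    K, M = V.shape
    return K * M * np.log(2.0) - pseudo_entropy(V)
```

Activations near 0 or 1 contribute almost no entropy, so a network whose hidden units commit strongly to one state carries high information under this measure.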
The entropy is used as a penalty function to minimize the information contained in the network, so the augmented error function is

$$E' = E_o - \lambda H,$$

where $E_o$ is the standard sum of squared errors and $\lambda > 0$ controls the strength of the penalty. Minimizing $E'$ by gradient descent adds the term

$$\eta \lambda \frac{\partial H}{\partial w_{ij}} = -\eta \lambda \sum_{k=1}^{K} a_{ki}\, v_{ki} (1 - v_{ki})\, x_{kj}$$

to the weight adjustment rule, giving

$$\Delta w_{ij} = -\eta \sum_{k=1}^{K} \left[ \delta_{ki} + \lambda\, a_{ki}\, v_{ki} (1 - v_{ki}) \right] x_{kj}.$$

Here, $\delta_{ki} = \partial E_k / \partial a_{ki}$ is the back-propagation delta term calculated in section 5.2, $a_{ki}$ is the net input to hidden unit $i$ for pattern $k$, $x_{kj}$ is the $j$th input to the unit, and $\eta$ is the learning rate; the entropy term follows from $\partial H / \partial a_{ki} = -a_{ki}\, v_{ki} (1 - v_{ki})$ for the logistic sigmoid.
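To make the update concrete, the sketch below applies it to the input-to-hidden weight matrix of a single hidden layer. The names (`W`, `X`, `delta`, `entropy_penalty_step`) and the vectorized layout are our assumptions; `delta` is taken to be the matrix of ordinary back-propagation deltas $\delta_{ki}$ computed elsewhere.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def entropy_penalty_step(W, X, delta, lam, eta):
    """One gradient step on E' = E_o - lam * H for the hidden-layer
    weights W (M x N).  X is the K x N input matrix and delta the
    K x M matrix of back-prop deltas dE_k/da_ki.  The entropy term
    adds lam * a * v * (1 - v) inside the bracket, from
    dH/da = -a * v * (1 - v) for the logistic sigmoid."""
    A = X @ W.T                      # net inputs a_ki, shape K x M
    V = sigmoid(A)                   # hidden activations v_ki
    bracket = delta + lam * A * V * (1.0 - V)
    return W - eta * bracket.T @ X   # sums over patterns k
```

Setting `lam = 0` recovers plain back-propagation; increasing it pushes the net inputs toward zero, which produces the weight-decay-like behavior discussed next.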
The use of $H$ as a penalty term makes this an example of a regularization method. It also has effects similar to weight decay because (i) the entropy of a sigmoidal node is maximum when its output is 0.5, (ii) the output is 0.5 when the net input is 0, and (iii) the net input is 0 when the incoming weights are 0; that is, minimizing the weights would tend to minimize $-H$.
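A quick numerical check of this chain, reusing the `pseudo_entropy` and `sigmoid` helpers sketched above (the data shapes and scale factors here are arbitrary):

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))    # K=8 patterns, N=5 inputs
W = rng.standard_normal((3, 5))    # M=3 hidden units
for scale in (1.0, 0.1, 0.0):      # shrink the weights toward zero
    H = pseudo_entropy(sigmoid(X @ (scale * W).T))
    print(f"scale={scale}: H={H:.3f}")   # H rises toward K*M*log 2 = 16.636
```

As the weights shrink, every activation approaches 0.5 and $H$ approaches its maximum $KM \log 2$, i.e. $-H$ is minimized.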