Figure 1.1: (a) A real neuron collects input signals from
other neurons through a dendritic tree, integrates the information, and
distributes its response to other neurons via the axon. (b) An artificial neuron
model.
Figure 1.2: The sigmoid is a continuous, bounded, monotonic
function of its input x. It saturates at 0 for large negative inputs and at 1
for large positive inputs. Near zero, it is approximately linear.
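As a concrete illustration (not part of the original figure), a minimal NumPy sketch of the sigmoid and its saturation behavior:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: continuous, bounded in (0, 1), and monotonic."""
    return 1.0 / (1.0 + np.exp(-x))

# Saturates near 0 and 1 at the extremes; approximately linear near zero.
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"sigmoid({x:+5.1f}) = {sigmoid(x):.4f}")
```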
Figure 2.1: Supervised learning is often viewed as a function
approximation problem. Given a set {(xi, ti)}, i = 1 … M, of training pairs with
inputs xi and target outputs ti, the goal is to find a function f(x) that
captures the input-output relationships illustrated in the training examples,
f(xi) ≈ ti. If the search is successful, the new function can then be used to
estimate the correct output for new points not in the original training set.
Ideally, the functional form may also be more compact and faster to evaluate.
Figure 2.2: Supervised learning model. In supervised
learning, a desired output is specified for every pattern encountered during
training. Differences between the network output and training target are
treated as errors to be minimized by the training algorithm.
Figure 2.3: Supervised learning can be applied to many
different error functions. The figure illustrates a piecewise linear error
function with upper and lower tolerance limits; the error is zero when f(x) is
within the limits. Functions like this are sometimes useful in engineering
applications.
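A minimal sketch of one such error function (a plausible form, not necessarily the exact function plotted in the figure):

```python
def tolerance_error(y, target, tol):
    """Piecewise-linear error with upper and lower tolerance limits:
    zero inside [target - tol, target + tol], linear outside."""
    return max(0.0, abs(y - target) - tol)

print(tolerance_error(1.05, 1.0, 0.1))  # within tolerance -> 0.0
print(tolerance_error(1.30, 1.0, 0.1))  # outside -> ~0.2
```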
Figure 2.4: In supervised learning with a distal teacher
the network output drives another system, T, which produces the final output.
This makes the teacher's job easier because targets can be specified at a
higher, less detailed, level. When T is a known function, the low-level
signals needed for network training can be derived from the high-level errors.
Figure 2.5: Learning with a distal teacher, robot arm
example: (a) a robot arm, (b) a neural network translates input commands to
joint angles that put the manipulator in the desired position after
transformation by the physics of the robot arm.
Figure 3.1: A single-layer perceptron has no hidden layers.
One layer of weights connects the inputs directly to the outputs. The outputs
are independent so this network can be treated as three separate networks.
Figure 3.2: Node function. Each node i computes a weighted
sum of its inputs and passes the result through a nonlinearity, typically a
bounded monotonic function such as the sigmoid.
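The node function in code form (a sketch; the weights and inputs below are made up for illustration):

```python
import numpy as np

def node_output(w, x):
    """A node's output: weighted sum of inputs passed through a
    bounded monotonic nonlinearity (the sigmoid here)."""
    u = np.dot(w, x)                  # u = wTx
    return 1.0 / (1.0 + np.exp(-u))

w = np.array([0.5, -1.0, 2.0])        # hypothetical weights
x = np.array([1.0, 0.5, 0.25])        # hypothetical inputs
print(node_output(w, x))              # a value in (0, 1)
```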
Figure 3.3: Projection of x onto w. The output of a single
unit is determined by the inner product of the input vector x and the weight
vector w: u = wTx = ||w|| ||x|| cos φ. All inputs x with the same projection
onto w produce the same output. The locus of points with equivalent projections
onto w defines a hyperplane perpendicular to w.
Figure 3.4: Effect of bias weights. A linear threshold unit
divides its input space into two half-spaces: (a) without bias, the dividing
surface must pass through the origin and certain data sets will not be
separable; (b) with bias, the dividing surface can be offset from the origin
to obtain better classification.
Figure 3.5: Graded responses: (a) a hard-limiter divides
the input space with a hyperplane, (b) a sigmoid gives a similar response, but
has a smoother transition at the boundary, and (c) a sigmoid with small
weights gives an almost linear response for a wide range of inputs.
Figure 3.6: (a) Linearly separable classes can be separated
by a hyperplane. In two dimensions, the hyperplane is a line. (b) classes that
are not linearly separable cannot be separated by a hyperplane.
Figure 3.7: The two-input exclusive-OR function is a simple
binary function that is not linearly separable and thus not learnable by a
single-layer perceptron. In more than two dimensions, it generalizes to the
parity function, which is 1 when the number of active inputs is odd and 0 when
it is even.
Figure 3.8: Capacity of a hyperplane. Shown is the
probability that a random dichotomy on N points in d dimensions is linearly
separable, for various values of d. The probability is greater than 1/2 for N
< 2(d + 1), and less than 1/2 for N > 2(d + 1). As d increases, the
transition from high to low probability becomes steeper and approaches a step
function.
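The curve in the figure follows from Cover's function-counting argument; a sketch of the underlying formula (assuming points in general position, with the bias giving d + 1 free parameters):

```python
from math import comb

def frac_separable(N, d):
    """Fraction of the 2^N dichotomies of N points in general position
    in d dimensions that a hyperplane with bias can realize; equals
    exactly 1/2 at N = 2(d + 1)."""
    c = 2 * sum(comb(N - 1, k) for k in range(d + 1))
    return min(c, 2 ** N) / 2 ** N

# The transition from ~1 to ~0 sharpens as d increases.
for d in (1, 5, 25):
    print(d, [round(frac_separable(n * (d + 1), d), 3) for n in (1, 2, 3)])
```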
Figure 3.9: The separation problem: (a) a weight vector
must be found that separates the positive samples (open circles) from the
negative samples (filled circles). (b) The transformation z = tx reflects the
negative examples across the origin and changes the problem to one of finding
a weight vector such that w · z > 0 for each transformed pattern z. D(w) is
the minimum distance from any point z to the separating hyperplane defined by
w. (c) The difficulty of a problem is determined by how much w can be varied
and still meet the separating criterion. Easy problems can be satisfied by
many w vectors.
Figure 4.1: MLP structure. A multilayer perceptron is a
cascade of single-layer perceptrons. There is a layer of input nodes and a
layer of output nodes with one or more hidden layers in between.
Figure 4.2: Three layers of linear threshold units are
sufficient to define arbitrary regions. Units in the first hidden layer divide
the input space with hyperplanes, units in the second hidden layer can form
convex regions bounded by these hyperplanes, and output units can combine the
regions defined by the second layer into arbitrarily shaped, possibly
unconnected, regions. Here, the boundaries are piecewise linear, but any
smooth boundary can be approximated with enough units (based on [247]).
Figure 4.3: Single-hidden-layer networks can recognize
nonconvex decision regions. Here, eight hidden units in a single layer create
a nonconvex square donut. Each dashed line is the decision boundary of one of
the hidden units. The hidden unit is active (1) on the side of the line
indicated by the pointer and connects to the output unit with the weight
indicated next to the pointer. The other numbers show the summed input to the
output unit in different regions. Thresholding at 8.5 creates the square
donut.
Figure 4.4: Wieland and Leighton [407] provide this example
of a nonconvex region recognized by a single-hidden-layer network.
Thresholding at 8.7 creates the 'C' shaped region: (a) a nonconvex decision
region, and (b) the network. There are 10 hidden nodes. Each has one
connection with weight w = ± 1 from either the x or y input. Numbers inside
the circles are the node thresholds.
Figure 4.5: Disjoint regions recognized by
single-hidden-layer networks. In addition to nonconvex regions,
single-hidden-layer networks can also recognize disjoint (unconnected)
regions.
Figure 4.6: Certain functions can be implemented exactly by
small networks with two hidden layers but require an infinite number of nodes
to approximate if constrained to a single hidden layer [255], [254]: (a) a
simple function that can be implemented with two hidden layers, each containing
two threshold units, and (b) an approximation by a network with just one hidden
layer (from [255]).
Figure 4.7: AND and OR logic functions. Any logic function
can be implemented by a circuit of AND, OR, and NOT gates. Threshold units are
at least as powerful as logic circuits because they can implement all the
necessary elementary functions, as well as others: (a) the AND function, (b)
linear threshold unit implementation of AND, (c) the OR function, and (d)
linear threshold unit implementation of OR. The NOT function can be obtained
from a single-input unit with a large negative weight and a zero threshold.
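A sketch of (b) and (d) as code; the particular weights and thresholds below are illustrative choices, not necessarily those shown in the figure (the NOT variant here uses a bias rather than the bipolar scheme described above):

```python
def threshold_unit(weights, bias, inputs):
    """Linear threshold unit: output 1 when the weighted sum of the
    inputs (plus bias) is positive, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

AND = lambda a, b: threshold_unit([1, 1], -1.5, [a, b])
OR  = lambda a, b: threshold_unit([1, 1], -0.5, [a, b])
NOT = lambda a:    threshold_unit([-1],   0.5, [a])   # bias variant of NOT

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```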
Figure 5.1: Feedforward indexing in an unlayered network.
The nodes in a feedforward network can always be indexed so that i > j if
the state of node i depends on the state of node j (perhaps indirectly). Arbitrary
connections are allowed from nodes with low indexes to nodes with higher
indexes, but not vice versa; i > j implies wji ≡ 0. This network has
no particular function, but illustrates short-cut connections, apparently
lateral (but still feedforward) connections, and the fact that outputs can be
taken from internal nodes.
Figure 5.2: Forward propagation. In the forward pass, the
input pattern is propagated through the network to obtain the output. Each
node computes a weighted sum of its inputs and passes this through a
nonlinearity, typically a sigmoid or tanh function.
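A minimal sketch of the forward pass (the layer sizes and tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

def forward(weights, biases, x):
    """Forward pass: each layer computes weighted sums of its inputs
    and passes them through a nonlinearity (tanh here)."""
    activations = [x]
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]  # a 2/3/1 net
biases = [np.zeros(3), np.zeros(1)]
print(forward(weights, biases, np.array([0.5, -0.5]))[-1])    # network output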
Figure 5.3: Backward propagation. Node deltas are
calculated in the backward pass, starting at the output nodes and proceeding
backwards. At each hidden node i, δi is calculated as a weighted linear
combination of values δk, k > i, of the nodes k to which node i sends
outputs. Note that the δ values travel backward against the normal direction
of the connecting links.
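Continuing the forward-pass sketch above, the deltas might be computed as follows (assuming tanh nodes and an SSE error function):

```python
import numpy as np

def backward(weights, activations, target):
    """Backward pass: deltas start at the output and travel backward
    through the same weights, against the direction of the links."""
    y = activations[-1]
    delta = (y - target) * (1.0 - y ** 2)        # output deltas (tanh, SSE)
    deltas = [delta]
    for l in range(len(weights) - 1, 0, -1):
        a = activations[l]                       # hidden-layer activations
        delta = (weights[l].T @ delta) * (1.0 - a ** 2)
        deltas.insert(0, delta)
    return deltas

# Weight gradients then follow directly:
#   dE/dW[l] = outer(deltas[l], activations[l])
```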
Figure 5.4: Batch-mode back-propagation is a close
approximation to true gradient descent. At each step, the weights are adjusted
in the direction that minimizes the error the fastest. When the learning rate
is small, the weights trace a smooth trajectory down the gradient of the error
surface.
Figure 5.5: In on-line learning, weight updates occur after
each pattern presentation. The single-pattern derivative terms can be viewed
as noisy estimates of the gradient. They are not parallel to it in general,
but on average they have a positive projection onto it (because the gradient
is the sum of the single-pattern terms) so the error usually decreases after
most weight changes. Some terms may have negative projections or large
orthogonal deviations, however, which may cause the error to increase
occasionally.
Figure 5.6: In on-line learning, patterns are generally
presented in a random, changing order and weights are updated after each
pattern presentation. Instead of smoothly rolling down the error gradient, the
weight vector dances along a semi-random path, mostly moving downhill, but
occasionally jumping uphill. Upon reaching a low spot, the weight vector
jitters around the minimum but is unable to settle unless the step size is
reduced. Circles show the weights at the start of each epoch and line segments
show the single-pattern steps within each epoch.
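The update scheme can be sketched for a simple linear unit (the model and data here are hypothetical; in the figures the single-pattern steps are back-propagation updates):

```python
import numpy as np

def grad_single(w, x, t):
    """Single-pattern SSE gradient: a noisy estimate of the full gradient."""
    return (w @ x - t) * x

def online_epoch(w, X, T, eta, rng):
    """One epoch of on-line learning: present patterns in a fresh random
    order and update the weights after each single presentation."""
    for i in rng.permutation(len(X)):
        w = w - eta * grad_single(w, X[i], T[i])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
T = X @ np.array([1.0, -2.0, 0.5])               # hypothetical linear target
w = np.zeros(3)
for epoch in range(50):
    w = online_epoch(w, X, T, 0.05, rng)
print(w)                                         # approaches [1, -2, 0.5]
```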
Figure 5.7: When patterns are presented in a cyclic order
during on-line learning, as above, the sequence of steps in epoch t + 1 tends
to be similar to the sequence in epoch t, so the weight trajectory has a
semi-periodic behavior. A danger is that the trajectory will converge to a
limit cycle, as shown, and be unable to reach the minimum. One way to break
the cycle is to change the pattern selection order. Reduction of the learning
rate may also break the cycle or, if done gradually, may reduce its diameter,
allowing it to close in on the minimum. Circles show the weights at the
start of each epoch and line segments show the single-pattern steps within
each epoch.
Figure 6.1: Average training time (solid) and convergence
probability (dashed) versus learning rate for α = 0, 0.5, 0.9, 0.99. A 4/4/1
network was trained on the 4-bit parity problem with batch back-propagation
and the MSE error function. Each point is the average of 100 runs with
different random initial weights. Note: Use of the MSE error function rather
than SSE normalizes the learning rate by the size of the training set. See
section 6.1.1 for simulation details.
Figure 6.2: Actual training times versus learning rate for
various momentum values. At high learning rate and momentum, most networks
either do not converge or converge to poor solutions and become stuck. Of the
networks that do converge, however, most do so very quickly. Each vertical
strip shows points for 100 networks initialized with different random weights.
Note: Use of the MSE error function rather than SSE scales the learning rate
by the size of the training set. See section 6.1.1 for details of the
simulation.
Figure 6.3: Average training time (a) and convergence
probability (b) versus normalized learning rate for α = 0, 0.5, 0.9, 0.99.
With momentum, the effective learning rate is amplified to η' = η/(1 - α). The
same effective learning rate can be obtained from different combinations of η and
α, which yield roughly the same average training time but differ in the
location of the critical point where convergence becomes unlikely. In this
example, it appears that small learning rates with large momentum values are more
stable. Note: Use of the MSE error function rather than SSE normalizes the
learning rate by the size of the training set. Simulation details are
described in section 6.1.1.
Figure 6.4: E(t) curves for small SSE learning rates. At
low learning rates, the E(t) curves are smooth, but convergence is slow. As η
increases, convergence time decreases but convergence is less reliable, with
occasional jumps in error. All networks were initialized with the same random weights.
Figure 6.5: E(t) curves for near-critical SSE learning
rates. Below some critical learning rate, convergence time decreases as the
learning rate η increases. At some point, though, the system becomes unstable
and fails to converge. Note that the change is abrupt; at η = 0.18 the network
converges quickly but at η = 0.19 it overshoots the minimum and appears to get
trapped on a plateau. The same network and initial weight vector were used as in
figure 6.4.
Figure 6.6: With on-line learning, patterns are chosen
randomly and the weights are updated after each pattern presentation. This
introduces noise into the training process, leading to noisy E(t) curves. In
general, larger learning rates cause higher noise levels.
Figure 6.7: E(t) curves for on-line learning. On-line
learning is stochastic so different runs with identical initial weights and
training parameters will follow different trajectories. Shown are four
different trials of on-line back-propagation applied to the 4-input parity
problem with η = 0.3, α = 0. All networks started with the same initial random
weights. This learning rate appears high for this example—the noise level is
high and three of the nets get stuck in a local minimum.
Figure 6.8: On-line learning E(t) curves with and without
momentum. Momentum also has a learning rate amplification effect in on-line
learning. With a small learning rate η = 0.01 (using SSE) and no momentum,
the weight changes are small and convergence is slow. With the same learning
rate and α = 0.95, the weight changes are larger and this example converges
quickly. Both examples were initialized with the same random weights.
Figure 6.9: With momentum, the current weight change Δw(t)
is a combination of a step down the negative gradient, plus a fraction 0 ≤
α < 1 of the previous weight change. For α ≈ 1, opposing weight change
components (horizontal) approximately cancel while complementary components
(vertical) sum, producing a smoothing effect on the weight trajectory.
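The update rule described above, as a sketch:

```python
import numpy as np

def momentum_step(w, dw_prev, grad, eta, alpha):
    """Momentum update: dw(t) = -eta * grad + alpha * dw(t-1).
    Opposing successive gradient components tend to cancel in dw;
    complementary components accumulate, smoothing the trajectory."""
    dw = -eta * grad + alpha * dw_prev
    return w + dw, dw
```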
Figure 6.10: Gradient descent trajectories: (a) with a
small step size (0.01), pure gradient descent follows a smooth trajectory, but
progress may be very slow; (b) cross-stitching: when the step size is too big
(0.1) and the error surface has valleys, the trajectory may oscillate wildly
from one side to the other while creeping slowly along the length of the
valley to the minimum. Upon reaching the neighborhood of a minimum, it may
overshoot many times before settling down.
Figure 6.11: Gradient descent trajectories with momentum.
With momentum, opposite (side to side) changes tend to cancel while
complementary changes (along the length of the valley) tend to sum. The
overall effect is to stabilize the oscillations and accelerate convergence.
(a) When the step size is small, momentum acts to accelerate convergence (step
size 0.01 and momentum 0.99, cf. figure 6.10a). (b) Small amounts of momentum
also help to damp oscillations when the step size is too large (step size 0.1
and momentum 0.2, cf. figure 6.10b).
Figure 6.12: At the low end of the momentum range, increases
in α generally lead to faster convergence. Here the E(t) curves are
smooth. Occasional error spikes may occur but the system recovers quickly. All
trajectories start from the same random weights.
Figure 6.13: E(t) curves for large momentum values.
Convergence becomes unreliable when the momentum is too large for the learning
rate η. This system converges quickly with α = 0.6 but not with higher values.
With α = 0.99, E(t) oscillates strongly and the system jumps from a poor
minimum to a worse one at about t = 100. All trajectories start from the same
random weights.
Figure 6.14: Oscillation in E(t) due to momentum. With
momentum, the weight vector has inertia, which allows it to coast up hillsides
in the E(w) surface. The larger the momentum and the smaller the learning
rate, the farther it can rise and the longer it takes to stop. This, in
combination with the E(w) surface, can lead to oscillation. Mathematically,
smaller η and larger α give the dynamic system a longer time-constant,
visible here in a lower oscillation frequency. (These values were chosen to
exaggerate the effect; they are not necessarily recommended.)
Figure 7.1: Initialization from a decision tree. The
function computed by the decision tree can be duplicated by a network of
linear threshold units [340]. The tree computes a Boolean function of the
propositions evaluated at its decision nodes so a two-hidden-layer network is
sufficient if the tree's node decisions can be computed by linear threshold
elements. The network can be built by inspection of the tree. Training is not
required: (a) a decision tree, and (b) the equivalent network.
Figure 8.1: (a) The error surface of a classifier often has
many flat plateaus separated by steep cliffs. (b) Increasing the number of
samples creates more steps and moves them closer together (adapted from [181],
[183]).
Figure 8.2: With a smaller tanh gain (0.1 in this case),
the step transitions are smoother (cf. figure 8.1a). Gain scaling is equivalent
to scaling all the weights by a constant factor, however, so it does not change the
basic shape of the error surface. In the figure, it corresponds to zooming in
for a closer view of the origin.
Figure 8.3: Replacing the {-1, 1} targets with {-0.9, 0.9}
values produces a small dip at the bottom of each cliff. For illustration
purposes, the tanh gain was reduced to 1/2 to make the dip wider and smaller
targets were used to make it deeper.
Figure 8.4: When the samples are not linearly separable,
the error surface has radial troughs and ridges. A gradient based optimization
method could easily get stuck in one of the troughs corresponding to a poor
solution (adapted from [181], [183]).
Figure 8.5: Single-pattern weight update vectors. With an
SSE cost function, the E(w) surface is the sum of individual surfaces for each
pattern and the total gradient is the sum of the single-pattern gradients. On
a hillside (a), most of the vectors point in a dominant direction. On a ridge
(b) or at the bottom of a valley (c), there are often two bundles of vectors
pointing in opposite directions. At a local minimum (d) the vectors sum to
zero; they may be large and evenly distributed in all directions, or they may
all go to zero. Point (e) shows a relatively flat spot.
Figure 8.6: Local and global minima. A global minimum of a
function can be defined as its lowest point, the input that gives the lowest
possible output. A local minimum is a point that is lower than all its
neighbors, but higher than the global minimum.
Figure 8.7: The error surface for a 1/1/1 network trained
on the identity mapping illustrates that local minima do in fact exist. This
surface has a good minimum in the lower left quadrant, a poor minimum in the
upper right quadrant, and a saddle point at the origin: (a) the contour plot,
and (b) a view looking across the origin down the axis of the poor minimum.
Figure 10.1: Conjugate directions. Powell's method uses the
parallel subspace property of quadratic functions to find a set of conjugate
directions without evaluating the Hessian. Given a quadratic function xTHx, a
direction d, and two points x1 and x2, if one does a line search from x1 along
direction d to obtain a minimum y1 and another line search from x2 along
direction d to obtain y2 then direction (y2 - y1) is H-conjugate to d [311].
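A small numerical check of the parallel subspace property (H, the starting points, and the direction below are arbitrary choices):

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])       # an arbitrary positive definite H

def line_min(x, d):
    """Exact line minimization of f(x) = 0.5 * x^T H x along d."""
    t = -(d @ H @ x) / (d @ H @ d)
    return x + t * d

d = np.array([1.0, 0.0])
y1 = line_min(np.array([2.0, 1.0]), d)       # line search from x1
y2 = line_min(np.array([-1.0, 3.0]), d)      # line search from x2
print((y2 - y1) @ H @ d)                     # ~0: (y2 - y1) is H-conjugate to d
```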
Figure 10.2: Gradient descent trajectories: (a) With a
small step size (0.01), gradient descent follows a smooth trajectory, but
progress may be very slow. (b) Cross-stitching: When the step size is too big
(0.1) and the error surface has valleys, the trajectory may oscillate wildly
from one side to the other while creeping slowly along the length of the
valley to the minimum. Once it reaches the minimum, it may overshoot many
times before settling.
Figure 10.3: Gradient descent trajectories with momentum.
With momentum, opposite (side-to-side) changes tend to cancel while
complementary changes (along the length of the valley) tend to sum. The
overall effect is to stabilize the oscillations and accelerate convergence:
(a) When the step size is small, momentum acts to accelerate convergence (step
size 0.01 and momentum 0.99, cf. figure 10.2a); (b) small amounts of momentum
also help to damp oscillations when the step size is too large (step size 0.1
and momentum 0.2, cf. figure 10.2b).
Figure 10.4: Cauchy's method uses the rule: wherever you
are, evaluate the gradient and then move along that line until you find a
minimum. In each major iteration, it evaluates the gradient once and then does
a line search to find the minimum along that line.
Figure 10.5: The conjugate gradient method converges in
just N iterations (2 in this case) for an N-dimensional quadratic error
function, but each iteration involves a line search that may require many
evaluations of the error function.
Figure 10.6: For quadratic error functions, Newton's method
converges in a single step. For more general nonlinear functions, Newton's
method converges very quickly where the Hessian matrix is positive definite
(e.g., in the vicinity of a minimum) but may diverge elsewhere.
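A sketch of the Newton step on a quadratic (the H and starting point are illustrative):

```python
import numpy as np

def newton_step(w, grad, hess):
    """Newton update w - H^{-1} g: exact in one step for a quadratic,
    reliable only where the Hessian is positive definite."""
    return w - np.linalg.solve(hess, grad)

H = np.array([[3.0, 1.0], [1.0, 2.0]])       # quadratic error E = 0.5 w^T H w
w = np.array([2.0, -1.0])
print(newton_step(w, H @ w, H))              # lands exactly at the minimum [0, 0]
```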
Figure 10.7: The Levenberg-Marquardt method is a compromise
between Newton's method, which converges very quickly in the vicinity of a
minimum but may diverge elsewhere, and gradient descent, which converges
everywhere, albeit slowly. In general,the trajectory starts out like gradient
descent but gradually approaches the Newton direction.
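The compromise can be sketched as a damped Newton step (a generic form; in practice the method is usually paired with a Gauss-Newton approximation to the Hessian):

```python
import numpy as np

def lm_step(w, grad, hess, lam):
    """Levenberg-Marquardt update: solve (H + lam*I) dw = -g.  Small lam
    approaches Newton's method; large lam approaches a short
    gradient-descent step of size ~1/lam."""
    return w - np.linalg.solve(hess + lam * np.eye(len(w)), grad)
```

In practice λ is raised when a step increases the error and lowered when it decreases it, which is what makes the trajectory start out like gradient descent and approach the Newton direction near the minimum.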
Figure 11.1: Selection. The genetic algorithm selects units
for reproduction with probability proportional to their fitness. Units with
higher fitness (corresponding to more slots on the wheel) are more likely to
be chosen than units with lower fitness, but all units have some chance of
being chosen.
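A sketch of the roulette-wheel scheme (assuming nonnegative fitness values):

```python
import random

def select(population, fitness):
    """Roulette-wheel selection: probability of being chosen is
    proportional to fitness, but every unit has some chance."""
    r = random.uniform(0.0, sum(fitness))
    acc = 0.0
    for unit, f in zip(population, fitness):
        acc += f
        if r <= acc:
            return unit
    return population[-1]                 # guard against rounding

print(select(["a", "b", "c"], [1.0, 3.0, 6.0]))  # "c" about 60% of the time
```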
Figure 11.2: Crossover. Crossover mixes genetic codes
inherited from two parents by crossing the bit strings at one or two random
points. The bit strings encode characteristics of the parents so the offspring
receives traits of both parents, but is not identical to either. The crossing
point is random, so the mix of characteristics transferred varies with each
mating.
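One-point crossover as a sketch (two-point crossover cuts at two random points instead):

```python
import random

def crossover(parent_a, parent_b):
    """One-point crossover: cut both bit strings at a random point and
    swap the tails, mixing the parents' traits."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

print(crossover("11111111", "00000000"))  # e.g. ('11100000', '00011111')
```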
Figure 11.3: The function considered in the example.
Figure 11.4: Representation of a neural network as a LISP
expression. W represents multiplication by an input weight and P represents a
summation followed by the node nonlinearity.
Figure 12.1: In many constructive algorithms, new units are
added when the rate of improvement of the training error becomes small.
Normalization by the magnitude of the error E(t0) obtained after the previous
unit was added makes the triggering condition less dependent on the size of
the error.
Figure 12.2: Cascade-correlation adds hidden units one at a
time. Each new unit receives connections (shown as solid dots) from the
external inputs and all existing hidden units. Its input weights are chosen to
maximize the covariance of its response with the residual output error.
Weights into existing hidden units remain unchanged. Then the weights from all
the hidden units (new and old) to the output node are retrained to minimize
the error. If the resulting error is acceptable, the network is complete;
otherwise a new hidden unit is added and the process repeated.
Figure 12.3: The two-spirals problem [233] is sometimes
used as a benchmark for constructive algorithms because it requires careful
coordination of many hidden nodes and is difficult to learn with simple
back-propagation in an MLP network. (It is not representative of most
real-world problems, however.) In a single-hidden-layer architecture, 40 or
more hidden nodes are generally needed and training times are long. Most
successful solutions use more than one hidden layer; some use short-cut
connections.
Figure 12.4: The upstart algorithm [128] starts with a
single output unit Z. If errors remain after training, then daughter units X
and Y are inserted to correct it. Ideally, X corrects Z when it is wrongly ON
and Y corrects Z when it is wrongly OFF. If either X or Y cannot correct all
the errors assigned to them, additional subunits are introduced to correct
their errors and so on. The result is a tree structure for which there is an
equivalent single-hidden-layer network.
Figure 12.5: The tiling algorithm [265] adds a new hidden
layer at each iteration. A master unit in each layer attempts to classify as
many patterns as possible based on input from the preceding layer. If it
cannot classify all patterns correctly, then 'ancillary' units are added until
the layer produces a 'faithful' representation such that any two input
patterns with different targets produce a different pattern of activity on the
layer. This then serves as input for the next layer. It is always possible to
create a master unit which makes fewer errors than the master unit in the
previous layer so convergence is guaranteed if enough layers are added.
Figure 12.6: The algorithm of Marchand, Golea, and Ruján
[256] creates a single-hidden-layer network of linear threshold units by
adding hidden units sequentially. Each new unit slices off a group of training
patterns that share the same target value. Filled and empty circles represent
training points with positive and negative targets. Lines show the hidden unit
decision surfaces; the numbers to the side indicate the order in which the
hidden units were created. All points are classified correctly, although some
are very close to the boundary.
Figure 12.7: Constructive methods can be based on a Voronoi
tessellation of the training points. The Voronoi diagram of a set of base
points partitions the space into cells depending on which base point is
closest. Each cell is a convex region bounded by hyperplanes. A layered
network can be constructed which forms the same partition and generates the
required outputs in each cell.
Figure 13.1: Effect of pruning: (a) response of an
overtrained 2/50/10/1 network, (b) response after pruning (659 of the 671
original weights and 56 of the 64 original nodes have been removed to produce
a 2/2/2/1 network—8 nodes including the bias node), (c) response after further
training.
Figure 13.2: Pruning error versus weight saliency of
optimal brain damage. Shown are the changes in error due to deletion of a
weight versus the OBD saliency for a 10/3/10 network trained on the simple
autoencoder problem.
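The saliency on the horizontal axis is, in its standard form, s_k = ½ H_kk w_k², the second-order estimate of the error increase when weight k is removed. A sketch (the weights and Hessian terms below are made up):

```python
import numpy as np

def obd_saliency(w, hessian_diag):
    """OBD saliency: s_k = 0.5 * H_kk * w_k^2, a second-order estimate
    of the error increase when weight k is deleted."""
    return 0.5 * hessian_diag * w ** 2

w = np.array([0.1, -2.0, 0.5])           # hypothetical weights
h = np.array([4.0, 0.3, 1.0])            # hypothetical diagonal Hessian terms
print(np.argsort(obd_saliency(w, h)))    # prune lowest-saliency weights first
```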
Figure 13.3: Pruning error versus weight magnitude. One of
the simplest pruning heuristics is that small weights can be deleted since
they are likely to have the least effect on the output. Shown are the errors
due to deletion of a weight versus weight magnitude for a 10/3/10 network
trained on the simple autoencoder problem. There are two separate,
approximately linear trends. The upper group contains input-to-hidden weights
while the lower group contains hidden-to-output weights.
Figure 13.4: The normal weight-decay penalty term penalizes
large weights heavily, which discourages their use even when they might be
helpful. The weight-elimination penalty term in equation 13.19 saturates so
large weights do not incur excess penalties once they grow past a certain
value. This allows large weights when needed, while still encouraging the
decay of small weights.
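A sketch contrasting the two penalties, using the common saturating form w²/(w0² + w²) for weight elimination (equation 13.19 may differ in detail):

```python
def weight_decay_penalty(w):
    """Quadratic penalty: grows without bound, so large weights are
    always discouraged."""
    return w ** 2

def weight_elimination_penalty(w, w0=1.0):
    """Saturating penalty: approaches 1 once |w| >> w0, so large weights
    incur no extra cost past a certain size."""
    return w ** 2 / (w0 ** 2 + w ** 2)

for w in (0.1, 1.0, 3.0, 10.0):
    print(w, weight_decay_penalty(w), round(weight_elimination_penalty(w), 3))
```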
Figure 13.5: A pruning problem. The original network is
underconstrained for the 2-input XOR problem and chooses a correct but
unexpected solution. (The training points are the four corners of the square.)
A naive pruning algorithm is able to remove many redundant elements, but the
resulting network is unlikely to generalize better than the original.
Figure 14.1: Samples alone do not provide enough
information to uniquely determine an interpolating function. An infinite
number of functions can be fit through the sample points; all are equally
valid according to the sample set error, and other criteria are needed to
choose among them.
Figure 14.2: Underfitting and overfitting by a minimum-MSE
approximation using M evenly spaced Gaussian basis functions with widths
inversely proportional to M: (a) underfitting (M = 3), the approximation is
too simple, (b) perhaps a reasonable fit (M = 5), and (c) possible overfitting
(M = 30).
Figure 14.3: Generalization error versus network complexity
for a minimum-MSE fit using M evenly spaced Gaussian basis functions for
various sample sizes N. For intermediate sample sizes, the curve has a minimum
at some intermediate value. (Each point is the average of 200 trials. Training
samples are evenly spaced with some jitter but no additive noise. The target
function is shown in the inset.)
Figure 14.4: For neural networks used as function
approximators, generalization usually means interpolation, not extrapolation.
A 1/5/1 network with tanh hidden nodes and a linear output was trained on 50
points uniformly distributed in the first half cycle of f(x) = sin(2πx). The
fit is good near the training data, but poor elsewhere.
Figure 14.5: A 2/50/10/1 network with 671 weights trained
on 31 points is very underconstrained. Although the points are nearly linearly
separable, with some overlap, the decision surface is very nonlinear and is
unlikely to generalize well. The solid line shows the network decision
surface, the 0.5 contour. Other contours are omitted because the transitions
are steep.
Figure 14.6: Generalization error versus sample size N for
the Gaussian basis function approximation system of figure 14.3.
Figure 14.7: Training and test set errors versus sample
size N for selected M values of the Gaussian basis function approximation
system of figure 14.3. In general, the training set error is small for small
sample sizes and rises as the number of samples increases while the test set
error is large for small sample sizes and decreases as the number of samples
increases. At large N, both approach the same asymptotic value.
Figure 14.8: Generalization error versus M for various
values of the additive noise variance σ. With a fixed sample size and
increasing amounts of additive noise, minima of the generalization error
curves occur at lower M values. The fitting function is the system of Gaussian
basis functions used in figure 14.3. Training sample size N = 20. Each point
is the average of 200 trials.
Figure 14.9: As training progresses, the generalization
error may decrease to a minimum and then increase again as the network adapts
to idiosyncrasies of the training data. Training and test errors are shown for
an underconstrained network trained by back-propagation on 30 examples of the
iris data and tested on the 120 remaining examples. (An underconstrained net
and small learning rate were used to demonstrate overtraining. Smaller
networks can learn the data in a much shorter time.)
Figure 14.10: One explanation of overtraining is that the
training error surface is similar to the true error surface but somewhat
distorted so the apparent minimum is offset from the true minimum. The figure
shows hypothetical error surfaces. Depending on how the network is
initialized, the weight trajectory may pass over a true minimum on its way to the
apparent (training error) minimum.
Figure 14.11: Different training algorithms may yield
different generalization results. Shown are responses for a 2/2/1 net trained
by various algorithms on the 2-bit XOR problem. Batch and on-line training
appear to give reasonably smooth symmetric responses here (values in
parentheses are the learning rate and momentum). RProp and conjugate gradient
descent (CGD) appear to create less symmetric surfaces and allow sharp
transitions near the training points at the corners. Networks (h), (j), and
(k) appear sensitive to the training data and are unlikely to generalize well
if the input values are changed slightly.
Figure 16.1: Effect of training with weight decay. A
2/50/10/1 network was trained using normal back-propagation for 200 epochs.
Then weight decay was set to 1E-4 and training resumed for a total of 5000
epochs. Unlike figure 14.5, the decision surface is very simple and does not
show obvious signs of overtraining.
Figure 16.2: Effects of weight decay: (a) response of a
2/50/10/1 network trained by batch back-propagation until all patterns were
correctly classified at about 11,000 epochs; (b) response after 20,000 epochs
of a network trained from the same starting point with weight decay increasing
to 0.001 at 4000 epochs; and (c) response of the network in (b) after 1000
more epochs with the learning rate decreased to 0.01.
Figure 17.1: Convolution tends to be a smoothing operation.
A step function, t(x), convolved with a Gaussian, pn(x), produces the Gaussian
cumulative distribution. This resembles the original step function, but it is
a smooth function similar to the sigmoid.
Figure 17.2: A nearest neighbor classification problem: (a)
the Voronoi map for 14 points; and (b) the expected value of the
classification given the noisy input as calculated in (17.5) for a Gaussian
noise distribution with σ = 0.1. (c) The convolution of the training set with
the same Gaussian noise. The zero contours of (b) and (c) coincide. (1s and 0s
indicate classes; values 1 and -1 were actually used.)
Figure 17.3: Smoothing effects of training with jitter: (a)
an intentionally overtrained 2/50/10/1 feedforward network chooses an overly
complex boundary and can be expected to generalize poorly; (b) the same
network trained with Gaussian (σ = 0.1) input noise forms a much smoother
boundary and better generalization can be expected; and (c) the expression in
equation 17.5 for the expected value of the target function at an arbitrary
point x.
Figure 17.4: The conventional sigmoid 1/(1 + e^-x) and the
Gaussian cumulative distribution function (GCDF) (with σ = 4/√(2π)) have very
similar shapes and give similar results when used as the node nonlinearities.
The GCDF is useful in this analysis because it is shape invariant when
convolved with a spherical Gaussian noise density.
Figure 17.5: Equivalence of weight scaling and jitter
averaging: (a) the transfer function of the original network and (b) its
contour plot; (c) the average response with additive Gaussian input noise, σ =
0.1, averaged over 2000 noisy samples per grid point and (d) its contour plot;
and (e) the expected response computed by scaling and (f) its contour plot.
Figure 17.6: Weight-decay effects of training with jitter:
(a) weights for the overtrained network of figure 17.3(a), σ = 0.7262; and (b)
weights for the jitter-trained network of figure 17.3(b), σ = 0.3841.
Figure 17.7: Training with static noise: (a) response of an
underconstrained 2/50/10/1 net. The boundary is complex and transitions are
steep, but the data is almost linearly separable. (b) Response of the same net
trained on an enlarged data set obtained by replacing each original training
point by 30 noisy points (σ = 0.1). The boundary is simpler and transitions
are more gradual, but a few kinks remain. (1s and 0s denote the training
points, the training values were 0.9 and -0.9.)
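The enlarged data set described above could be generated as follows (a sketch; the parameter names are illustrative):

```python
import numpy as np

def jitter_dataset(X, T, copies=30, sigma=0.1, seed=0):
    """Replace each training point with `copies` noisy versions:
    Gaussian jitter on the inputs, targets left unchanged."""
    rng = np.random.default_rng(seed)
    Xj = np.repeat(X, copies, axis=0)
    Xj = Xj + rng.normal(scale=sigma, size=Xj.shape)
    Tj = np.repeat(T, copies, axis=0)
    return Xj, Tj
```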
Figure 17.8: Cross-validation with jittered data. An
artificial validation data set was created by generating 30 jittered points
from each of the original 31 training points: (a) response of an
underconstrained 2/50/10/1 net trained to convergence, and (b) response of the
net with the best RMS error on the validation set (1s and 0s denote the
training points, the training values were 0.9 and -0.9).
Figure 17.9: Smoothing an overtrained response. Given an
overtrained net, a better estimate of the true function at a point x might be
obtained by averaging a number of probes around x using a noisy input. The
figure shows the expected response of the network in figure 17.7(a) to a noisy
input (σ = 0.1) (1s and 0s denote the training points, the training values
were 0.9 and -0.9).
Figure B.1: Correlated data are partially redundant because
knowledge of one variable gives approximate knowledge of the other variables.
In two dimensions, perfectly correlated data lie on a straight line; in n
dimensions, they lie on a lower dimensional subspace.
Figure B.2: Effect of nonzero-mean on the eigenvectors of
the correlation matrix. (a) For zero-mean data, the eigenvectors of the
correlation matrix indicate the main axes of data variation and the
eigenvalues reflect the variance along each dimension. (b) For nonzero-mean
data, the eigenvectors are rotated and the eigenvalues are influenced by the
length of the offset vector.
Figure B.3: An autoencoder network maps an input vector to
itself. Given an input, the goal is to reproduce it at the output. A small
hidden layer acts as a bottleneck and forces the network to find a compressed
representation of the input pattern. With a linear network and a least squares
error function, the ideal internal representation is related to the principal
components of the training data.
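A sketch of that connection: the best k-unit linear bottleneck attains the same least squares reconstruction error as projection onto the top k principal components (random correlated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data
X = X - X.mean(axis=0)                                   # zero-mean

# Principal directions from the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                                    # bottleneck width
codes = X @ Vt[:k].T                                     # compressed representation
recon = codes @ Vt[:k]                                   # best rank-k reconstruction
print(np.mean((X - recon) ** 2))   # the least squares floor a k-unit
                                   # linear autoencoder can reach
```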
Figure B.4: Although principal components analysis
sometimes produces directions useful for discriminating between classes, this
is not always true. In (a) the main contribution to the variance of the input
data comes from the separation between class means so the principal component
directions are useful for discriminating between the classes. In (b) however,
the major axis of variation is along direction v1 but the classes are
separated along the minor direction v2.
Figure B.5: The projection vectors found by discriminant
analysis (MDA) and principal components analysis (PCA) for a simple data set.
Figure B.6: Projection histograms for the MDA and PCA
directions shown in figure B.5: (a) the principal components projection shows
cluster overlap, and (b) the discriminant analysis projection separates the
clusters well.
Figure B.7: A single-hidden-layer linear network trained to
perform classification with a 1-of-N target representation implements a form
of discriminant analysis [385, 134].