|
Equations 3.1 and 3.2 are fundamental to most of the networks considered later, so it is useful to examine them more closely. The locus of points x with a constant sum u = wTx defines a hyperplane perpendicular to the vector w. The Euclidean vector norm ||x|| = √(Σi xi²) measures vector length. Because wTx = ||w|| ||x|| cos φ, where φ is the angle between w and x, u is proportional to the projection ||x|| cos φ of x onto w, and all points with equal projections produce equal outputs (figure 3.3). The locus of points with equal projections onto w is a hyperplane orthogonal to w, so the output y is a function of the distance from x to the hyperplane defined by w; that is, the constant-output surfaces of (3.2) are hyperplanes perpendicular to w.
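The projection argument can be checked numerically. The sketch below (using NumPy, with an arbitrary weight vector chosen for illustration) confirms that two inputs differing only by a component orthogonal to w have equal weighted sums, and that u equals ||w|| times the projection of x onto w:

```python
import numpy as np

# Illustrative weight vector (not from the text); ||w|| = 5.
w = np.array([3.0, 4.0])

x1 = np.array([2.0, 1.0])
orth = np.array([-4.0, 3.0]) / 5.0       # unit vector orthogonal to w
x2 = x1 + 2.0 * orth                     # same projection onto w as x1

u1, u2 = w @ x1, w @ x2
proj = (w @ x1) / np.linalg.norm(w)      # ||x1|| cos(phi), the projection onto w

assert np.isclose(u1, np.linalg.norm(w) * proj)  # u = ||w|| * projection
assert np.isclose(u1, u2)                # equal projections -> equal outputs
```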
Orientation The orientation of the node hyperplane is determined by the direction of w. This depends on the relative sizes of the weights wi but not on the overall magnitude of w. Let ei be the unit vector aligned with the ith coordinate axis, for example, e1 = (1, 0, 0, …, 0) (wi will still be used to refer to the ith component of a vector, however). The angle φi between the hyperplane normal and the ith coordinate axis is then

cos φi = wTei / (||w|| ||ei||) = wi/||w||.
The orientation of the plane is independent of the magnitude of w because the ratios wi/||w|| remain constant when w is multiplied by a constant.
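A small numerical check of this invariance, with an arbitrary weight vector chosen for illustration:

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])                 # ||w|| = 3
cosines = w / np.linalg.norm(w)               # cos(phi_i) = w_i / ||w||

scaled = 10.0 * w                             # multiply w by a constant
cosines_scaled = scaled / np.linalg.norm(scaled)

# The direction cosines, and hence the hyperplane orientation, are unchanged.
assert np.allclose(cosines, cosines_scaled)
```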
Distance from the Origin As noted previously, the constant-output surfaces of (3.2) are hyperplanes perpendicular to w. More specifically, the weighted sum Σjwjxj = 0 defines a hyperplane through the origin. Inclusion of a threshold, or bias, term θ,

u = Σjwjxj − θ, (3.3)

shifts the hyperplane along w to a distance d = θ/||w|| from the origin. To see this, let v be the vector from the origin to the closest point on the plane. It must be normal to the plane, and thus parallel to w, so v = dw/||w||. The node hyperplane is the locus of points where u = 0, that is, where wTx = θ. Substituting v gives wTv = d wTw/||w|| = d||w|| = θ, so d = θ/||w||.
Figure 3.4 illustrates the utility of the bias term. Without bias, the decision surface must pass through the origin and so will be unable to separate some data sets. Addition of a bias allows the surface to be shifted away from the origin to obtain better classification. To simplify analyses, the threshold is usually absorbed into the weight vector by assuming that one of the inputs is held constant, xbias = 1, so that its weight plays the role of −θ. The constant input is called the bias node.
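The geometry above can be verified numerically. The sketch below assumes the threshold enters the weighted sum as u = wTx − θ (consistent with the distance d = θ/||w||); the weight vector and input are arbitrary illustrations:

```python
import numpy as np

w = np.array([3.0, 4.0])                 # ||w|| = 5
theta = 10.0

d = theta / np.linalg.norm(w)            # distance of hyperplane from origin
v = d * w / np.linalg.norm(w)            # closest point on the plane to origin
assert np.isclose(w @ v, theta)          # v satisfies w.x = theta, so u = 0

# Absorbing the threshold: constant input x_bias = 1 with weight -theta.
x = np.array([2.0, 1.0])
u = w @ x - theta
u_absorbed = np.append(w, -theta) @ np.append(x, 1.0)
assert np.isclose(u, u_absorbed)
```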
Gradation The node nonlinearity ƒ in (3.2) controls how the output varies as the distance from x to the node hyperplane changes. As noted, ƒ is usually chosen to be a bounded monotonic function. When ƒ is a binary hard-limiting function as in a linear threshold unit, the node divides the input space with a hyperplane, producing 0 for inputs on one side of the plane and 1 for inputs on the other side. With a softer nonlinearity such as the sigmoid, the transition from 0 to 1 is smoother but other properties are similar.
The magnitude of w in equation 3.3 plays the role of a scaling parameter that can be varied to obtain transitions of varying steepness. The slope of the transition is ||∂y/∂x|| = f′(u)||w||, which is proportional to the magnitude of the weight vector. For large ||w||, the slope is steep and the sigmoid approximates a step function. For small ||w||, the slope is shallow and y(x) is nearly linear over a wide range of inputs. Figure 3.5 illustrates functions with various degrees of gradation. In every case, the output is solely a function of the distance of the input from the hyperplane.
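The effect of ||w|| on steepness can be seen directly with a logistic sigmoid; the weight vectors and test point below are arbitrary illustrations:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

w = np.array([1.0, 0.0])
x = np.array([0.1, 0.0])            # a point slightly off the hyperplane w.x = 0

y_small = sigmoid((1.0 * w) @ x)    # small ||w||: nearly linear response
y_large = sigmoid((100.0 * w) @ x)  # large ||w||: near-step transition

assert 0.5 < y_small < 0.55         # output barely above 0.5 close to the plane
assert y_large > 0.99               # output already saturated at the same point
```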