### Appendix A - Linear Regression

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II

## Overview

It is useful to review linear regression because the mathematics of single-layer perceptrons is very similar, because more general networks are often cascades of single-layer networks, and because linear analyses are often useful first-order approximations of nonlinear systems. Consider a linear system

$$y = w^T x \tag{A.1}$$

where $x$ is an input vector and $w$ is a weight vector to be determined. Let $d$ be the desired output for the given input $x \in \mathbb{R}^N$ and assume $x$ and $d$ have stationary statistics. The output $y$ is one-dimensional here but the derivation is easily generalized to higher output dimensions. The single-sample error is $e = d - y$ and the squared error is

$$e^2 = (d - y)^2 = \left(d - w^T x\right)^2 \tag{A.2}$$

Let the error function be 1/2 the mean squared error (to suppress a factor of 2 later on)

$$E = \frac{1}{2}\,E\!\left[e^2\right] = \frac{1}{2}\,E\!\left[\left(d - w^T x\right)^2\right] \tag{A.3}$$

Note, $E$ is the error function but $E[\cdot]$ denotes an expected value. Let $P = E[dx]$ and $R = E\!\left[xx^T\right]$. $P$ is the input-target correlation vector with elements $p_j = E[d x_j]$ and $R$ is the input autocorrelation matrix with elements $r_{ij} = E[x_i x_j]$. Then (A.3) can be written
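As an illustrative sketch (not from the text), $P$ and $R$ can be estimated from a finite sample by averaging; the variable names and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 3                       # number of samples, input dimension
X = rng.normal(size=(N, n))          # each row is an input vector x
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical underlying weights
d = X @ w_true                       # desired outputs d for each input

P = X.T @ d / N                      # sample estimate of P = E[dx]
R = X.T @ X / N                      # sample estimate of R = E[xx^T]
```

With unit-variance independent inputs, `R` approaches the identity matrix as the sample grows.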

$$E = \frac{1}{2}\,E\!\left[d^2\right] - w^T P + \frac{1}{2}\,w^T R\, w \tag{A.4}$$

which is a quadratic function of $w$. $R$ is a real symmetric matrix and, being an autocorrelation matrix, is positive-semidefinite, $w^T R\, w = E\!\left[(w^T x)^2\right] \ge 0$, so $E$ has a single global minimum. The derivative of $E$ with respect to $w$ is

$$\frac{\partial E}{\partial w} = R\, w - P \tag{A.5}$$

Setting this to zero produces

$$R\, w = P \tag{A.6}$$

which can be solved to obtain the optimum weight vector w*

$$w^* = R^{-1} P \tag{A.7}$$

Numerical analysis texts suggest several ways to solve systems of linear equations that may be preferable to inversion when R is poorly conditioned.
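As a minimal NumPy sketch (names and data are illustrative), the sample statistics can be formed and the normal equations $R w = P$ solved with `np.linalg.solve`, which avoids forming $R^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                  # 500 input vectors
w_true = np.array([2.0, -1.0, 0.5])            # hypothetical true weights
d = X @ w_true + 0.01 * rng.normal(size=500)   # slightly noisy targets

P = X.T @ d / len(d)                           # sample estimate of E[dx]
R = X.T @ X / len(d)                           # sample estimate of E[xx^T]

w_star = np.linalg.solve(R, P)                 # solves R w = P, as in (A.6)
```

`np.linalg.lstsq(X, d)` reaches the same solution via an orthogonal factorization, which is typically more robust when $R$ is poorly conditioned.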

Substituting (A.7) into (A.4) and simplifying produces an expression for the minimum error obtained

$$E_{min} = \frac{1}{2}\left(\sigma_d^2 + \mu_d^2 - P^T R^{-1} P\right) \tag{A.8}$$

where $\mu_d$ and $\sigma_d$ are the mean and standard deviation of the target. The minimum error is smaller when $\mu_d = 0$, which is reasonable because (A.1) does not include an offset term to absorb a nonzero target mean.
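Expression (A.8) can be checked numerically by comparing it against the error measured directly at $w^*$ (a sketch under the same sampling assumptions as above; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))
d = X @ np.array([1.0, 3.0]) + 0.5 * rng.normal(size=2000)  # noisy targets

P = X.T @ d / len(d)
R = X.T @ X / len(d)
w_star = np.linalg.solve(R, P)

E_direct = 0.5 * np.mean((d - X @ w_star) ** 2)   # 1/2 MSE evaluated at w*
E_formula = 0.5 * (np.mean(d ** 2) - P @ w_star)  # (A.8), using E[d^2] = sigma_d^2 + mu_d^2
```

On sample estimates the two quantities agree exactly, since $R w^* = P$ holds by construction.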

It can be shown that the error is a quadratic function of the difference w - w*

$$E = E_{min} + \frac{1}{2}\left(w - w^*\right)^T R \left(w - w^*\right) \tag{A.9}$$

It is minimum at $w = w^*$ and increases quadratically with the difference $w - w^*$. At the optimum, the error is uncorrelated with the input

$$E[e\,x] = E[dx] - E\!\left[xx^T\right] w^* = P - R\, w^* = 0$$

This makes sense because correlation would indicate that the error contains remaining linearly predictable elements that could be reduced further by modifying $w$.
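This orthogonality property can likewise be verified on samples: with sample estimates of $P$ and $R$, the quantity $P - R w^*$ vanishes to machine precision, so the residual is uncorrelated with every input component (a sketch with illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
d = X @ np.array([0.5, 1.0, -1.5]) + 0.2 * rng.normal(size=1000)

P = X.T @ d / len(d)
R = X.T @ X / len(d)
w_star = np.linalg.solve(R, P)

e = d - X @ w_star           # residual error at the optimum
corr = X.T @ e / len(e)      # sample estimate of E[ex] = P - R w*
```

Any nonzero entry of `corr` would indicate a direction in input space along which the error could still be reduced linearly.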