
Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

Appendix B: Principal Components Analysis

Overview

Suppose we have a set of zero-mean n-dimensional data vectors whose elements are correlated (figure B.1). If the data is not zero-mean, we can make it so by subtracting the average value beforehand. Because of the correlation, the data is partially redundant; knowledge of one variable gives us approximate information about the values of other variables.

It might be more natural to describe the data in terms of its variation along directions v1 and v2 (figure B.1). Most of the positional information is conveyed by the distance along v1, with the distance along v2 adding only a small correction. For highly correlated data, the contribution from v2 will be small compared to that of v1, so we might choose to ignore v2 completely. This introduces some error but gives us a more compact description. Perfectly correlated data would lie along a line (or on a hyperplane, in general), so the variation along v2 would be zero and we would lose nothing by describing the data entirely in terms of the distance along v1.

Correlation can arise because a system has internal variables v1, v2, ..., vn, the effects of which are indirectly observed through intermediate variables x1, x2, ..., xn. Each observable variable is potentially affected by each internal variable. The observed variables may depend linearly on the internal variables, for example

$$ x_i = \sum_{j} a_{ij} v_j, $$

where the coefficient a_ij measures the effect of internal variable v_j on observed variable x_i.

By observing correlations in the data, we may be able to identify the internal variables that control the system behavior. Knowing the correlations, we can then measure several observable variables and get good estimates of the internal state.
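As a rough illustration of this setup (the book gives no code; the mixing matrix, noise level, dimensions, and random seed below are invented for the example), a short numpy sketch can generate zero-mean observations whose correlations come from a smaller set of internal variables through a linear map:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n_internal, n_observed = 1000, 2, 5           # sample count and dimensions (invented)
    A = rng.normal(size=(n_observed, n_internal))    # hypothetical mixing matrix
    V_int = rng.normal(size=(m, n_internal))         # unobserved internal variables
    X = V_int @ A.T + 0.05 * rng.normal(size=(m, n_observed))  # x depends linearly on the internal variables

    X = X - X.mean(axis=0)          # subtract the mean so the data is zero-mean
    R = X.T @ X / m                 # sample covariance matrix, an estimate of E[x x^T]
    print(np.round(R, 3))           # large off-diagonal entries reveal the correlations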

The purpose of principal components analysis (PCA) is to identify the important directions of variation of a data set. Singular value decomposition and the Karhunen-Loève transform have similar goals and are closely related techniques. The result may be a more natural coordinate system better aligned with the major axes of variation of the data. Sometimes these axes will correspond to natural features of the data.

Consider an arbitrary unit-length vector v. The projection xv of a point x onto v is the point on v that is closest to x

$$ x_v = (x^T v)\, v \tag{B.1} $$

This is a vector with magnitude (x^T v) extended along the unit vector v. Note that the residual component, ε = x - (x^T v) v, is orthogonal to v; that is, ε^T v = x^T v - (x^T v) v^T v = x^T v - (x^T v) = 0.
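A minimal numpy sketch of this projection (the vectors and random seed are arbitrary, chosen only for illustration) confirms numerically that the residual is orthogonal to v:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=4)                 # an arbitrary data vector
    v = rng.normal(size=4)
    v = v / np.linalg.norm(v)              # unit-length direction vector

    x_v = (x @ v) * v                      # projection x_v = (x^T v) v, as in (B.1)
    eps = x - x_v                          # residual component epsilon
    print(float(eps @ v))                  # approximately 0: the residual is orthogonal to v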

Obviously, the projected magnitude depends on the orientation of v. Given a set of vectors {xi}, we can search for a unit vector v that maximizes the mean squared value of this projected distance. By definition, this yields the first principal component of the data set. That is, the first principal component of a set of zero-mean vectors {xi}, i = 1, ..., m, E[x] = 0, is the vector v1, which maximizes the variance of the projected magnitudes

Figure B.1: Correlated data are partially redundant because knowledge of one variable gives approximate knowledge of the other variables. In two dimensions, perfectly correlated data lie on a straight line; in n dimensions, they lie on a lower dimensional subspace.
$$ v_1 = \arg\max_{\|v\|=1} \frac{1}{m} \sum_{i=1}^{m} \left( x_i^T v \right)^2 \tag{B.2} $$
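The following sketch, using synthetic correlated data invented for the example, checks this definition numerically: the leading eigenvector of the sample covariance matrix (the connection to eigenvectors is derived later in this appendix) achieves at least the projected variance of any of a large number of random unit directions:

    import numpy as np

    rng = np.random.default_rng(2)
    mix = np.array([[2.0, 0.5, 0.1],
                    [0.5, 1.0, 0.2],
                    [0.1, 0.2, 0.3]])          # invented mixing matrix producing correlated data
    X = rng.normal(size=(500, 3)) @ mix
    X = X - X.mean(axis=0)                     # zero-mean data
    R = X.T @ X / len(X)                       # sample covariance matrix

    evals, evecs = np.linalg.eigh(R)
    v1 = evecs[:, np.argmax(evals)]            # candidate first principal component

    def projected_variance(v):
        return np.mean((X @ v) ** 2)           # mean squared projected magnitude, as in (B.2)

    dirs = rng.normal(size=(1000, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit vectors
    print(projected_variance(v1) >= max(projected_variance(d) for d in dirs))   # True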

After finding the first principal component, subtract the projection along that direction to get an (n - 1)-dimensional data set lying in a subspace orthogonal to v1. Then search for a variance-maximizing vector in this reduced space to obtain the second principal component, v2. Further repetition yields the remaining components, v3, v4, and so on. Because the vectors vi are unit length and orthogonal, they form an orthonormal set

$$ v_i^T v_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \tag{B.3} $$
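A sketch of this deflation procedure, with an invented data set and a helper name (principal_components_by_deflation) chosen here for illustration, shows that the directions it produces are orthonormal:

    import numpy as np

    def principal_components_by_deflation(X, k):
        """Illustrative deflation: find the top k directions one at a time."""
        X = X - X.mean(axis=0)                       # ensure zero mean
        residual = X.copy()
        components = []
        for _ in range(k):
            R = residual.T @ residual / len(residual)
            evals, evecs = np.linalg.eigh(R)
            v = evecs[:, np.argmax(evals)]           # variance-maximizing direction in the residual space
            components.append(v)
            residual = residual - np.outer(residual @ v, v)   # subtract the projection along v
        return np.array(components)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4))   # invented correlated data
    V = principal_components_by_deflation(X, 3)
    print(np.round(V @ V.T, 6))                  # approximately the identity: the set is orthonormal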

The vectors vi can be calculated as follows. Let y = x^T v be the projected distance of x along v. Since E[x] = 0, E[y] = 0. The variance of y is then

$$ E[y^2] = E\big[(x^T v)^2\big] = E\big[v^T x x^T v\big] = v^T E\big[x x^T\big] v = v^T R v \tag{B.4} $$

where R = E[x x^T] is the covariance matrix of the vectors x. R is an n × n symmetric matrix, so its eigenvalues are real. Because it is a covariance matrix, its eigenvalues are nonnegative. To maximize the variance subject to the condition that ||v|| = 1, we can use the cost function

$$ J = v^T R v - \mu \left( v^T v - 1 \right) \tag{B.5} $$

where μ is a Lagrange multiplier. Taking the derivative with respect to v and setting it equal to zero gives

$$ \frac{\partial J}{\partial v} = 2 R v - 2 \mu v = 0 \tag{B.6} $$
$$ (R - \mu I)\, v = 0 \tag{B.7} $$

This is an eigenvalue problem. For a nonnull solution to exist, μ must be chosen to satisfy the characteristic equation

$$ \det(R - \mu I) = 0 \tag{B.8} $$

That is, μ must be an eigenvalue of R and v must be the corresponding eigenvector. Taking λ as the eigenvalue and substituting into (B.4), we have

$$ E[y^2] = v^T R v = \lambda\, v^T v = \lambda \tag{B.9} $$

In summary, the direction vector that maximizes the variance of the projection is given by the principal eigenvector v1 of R, and the corresponding eigenvalue λ1 measures the variance of the projection along that direction. Similarly, eigenvector v2 maximizes the variance of the projection in the residual space orthogonal to v1 and so on. Assuming the eigenvalues are distinct, we can number them in order of decreasing magnitude

$$ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0 \tag{B.10} $$
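In practice the eigenvalues come out of a library routine in some fixed order and must be reindexed; for example, numpy's eigh returns them in ascending order, so a sketch like the following (with invented data) sorts them into the decreasing order assumed here:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(600, 5)) @ rng.normal(size=(5, 5))   # invented data
    X = X - X.mean(axis=0)
    R = X.T @ X / len(X)

    evals, evecs = np.linalg.eigh(R)       # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]        # reindex so that lambda_1 >= lambda_2 >= ... >= lambda_n
    evals, evecs = evals[order], evecs[:, order]
    print(np.round(evals, 4))              # nonnegative and nonincreasing, as in (B.10)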

Assuming R has full rank, its eigenvectors form an alternate coordinate system with coordinate vectors numbered in order of their importance in explaining the variation of the data set. A vector x can be expressed as the sum of its projections along the orthogonal components vi

$$ x = \sum_{i=1}^{n} \alpha_i v_i \tag{B.11} $$

where αi = x^T vi is the projection of x onto the ith coordinate vector. The vector α = (α1, α2, ..., αn) is the representation of x in the new coordinate system. If V denotes the n × n matrix containing eigenvector vi in column i, then in matrix notation

$$ \alpha = V^T x, \qquad x = V \alpha \tag{B.12} $$

The eigenvectors are orthonormal so V is unitary

$$ V^T V = V V^T = I \tag{B.13} $$

Note that ||x||^2 = x^T x = α^T V^T V α = α^T α = ||α||^2.
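A short sketch with invented data illustrates the change of coordinates and the preservation of the norm:

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # invented data
    X = X - X.mean(axis=0)
    evals, V = np.linalg.eigh(X.T @ X / len(X))    # columns of V are the orthonormal eigenvectors

    x = X[0]                                       # any data vector
    alpha = V.T @ x                                # coordinates in the new system, alpha = V^T x
    print(np.allclose(x, V @ alpha))               # True: x = V alpha, as in (B.12)
    print(np.isclose(x @ x, alpha @ alpha))        # True: ||x||^2 = ||alpha||^2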

As noted, the principal component vectors of a data set form an orthogonal coordinate system with coordinate vectors numbered in order of their importance in explaining the variation of the data set. If R has rank r < n, it is singular and n - r of its eigenvalues are zero. Some elements of x are exactly predicted by linear combinations of other elements, so the data lies on an r-dimensional linear subspace embedded in the n-dimensional space. This provides an opportunity for data compression because the data can be described by fewer numbers in the new coordinate system. There is no variation along n - r dimensions, so we could omit those elements in the representation and obtain a more compact description without loss of information. Even if R has full rank, some of its eigenvalues may be small, in which case the data has little variation along the corresponding dimensions. Element i contributes αivi to the position information. The mean squared error introduced by omitting element i is E[αi^2] = λi. Obviously, if we omit any component, it should be the one with the smallest eigenvalue since this incurs the smallest error. Likewise, if we omit m components, they should be the elements with the m smallest eigenvalues. In general, the rank r linear projection of a data set, r < n, with the lowest mean squared error is the projection onto the first r principal components of the data.
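The following sketch (again with invented data and an arbitrary choice of r) illustrates the compression argument: projecting onto the first r principal components and reconstructing gives a mean squared error equal to the sum of the discarded eigenvalues:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))   # invented data
    X = X - X.mean(axis=0)
    R = X.T @ X / len(X)

    evals, V = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]
    evals, V = evals[order], V[:, order]           # lambda_1 >= ... >= lambda_n

    r = 3                                          # keep only the first r principal components
    X_hat = (X @ V[:, :r]) @ V[:, :r].T            # rank-r projection of the data
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(np.isclose(mse, evals[r:].sum()))        # True: the error is the sum of the discarded eigenvalues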

In practice, measurement noise and numerical errors complicate the process of calculating the eigenvectors. Some of the estimated eigenvalues may be very small but not identically zero so some judgment is required to decide if they should be set to zero or not. Numerical analysis texts suggest singular value decomposition as a preferable method for obtaining the projection directions since formation of the covariance matrix tends to square the numerical errors.
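A sketch of the SVD route (invented data; numpy's svd applied to the centered data matrix rather than to the covariance matrix) recovers the same variances, since the squared singular values divided by the sample count equal the eigenvalues of R:

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # invented data
    X = X - X.mean(axis=0)                         # center the data first

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # SVD of the data matrix itself
    variances = s ** 2 / len(X)                    # rows of Vt are the principal directions

    evals = np.sort(np.linalg.eigvalsh(X.T @ X / len(X)))[::-1]
    print(np.allclose(variances, evals))           # True: the SVD path gives the same variances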

As a side note, figure B.2 illustrates the effect of not removing the mean vector. For the zero-mean data plotted in (a), the eigenvectors of the data correlation matrix accurately reflect the axes of data variation and the eigenvalues (0.9595 and 0.0417) estimate the variance along those directions. In (b), the same data is offset by m = (1, 2); the resulting eigenvectors are rotated and the eigenvalues (5.7016 and 0.2968) are larger. In this case, the first eigenvalue is dominated by the squared length of the offset vector, ||m||^2 = 5. As noted in appendix A, the maximum stable learning rate for gradient descent is inversely proportional to the maximum eigenvalue of the data correlation matrix, so smaller learning rates must be used to avoid stability problems and learning may take longer if the mean is not removed.

Figure B.2: Effect of nonzero-mean on the eigenvectors of the correlation matrix. (a) For zero-mean data, the eigenvectors of the correlation matrix indicate the main axes of data variation and the eigenvalues reflect the variance along each dimension. (b) For nonzero-mean data, the eigenvectors are rotated and the eigenvalues are influenced by the length of the offset vector.
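Since the data set behind figure B.2 is not given in the text, the following sketch uses invented two-dimensional data to illustrate the same effect: offsetting the data by m = (1, 2) inflates the largest eigenvalue of the correlation matrix by roughly ||m||^2 = 5.

    import numpy as np

    rng = np.random.default_rng(8)
    X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.3],
                                               [0.0, 0.2]])   # invented two-dimensional data
    X = X - X.mean(axis=0)                      # zero-mean data, analogous to panel (a)
    m = np.array([1.0, 2.0])                    # offset vector used in the text
    X_off = X + m                               # nonzero-mean data, analogous to panel (b)

    eig_zero = np.sort(np.linalg.eigvalsh(X.T @ X / len(X)))[::-1]
    eig_off = np.sort(np.linalg.eigvalsh(X_off.T @ X_off / len(X_off)))[::-1]
    print(eig_zero)                             # variances along the true axes of variation
    print(eig_off)                              # largest eigenvalue inflated by roughly ||m||^2 = 5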