
Appendix B - Principal Components Analysis

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

B.2 Discriminant Analysis Projections

Although principal components analysis is useful for data reduction, it does not always produce good directions for discriminating between output classes. As an unsupervised method, it sees only the input vectors and is blind to classification information. The problem is that large variations in the data do not always correspond to useful information; the variation could be due to noise or irrelevant signals from other processes. Ideally we would like to remove these sorts of irrelevant variations during preprocessing, but this is not always possible. Figure B.4 illustrates the problem. In figure B.4a, the main contribution to the variance of the input data comes from the separation between the class means, so the directions found by PCA will be useful for discriminating between classes. In figure B.4b, however, the major axis of variation is along direction v1 but the classes are separated along a minor direction v2. If v2 were removed for data compression, a system presented with the reduced data would not be able to separate the classes based only on the information in v1.
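To make the situation in figure B.4b concrete, the following sketch (assuming NumPy; the data and variable names are illustrative, not taken from the text) draws two Gaussian classes whose means differ only along a low-variance direction and checks that the first principal component lines up with the high-variance, nondiscriminating axis.

```python
# A minimal sketch of the figure B.4b scenario: the class separation lies
# along a minor axis of variation, so PCA's first direction ignores it.
import numpy as np

rng = np.random.default_rng(0)

# Large variance along x1 (irrelevant), small variance along x2 (discriminating).
cov = np.diag([9.0, 0.25])
class_a = rng.multivariate_normal(mean=[0.0, +1.0], cov=cov, size=200)
class_b = rng.multivariate_normal(mean=[0.0, -1.0], cov=cov, size=200)
data = np.vstack([class_a, class_b])

# Principal components: eigenvectors of the overall covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
v1 = eigvecs[:, np.argmax(eigvals)]   # direction of maximum variance

print("first principal component:", v1)   # roughly [1, 0]: aligned with the
                                          # noise axis, not the class separation
```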

Linear discriminant analysis (LDA) [130], for example, provides a way to reduce dimensionality in a supervised learning context. It has dimensionality reduction properties like principal components analysis, but also accounts for class information in forming the projection. Given a set of Gaussian clusters corresponding to different target classes, the goal is to find a lower dimensional linear projection that maximizes the separation between class means and minimizes the spread of each cluster. Ideally, this minimizes overlap of the clusters in the projection and allows for unambiguous classification.

Figure B.4: Although principal components analysis sometimes produces directions useful for discriminating between classes, this is not always true. In (a) the main contribution to the variance of the input data comes from the separation between class means so the principal component directions are useful for discriminating between the classes. In (b) however, the major axis of variation is along direction v1 but the classes are separated along the minor direction v2.

Following [130: chapter 10], suppose the data consists of m points x_i, i = 1, ..., m, grouped into K clusters which correspond to classes. Let m_k, k = 1, ..., K, be the mean vector of cluster k and m_o be the overall mean vector. The within-class scatter matrix W measures the covariance of the data points around the mean of their respective classes

W = \sum_{k=1}^{K} P_k \, E\left[ (x - m_k)(x - m_k)^T \mid \text{class } k \right]        (B.14)

where P_k is the probability that a randomly selected point belongs to class k. The between-class scatter matrix B measures the covariance of the class means around the overall mean

B = \sum_{k=1}^{K} P_k (m_k - m_o)(m_k - m_o)^T        (B.15)

where m_o = E[x] = \sum_{k=1}^{K} P_k m_k. The mixture scatter matrix is the overall covariance matrix of all points, regardless of their class

M = E\left[ (x - m_o)(x - m_o)^T \right] = W + B        (B.16)

These matrices are chosen to be invariant to coordinate shifts.
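As a rough illustration of equations B.14 to B.16, the sketch below (NumPy assumed; scatter_matrices is a hypothetical helper name, not from the text) estimates the class probabilities P_k from class frequencies and accumulates the three scatter matrices.

```python
# A sketch of the within-class (W), between-class (B), and mixture (M)
# scatter matrices, assuming integer class labels 0..K-1.
import numpy as np

def scatter_matrices(x, labels):
    """Return W, B, M as in equations B.14-B.16, with P_k estimated from the sample."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n, d = x.shape
    m_o = x.mean(axis=0)                       # overall mean vector m_o

    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for k in np.unique(labels):
        xk = x[labels == k]
        p_k = len(xk) / n                      # estimate of P_k
        m_k = xk.mean(axis=0)                  # class mean m_k
        W += p_k * np.cov(xk, rowvar=False, bias=True)     # eq. B.14
        B += p_k * np.outer(m_k - m_o, m_k - m_o)          # eq. B.15

    M = np.cov(x, rowvar=False, bias=True)     # eq. B.16; note M = W + B
    return W, B, M
```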

We would like to find a projection that maximizes the separation between class means and minimizes the sizes of the projected clusters. The ideal projection would yield small, widely separated clusters with no overlap. A number of criteria can be used to measure cluster separability for optimization purposes. These include [130]

J_1 = \mathrm{tr}\left( S_2^{-1} S_1 \right)

among others, where S_1 and S_2 are one of B, W, or M. Possible combinations for {S_1, S_2} include {B, W}, {B, M}, and {W, M}. Remarks on these choices can be found in [130].
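A small follow-on sketch of the J_1 criterion for the choice S_1 = B, S_2 = W, reusing the hypothetical scatter_matrices helper defined above.

```python
# Cluster separability criterion J_1 = tr(W^{-1} B) for S_1 = B, S_2 = W.
import numpy as np

def j1_criterion(x, labels):
    W, B, _ = scatter_matrices(x, labels)
    # Solve W Z = B rather than forming an explicit inverse of W.
    return np.trace(np.linalg.solve(W, B))
```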

Consider the matrix W^{-1}B obtained by choosing S_1 = B and S_2 = W. B and W are covariance matrices describing the variation between cluster means and within clusters, respectively. The principal eigenvectors of B maximize the spread of the projected class means. Similarly, the principal eigenvectors of W^{-1} minimize the average size of the projected clusters (because 1/λ is an eigenvalue of W^{-1} if λ is an eigenvalue of W). Intuitively at least, the matrix W^{-1}B seems like a reasonable compromise to accomplish both purposes.

It turns out [130] that the linear projection maximizing J_1 consists of projection onto the principal eigenvectors of S_2^{-1}S_1. Although S_2^{-1}S_1 is not necessarily symmetric, it is the product of matrices with real nonnegative eigenvalues, so its eigenvalues will be real and nonnegative and its eigenvectors orthogonal. A p-dimensional projection is obtained by extracting the p principal eigenvectors of the matrix. B has rank at most K - 1, where K is the number of clusters, because it is the covariance matrix of K class means, so the dimension of the projection is limited by the number of clusters.
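The projection step itself might be sketched as follows (same assumptions as the earlier snippets, with S_1 = B and S_2 = W): take the p principal eigenvectors of W^{-1}B, with p limited to at most K - 1.

```python
# Discriminant analysis projection onto the p principal eigenvectors of W^{-1}B.
import numpy as np

def discriminant_projection(x, labels, p):
    x = np.asarray(x, dtype=float)
    W, B, _ = scatter_matrices(x, labels)
    K = len(np.unique(labels))
    assert p <= K - 1, "rank of B limits the projection to at most K - 1 dimensions"

    # Eigen-decomposition of W^{-1}B; the useful eigenvalues are real and nonnegative.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]     # sort by decreasing eigenvalue
    V = eigvecs[:, order[:p]].real             # p principal eigenvectors
    return x @ V                               # projected data, shape (n, p)
```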

Figure B.5: The projection vectors found by discriminant analysis (MDA) and principal components analysis (PCA) for a simple data set.
Figure B.6: Projection histograms for the MDA and PCA directions shown in figure B.5: (a) the principal components projection shows cluster overlap, and (b) the discriminant analysis projection separates the clusters well.
Figure B.7: A single-hidden-layer linear network trained to perform classification with a 1-of-N target representation implements a form of discriminant analysis [385, 134].

Figure B.5 illustrates the difference between the directions found by discriminant analysis and principal components analysis on a simple problem. Figure B.6 shows histograms of the resulting projections. In this example, the discriminant analysis projection separates the clusters well and the principal components projection shows cluster overlap.
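A toy comparison in the same spirit as figures B.5 and B.6 (this is not the data set used for the figures) can be run with the hypothetical helpers sketched earlier: the discriminant direction separates the projected class means while the principal component direction does not.

```python
# Compare the PCA direction with the discriminant direction on two elongated clusters.
import numpy as np

rng = np.random.default_rng(1)
cov = np.diag([9.0, 0.25])
x = np.vstack([rng.multivariate_normal([0.0, +1.0], cov, 200),
               rng.multivariate_normal([0.0, -1.0], cov, 200)])
labels = np.repeat([0, 1], 200)

# PCA direction: principal eigenvector of the mixture covariance.
eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
pca_proj = x @ eigvecs[:, np.argmax(eigvals)]

# Discriminant direction: principal eigenvector of W^{-1}B (K = 2, so p = 1).
lda_proj = discriminant_projection(x, labels, p=1).ravel()

# Histograms of pca_proj overlap heavily; histograms of lda_proj are well
# separated, mirroring the behavior shown in figure B.6.
for name, proj in [("PCA", pca_proj), ("LDA", lda_proj)]:
    gap = abs(proj[labels == 0].mean() - proj[labels == 1].mean())
    print(name, "separation of projected class means:", round(gap, 2))
```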

Remarks

A basic assumption in this analysis is that the clusters are roundish blobs described entirely by their mean and covariance structure, that is, Gaussian clouds. Optimal projections will not necessarily be found for more complex shapes. Sometimes a class consists of several distinct clusters; the compound cluster does not have the required structure, so nonoptimal projections will probably result. One remedy is to present the algorithm with cluster labels instead of class labels, but this, of course, requires knowledge of the cluster structure.

Like PCA, discriminant analysis can be used to reduce the dimensionality of the data presented to a network. This reduces network size and may speed up training significantly if necessary information is not lost. Discriminant analysis is used for this purpose in [309].

Like PCA, discriminant analysis is a linear projection method so it can fail where nonlinear transformations are necessary. Neither will be able to separate classes that are not linearly separable, for example, but both techniques remain useful for neural network design in cases where the linear projection makes sense. Although discriminant analysis is a little more complex, it makes use of the class information and therefore gives better separation than principal components in certain cases. When the data have linear dependencies, both techniques can be useful for preprocessing.

Just as the linear autoencoder implements a form of PCA, it has been shown [385], [134] that a single-hidden-layer linear network trained to perform classification with a 1-of-N target representation implements a form of discriminant analysis (figure B.7).
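The following is a minimal sketch of that setup (NumPy assumed; data x and labels as in the earlier sketches). Because a purely linear hidden layer composes with the output layer into a single linear map, the end-to-end function can be fit directly by least squares. The sketch only sets up the linear map trained on 1-of-N targets; the equivalence with discriminant analysis is the result shown in [385] and [134], not something the code demonstrates.

```python
# A linear network trained by least squares on 1-of-N target vectors.
import numpy as np

def linear_one_of_n(x, labels):
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = x.shape[0]
    K = len(np.unique(labels))
    targets = np.eye(K)[labels]                 # 1-of-N target representation
    x1 = np.hstack([x, np.ones((n, 1))])        # append a bias input
    weights, *_ = np.linalg.lstsq(x1, targets, rcond=None)
    return x1 @ weights                         # network outputs, shape (n, K)
```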