Consider a network with n inputs, h < n linear hidden units, and n linear output units (figure B.3). What happens if we train the network to reproduce the input vector at its output? Given an input, the goal is to reproduce it at the output so the network acts as autoencoder, mapping an input pattern to itself.
This may seem pointless but note that h < n so the hidden layer acts as a bottleneck that forces the network to form a compressed representation of the data. The hidden layer activities are a linear function of the inputs but the hidden layer is smaller than the input dimension so some information must necessarily be lost, in general. The best hidden layer representation will be one that preserves as much information about the input as possible. Ideally, it will ignore nonessential noise and reproduce only the most significant features of the input pattern.
Bourlard and Kamp  showed that the optimal (in a minimum mean squared error sense) hidden unit weights are determined by a set of vectors spanning the singular value decomposition of the input data. That is, the ideal representation formed at the hidden layer spans the same space as the h eigenvectors corresponding to the h largest eigenvalues of the covariance matrix of the training data. The network is linear so it can be collapsed into a single linear transformation y = Fx. The rank of F is limited in the preceding by h, the dimension of the hidden layer. From the principal components discussion, we know that the best rank h transformation is the projection onto the first i principal component directions. It turns out that linear hidden nodes are optimal in this case; nonlinear nodes cannot improve the approximation and only cause problems by introducing local minima that may confuse gradient descent optimizers. (If A is the hidden-to-output weight matrix and h(x) is the vector of hidden unit activities, the function to be minimized is <||x-Ah||2>. The output is a linear function of h so the optimal h(x) is a linear transform of the inputs x.)
Baldi and Hornik  showed that the error function has a unique minimum at this solution. That is, for a linear network and a quadratic error function, the overall transformation F determined by the orthogonal projection onto the space spanned by the first h eigenvectors of R is a unique local and global minimum. Saddle points occur for solutions spanning other combinations of h or fewer eigenvectors of R. In principle, this means the solution can be found by gradient descent methods such as back-propagation although conventional linear algebraic methods are generally more efficient.
In general, the solution obtained by training from random initial weights will not be identical to the principal components decomposition because it is only necessary that the hidden unit activities span the same space as the first h principal components. Let B and A be the optimal weight matrices determined by singular value decomposition. B is an h x n matrix of input-to-hidden weights and A is an n x h matrix of hidden-to-output weights. The hidden layer computes h = Bx and the output computes y = Ah = ABx. Equivalent results can be obtained by the weights B' = CB and A' = AC-1 where C is any invertible h x h matrix. In general, C will depend on the random initial weights. The minimum identified by Baldi and Hornik  is unique in terms of the overall function F, which can be achieved by many different combinations of weight matrices A' and B'.
Because C can be a rotation matrix, there will not be a neat correspondence between the hidden units and the principal components projections. Principal components analysis separates the data into orthogonal components ranked in order of importance. Deletion of the first component will cause a larger error than deletion of the second and so forth. In a linear network trained by gradient descent from random initial weights, the contribution from each hidden unit tends to be more nearly equal. The autoencoder extracts the first h principal components but the matrix C typically spreads their functions approximately equally across the h hidden units . The activities of the hidden units will not necessarily be uncorrelated. This may favor fault tolerance, but it is not necessarily helpful in data compression applications where it is useful to be able to pick components serially in terms of their importance. With PCA, we can do one decomposition and inspect the eigenvalues to see how many components are needed to achieve a desired approximation error. In a trained autoencoder, the functions are mixed among the h hidden units, so this is not possible. To find an approximation spanning the first h -1 components, a completely new network with h-1 hidden units must be trained from scratch.
Many papers have been written on the links between neural networks and principal components analysis. An entire book on subject is . Oja and others , , ,  have investigated Hebbian learning rules that extract principal components. Biological feasibility and local computation are interesting features of these algorithms. Many are on-line methods requiring no data storage. (However, on-line versions of conventional algorithms also exist.) Others have investigated data compression applications, for example, , . More general results for linear networks, including but not limited to PCA, are surveyed in .
It is worth noting that, although neural networks are often very nonlinear, analysis of linear networks is informative because networks initialized with small weights tend to compute approximately linear functions in the early stages of training and only become significantly nonlinear after the weights grow to larger values. The initial training dynamics are often dominated by approximately linear interactions.
It should also be noted that a bottleneck structure does not automatically imply that the network must implement a principal components solution. The PCA solution is optimal only for linear networks or single-hidden-layer networks with linear outputs. With several nonlinear layers before and after the bottleneck, the bottleneck units are not linear functions of the input and cannot be interpreted in terms of principal components. In theory, anything could be transmitted through a bottleneck preceded by a sufficiently complex encoder and followed by a corresponding decoder. There are practical limits to this, of course.