Most treatments of PCA fall into two camps: those that define PCA on purely rigorous footing, explaining how the principal components are eigenvectors of the variance matrix and so on; and those that talk of PCA purely geometrically, mentioning nothing about how the components actually correspond to the eigenvectors. In this post, I reconcile these two approaches: explaining PCA geometrically while still maintaining the rigour of it all.
Spectral decomposition
Among the classes of diagonal, orthonormal, and symmetric matrices, perhaps the easiest to visualise is that of diagonal matrices. They simply represent scaling the $i$-th axis by the $i$-th diagonal element. Orthonormal matrices are not hard to visualise either. They represent the rotation which takes the $i$-th canonical basis vector (i.e., the vector with $1$ in the $i$-th position and $0$ elsewhere) to the $i$-th column vector of the said matrix (possibly with some reflection involved if the determinant is negative). Symmetric matrices don't seem to have such a nice visualisation (after all, symmetry seems wholly to be a property of the matrix itself and not the linear transformation it represents), but owing to the beauty of the Spectral theorem, one realises that symmetric (or more precisely, self-adjoint) matrices are just scaling transformations, but along different axes. We briefly recall the Spectral theorem: every symmetric matrix $A$ can be decomposed as $A = U \Lambda U^{\top}$, where $U$ is orthonormal and $\Lambda$ is diagonal.
The danger with this theorem is understanding the algebra behind it, but not taking a moment to reflect on what it means geometrically. First, notice that the columns of $U$ correspond to eigenvectors of $A$, and the scalars in $\Lambda$ to the corresponding eigenvalues (this can be seen by expressing the $i$-th column as $u_i = U e_i$, so that $A u_i = U \Lambda U^{\top} U e_i = U \Lambda e_i = \lambda_i U e_i = \lambda_i u_i$).
Now, note that $U$ is the rotation that takes the canonical axes to the axes corresponding to the eigenvectors. Hence, the inverse of $U$, namely $U^{\top}$, is the rotation that takes the axes corresponding to the eigenvectors to the canonical axes. So, the Spectral theorem is basically saying that we can decompose the transformation described by $A$ into: first, aligning its eigenvectors with the canonical axes so that they become the new axes; then, scaling with the corresponding eigenvalues along these (new) axes; and finally, rotating the eigenvectors back to how they were aligned.
This sequence of transformations is illustrated below for an example matrix.
This is what we start with: $e_i$ denotes the $i$-th standard basis vector, and $u_i$ the $i$-th eigenvector.
Then $U^{\top}$ aligns the eigenvectors to become the new axes.
Then, the scaling by $\Lambda$.
Finally, $U$ rotates the eigenvectors back to their original angle.
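To connect the picture with computation, here is a minimal Python/NumPy sketch tracing the same three steps; the symmetric matrix and the test vector are assumptions for illustration, not the ones in the figures.

```python
import numpy as np

# An illustrative symmetric (hence self-adjoint) matrix, assumed for this sketch.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Spectral decomposition: A = U @ diag(lmbda) @ U.T,
# with U orthonormal (columns are eigenvectors) and lmbda the eigenvalues.
lmbda, U = np.linalg.eigh(A)

x = np.array([1.0, -0.5])     # an arbitrary vector

y = U.T @ x                   # 1. align the eigenvectors with the canonical axes
y = lmbda * y                 # 2. scale along these (new) axes by the eigenvalues
y = U @ y                     # 3. rotate the eigenvectors back to their original angle

assert np.allclose(y, A @ x)  # same result as applying A directly
```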
Quadratic forms
Let $A$ be positive-definite. Then, expanding the equation $x^{\top} A x = c^2$, it can easily be seen that it corresponds to an equation of an ellipse (or rather, an ellipsoid), with axes given by the eigenvectors of $A$ scaled by $c$ times the reciprocal of the square root of the corresponding eigenvalue. A geometric argument in the same vein as the preceding section is more enlightening.
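For concreteness, in the simplest case where $A$ is already diagonal with (positive) diagonal entries $\lambda_1$ and $\lambda_2$, the expansion reads
$$x^{\top} A x = \lambda_1 x_1^2 + \lambda_2 x_2^2 = c^2 \quad\Longleftrightarrow\quad \frac{x_1^2}{\left(c/\sqrt{\lambda_1}\right)^2} + \frac{x_2^2}{\left(c/\sqrt{\lambda_2}\right)^2} = 1,$$
which is the standard form of an ellipse with semi-axes $c/\sqrt{\lambda_1}$ and $c/\sqrt{\lambda_2}$; the general case reduces to this by rotating into the eigenbasis, as argued next.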
Let $A = U \Lambda U^{\top}$ be a spectral decomposition of $A$. Since $A$ is positive-definite, all its eigenvalues are positive, and so $\Lambda$ has a square root $\Lambda^{1/2}$ and is also invertible. Now, the equation $x^{\top} A x = c^2$ becomes $\lVert \Lambda^{1/2} U^{\top} x \rVert^2 = c^2$. So, the solutions of the equation are all those vectors $x$ which, when rotated by $U^{\top}$ and then scaled by $\Lambda^{1/2}$, end up on the sphere of radius $c$. So, if we scale the points on the sphere of radius $c$ back down by $\Lambda^{-1/2}$, and then rotate them back by $U$, these points should end up in the locus of the equation, i.e.,
$$\{\, x : x^{\top} A x = c^2 \,\} = U \Lambda^{-1/2} \{\, y : \lVert y \rVert = c \,\}.$$
When the (inverse) scaling is applied, it should be intuitive that the sphere becomes an ellipse with the $i$-th axis given by the $i$-th canonical basis vector scaled by $c$ times the reciprocal of the square root of $\lambda_i$. When the rotation is applied, the axes now correspond to the $i$-th eigenvector, with the scaling just described.
The geometry of this is shown in the figure below for a particular matrix and radius. The red circle denotes the unit circle. The blue ellipse is the locus of the quadratic form, i.e., it is what is obtained after squishing the unit circle horizontally and stretching it threefold vertically, and then rotating it onto the eigenvectors. It is important to note that the major (minor) axis is given by the eigenvector with the smallest (largest) eigenvalue. Also, note that the reciprocals of the eigenvalues are the eigenvalues of the inverse.
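The same construction is easy to sketch in code; the positive-definite matrix and radius below are assumptions for illustration (not the ones in the figure). The sketch builds the ellipse by scaling the circle of radius $c$ and rotating it onto the eigenvectors, then checks that the resulting points satisfy the quadratic form.

```python
import numpy as np

# An illustrative positive-definite matrix and radius, assumed for this sketch.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
c = 1.0

lmbda, U = np.linalg.eigh(A)     # A = U @ diag(lmbda) @ U.T

# Points on the circle of radius c, as columns of a 2 x 200 array.
theta = np.linspace(0, 2 * np.pi, 200)
circle = c * np.vstack([np.cos(theta), np.sin(theta)])

# Scale the i-th axis by 1 / sqrt(lmbda_i), then rotate by U.
ellipse = U @ (circle / np.sqrt(lmbda)[:, None])

# Every point x on the ellipse satisfies x^T A x = c^2.
vals = np.einsum('ij,ik,kj->j', ellipse, A, ellipse)
assert np.allclose(vals, c**2)
```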
Mahalanobis distance
The probability density function of a multivariate normal vector with mean $\mu$ and variance $\Sigma$ is given by
$$f(x) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right).$$
Notice that the locus of constant density is given by the equation
$$(x - \mu)^{\top} \Sigma^{-1} (x - \mu) = c^2.$$
As discussed, these are equations of ellipsoids with axes given by the eigenvectors of $\Sigma$ and their lengths by $c$ times the square root of the corresponding eigenvalue; however, this time centred at $\mu$. The quantity $\sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$ is the Mahalanobis distance between $x$ and $\mu$, which is a statistical generalisation of the Euclidean distance in a sense that will be made clear.
The figure above plots the loci of constant density for four different values of $c$. Now, consider the two points marked in the figure. The latter is clearly farther away from the mean in the sense of Euclidean distance. However, statistically the former is farther away, in the sense that it is the less likely instance: the latter, despite its Euclidean distance, is situated on a locus of greater density. It'd be nice if we could transform the space so that the ellipses become circles, as in the figure below, and then measure the Euclidean distance between the points.
From the discussion of quadratic forms, this is given by the transformation $x \mapsto \Lambda^{-1/2} U^{\top} (x - \mu)$, where $\Sigma = U \Lambda U^{\top}$ is a spectral decomposition of $\Sigma$ (not of $\Sigma^{-1}$). So, the transformed Euclidean distance now becomes
$$\lVert \Lambda^{-1/2} U^{\top} (x - \mu) \rVert = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)},$$
which is just the Mahalanobis distance.
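A small sketch makes the equivalence concrete; the mean, variance, and test point below are assumed purely for illustration.

```python
import numpy as np

# Assumed mean and variance, for illustration only.
mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

lmbda, U = np.linalg.eigh(Sigma)   # Sigma = U @ diag(lmbda) @ U.T

x = np.array([2.0, -1.0])

# Whiten: rotate into the eigenbasis, then rescale each axis by 1 / sqrt(lmbda_i),
# so the ellipses of constant density become circles.
z = (U.T @ (x - mu)) / np.sqrt(lmbda)
dist_whitened = np.linalg.norm(z)

# Direct formula: sqrt((x - mu)^T Sigma^{-1} (x - mu)).
dist_direct = np.sqrt((x - mu) @ np.linalg.solve(Sigma, x - mu))

assert np.allclose(dist_whitened, dist_direct)
```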
Principal component analysis
Suppose we have a $d$-dimensional observation $x \sim \mathcal{N}(\mu, \Sigma)$ from a multivariate normal distribution, as pictured below.
We want to lower the dimension of $x$ from $d$ to one, as some weighted linear combination $w^{\top} x$ with $\lVert w \rVert = 1$. This linear combination can be understood as the coordinate of $x$ along $w$ according to some orthonormal basis in which one of the basis vectors is $w$. In general, we may want to collapse $x$ into some dimension $k$, $k < d$, in which case we write the resultant combination as $W^{\top} x$, where $W^{\top}$ is an orthonormal linear transformation from $\mathbb{R}^d$ to $\mathbb{R}^k$. The question is what sort of $W$ to choose.
In the case of $k = 1$, if we have multiple observations of unknown variance, then $w$ should be chosen to maximise the sample variance of $w^{\top} x$, since this would separate the observations the most, and thus lose as little information as possible. Or, in the case of known variance, we ought to choose $w$ as the one that maximises the variance of $w^{\top} x$, namely $w^{\top} \Sigma w$, since this would explain the distribution the most. Looking at the figure, one immediately realises that the variance is maximised at $w = u_1$: the eigenvector corresponding to the largest eigenvalue of $\Sigma$, since this is the axis along which $x$ varies the most; it's the widest part of the ellipse. And thus, $u_1^{\top} x$ is the first principal component of $x$.
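For completeness, the algebraic counterpart of this geometric observation is a one-line calculation: writing $\Sigma = U \Lambda U^{\top}$ and substituting $v = U^{\top} w$ (so that $\lVert v \rVert = \lVert w \rVert = 1$),
$$\operatorname{Var}(w^{\top} x) = w^{\top} \Sigma w = v^{\top} \Lambda v = \sum_i \lambda_i v_i^2 \le \lambda_{\max},$$
with equality when $v$ places all its weight on the largest eigenvalue, i.e., when $w$ is a corresponding eigenvector.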
For $k > 1$, we should choose the second principal component $u_2^{\top} x$ as the one that maximises the variance, same as before, but is wholly independent of $u_1^{\top} x$, since we want to minimise repetition of information as much as possible. Again, looking at the ellipse, one realises that $u_2$ should be chosen as the eigenvector corresponding to the second-largest eigenvalue. And we proceed so on and so forth to define all the principal components as $U^{\top} x$, where $U$ is the orthonormal matrix of eigenvectors of $\Sigma$, arranged in descending order of their eigenvalues.
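Putting everything together, here is a rough Python/NumPy sketch of PCA on sampled data; the population mean and variance are assumptions for illustration. It estimates the variance matrix, takes its spectral decomposition with eigenvalues in descending order, and projects the centred observations onto the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population parameters, for illustration only.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=1000)   # n x d observations

# Estimate the variance matrix from the sample.
S = np.cov(X, rowvar=False)

# Spectral decomposition of the sample variance; eigh returns eigenvalues
# in ascending order, so reorder to descending.
lmbda, U = np.linalg.eigh(S)
order = np.argsort(lmbda)[::-1]
lmbda, U = lmbda[order], U[:, order]

# Principal components: coordinates of the centred data in the eigenbasis.
Z = (X - X.mean(axis=0)) @ U

print(np.var(Z, axis=0, ddof=1))   # sample variances of the components
print(lmbda)                       # match the eigenvalues, in descending order
```

As expected, the sample variance of the $i$-th principal component equals the $i$-th largest eigenvalue, and the components are uncorrelated with one another.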