We’ve talked about this magical tool called PCA a couple of times that helps us reduce the dimensionality of our data. We’ve only talked about what it does, but not how it does so, and what the motivation behind it is. We’ll demystify it all in this blog post.
Let’s first talk a little linear algebra. PCA is based on linear algebra concepts, so we can’t shy away from it, unfortunately. In linear algebra, we talk about linear spaces, which are also called vector spaces. Each linear space has a dimension, but I will not go into the details of that here. If you are interested, however, Hoffman & Kunze’s “Linear Algebra” is an excellent reference. Dimensions are what you think they are intuitively, but I’ll give a simple example over here.
Let’s consider the two-dimensional linear space of real numbers. This is simply the xy-plane that you’re probably used to. Let me ask you a question–can you think of combination of vectors, or points, such that if you take their linear combination, you can form the entire 2D space? A linear combination of two vectors x and y is , for any choice of constants and . So basically, here’s what I’m asking. Can you find out a set of points, such that if I give you any random point, say (2, 3.5), you’ll be able to find some unique constants and so that the linear combination of your set of points gives me (2, 3.5)? One easy choice is a unit vector in the x and the y directions. So we could take (0, 1) and (1, 0). So if you gave me (2, 3.5), then , and you see the way we’ve done the multiplication and addition there. You should prove to yourself that given any point on the 2D space, you can easily find constants to multiply with this choice of vectors to get that point. We call such a set of points, a basis, and the basis is said to span the entire 2D linear space.
A basis for a linear space is a minimal set of vectors in the linear space whose span is the complete space itself.
This isn’t the only basis for the 2D linear space, but that’s not our point here. So each linear space has a basis. The number of vectors in the basis is called the dimension of the linear space. Notice how we have two vectors here, and that’s exactly the dimension of the space that we’ve chosen. So there’s math behind the word “dimension” that we so casually throw around (next time when we talk about the dimensionality of the data, think linear spaces in -dimensions!).
Alright, one last concept. A subspace, also called a manifold, is a subset of a linear space. For example, we could have a linear subspace for the 2D linear space by restricting the second component to zero. And if you think about it, now you’re really just stuck with real numbers since the second component isn’t very helpful–and real numbers form a 1D linear space. Another example would be to put the restriction that both the components must be equal. These two examples were taken from Wikipedia’s excellent article on subspaces.
Let’s now get back to PCA. PCA tries to find a linear subspace of the data that you currently have. So starting from a large set of attributes, you shrink it using PCA to a small set of attributes, which PCA calls the principal components. PCA assumes that these principal components are orthogonal to each other, and while this may not always be the case, in practice, PCA usually works reasonably well.
PCA transforms a set of vectors in a linear space to an orthogonal basis in a subspace.
An orthogonal basis is simply a basis where each vector is orthogonal to every other. But that’s not the only great thing about PCA. The principal components that it gives you are in descending order of the variance explained. What does this mean? It means that when you take a point in the original linear space and project it to the subspace, then along the first principal component, it will be spread the most as compared to the others. Here’s an example from Wikipedia:
This is the result of a PCA to two dimensions. The data in the original space has been projected to two dimensions. The two black arrows are the principal components. See how there’s more spread along one of these? That one is the first principal component, because maximum variance is explained in that direction. The second principal component, as you can see, explains a lot less variance. Another way to think of PCA is as a change of basis from one linear space to a subspace. The principal components that you get form a basis for the lower dimensional subspace.
Another benefit of PCA is that it gives us features that are uncorrelated (well, that’s the assumption, anyway). Why are independent features so desirable? Because independent features form a basis for the vector space they are in. So if you have say one extra feature that isn’t independent, then it isn’t really helping you, because you now have features, but the span of these features is only -dimensional.
Now that we know what PCA is and why it’s so useful, let’s delve into the details. This part is from Lindsay Smith’s tutorial on PCA.
Let’s generate some data.
from sklearn.datasets import make_regression
import numpy as np
import seaborn as sns
X, Y = make_regression(n_samples=500, n_features=2, n_informative=2, noise=10, effective_rank=1)
Our first step will be to subtract the means along each axis.
X -= np.mean(X, axis=0)
In our case, the data seems to have already been very nearly centered, so it won’t make too much of a difference. Next, let’s compute the covariance matrix of this centered data. The covariance matrix is a matrix that gives the pairwise covariances between every dimension. Since this data is 2 dimensional, the covariance matrix will be 2 x 2. We could easily do this via the library, but just to show you that this is actually really simple, we’ll do it on our own. The covariance matrix for a matrix is given by
The denominator is comes from statistics, and is called Bessel’s correction. The next step is to compute the eigenvectors and eigenvalues of the covariance matrix. For any matrix or linear transformation , the eigenvectors and eigenvalues are the vectors and corresponding scalars such that . We will resort to using libraries for this purpose. It is important that the eigenvalues that you get should be normalized. A lot of math packages do it by default, but you should know that this is happening behind the scenes.
C = np.dot(X.T, X) / 499
lambd, eig = np.linalg.eig(C)
Notice that we divided by 499, one less than the number of points (500). That’s it! The eigenvectors are the principal components. Now the problem, obviously, is that we have two eigenvectors, so we have two principal components…so we haven’t really done any of the advertised dimensionality reduction.
Here we can pick our features. We can either choose both the eigenvectors, or throw away the one that explains less variance. How do we know which eigenvector explains less variance? This is the one with the lesser eigenvalue corresponding to it. In my case, the first eigenvalue is smaller, so the second one explains more variance. Note that the eigenvectors returned by np.linalg.eig are along the columns, not the rows. Finally, let’s arrange our eigenvector matrix so that the first column actually has the eigenvector that explains the most variance (courtesy of this SO answer).
eig[:, 0], eig[:, 1] = eig[:, 1], eig[:, 0].copy()
Let’s plot the eigenvectors (principal components) along with the data. The code is a slightly modified version of this Stack Overflow answer.
mu = X.mean(axis=0)
projected_data = np.dot(X, eig)
sigma = projected_data.std(axis=0)
fig, ax = plt.subplots()
colors = ['blue', 'green']
for i, axis in enumerate(eig):
start, end = mu, mu + sigma[i] * axis
ax.annotate('', xy=end, xytext=start,
This code takes advantage of the way ax.annotate works. This function, when passed the arrowprops parameter, draws an arrow between the xy and the xytext parameters. These parameters, in this code, are the mean of the data (which is the origin since we centered it) and a point that’s along each eigenvector, whose length is proportional to the standard deviation along that axis.
And this is exactly what we’d expect. We can visually see that the first principal component (blue) is in the direction of the maximum variance, and the second principal component (green) is in the direction of the next maximum variance. I did this using the built-in PCA class in sklearn.decomposition, and got the exact same plot and principal components.
Finally, let’s discuss how to reduce the dimensionality. Suppose we want to reduce the dimensionality of our data to one. In this case, all we’d have to do is project the data along the first principal component (which is the first column of our eigenvector matrix). We can do this quite easily:
proj = np.dot(X, eig[:,0])
On comparing these values with the ones using the built-in PCA class’ transform method, all the values were the same.
This blog post talked about a widely used dimensionality reduction technique, principal components analysis, explained the math behind it, and showed a Python implementation for the technique. In practice, you will use the built-in method, but it’s nice to know exactly what it does and the mathematical basis for it.
You should note that the reduced features you get from PCA may not be orthogonal. This is an assumption made by PCA, but it’s not necessarily true. In fact, the reduced features may not even be independent, and independence is only guaranteed if the data has a Gaussian distribution. However, PCA does work surprisingly well in practice.
Dr. Saha in his notes discusses another algorithm, Singular Value Decomposition (SVD), and its relation to PCA. You can read them at this link. There’s also his notes on linear algebra, available at this link.