PCA dimension reduction and classification don’t play well together

Updated: May 17, 2019

Sometimes combining two things, ideas, devices, code-bases, etc., provides a tangible benefit.  A conglomerate product can save you time, or make your life more enjoyable or convenient.  Combining “Shampoo and Conditioner” into a single product is a classic example - it saves you time in the shower.

A couch that rolls out into a bed - another classic.  Combined washer and dryer - ditto.

But sometimes a 2-in-1er isn’t more convenient, and is actually a net negative. Take, for example, the smartphone-case / fidget spinner combo.  Not sure why this exists, but it does.

Another example - I had a great idea for a 2-in-1 product the other day: the toilet brush / plunger combo. At first I thought it seemed like a good idea - you need only one tool for your commode, instead of two. But then I thought a bit about how I'd have to use it, and made a quick mockup of the device. I quickly realized it was a no-go disaster of an idea.

The 2-in-1 toilet brush / plunger. Convenience or terror?

My Frankenstein creation reminds me of an equally freakish combo I often see people who work in data science use (although they shouldn't): Principal Component Analysis (PCA) based dimension reduction + classification.   

You see people using this combo all the time out ‘in the wild’ for dealing with high dimensional data (e.g., image or genetic data) that they need to classify.  At first it makes sense: both PCA dimension reduction and classification are useful tools on their own, so why not combine them?  

The reason is that - when you think about it a bit - the idea can fail as easily as my toilet plunger / brush combo.  It can actually ruin your ability to learn an accurate classifier with high dimensional data.

A simple example illustrates the point. The animation below shows a two-class toy dataset - one that is easily separable by a linear decision boundary in two dimensions. It's an easy dataset to deal with. But if we cut the dimension of the data in half, by first applying PCA to reduce the input dimension from 2 to 1, the wheels fall off: the data is no longer so easily separable.

PCA dimension reduction can jumble up classification data, making it more difficult to classify correctly.

First the one-dimensional subspace provided by the top principal component of the data (solid black) is shown. Then we project the data onto that subspace - and doing so jumbles up the two classes. You can play with the Jupyter notebook that generated these animations here.

In other words, in this instance, by applying PCA dimension reduction first I made a simple classification problem way more difficult. Once its dimension is reduced via PCA, the data is no longer separable by a linear decision boundary (which in one dimension is just a single cut point on the line).  This won’t always happen - but as this simple example illustrates, it's certainly possible.
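For readers who want to see this effect in numbers, here is a minimal NumPy sketch. It is not the notebook behind the animation: the toy data (two Gaussian blobs that are wide along x and differ only along y), the sample size, and the helper `best_threshold_accuracy` are all assumptions made up for illustration. It computes PCA via the SVD of the centered data and compares how well a single linear cut separates the classes before and after projecting onto the top principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # points per class (assumed)

# Assumed toy data: both classes spread widely along x and differ only along y.
X0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.2], size=(n, 2))
X1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.2], size=(n, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])


def best_threshold_accuracy(z, labels):
    """Best training accuracy of any single cut point on 1-D data z."""
    labels_sorted = labels[np.argsort(z)]
    # ones_left[k] = number of class-1 points among the k smallest values of z
    ones_left = np.concatenate([[0.0], np.cumsum(labels_sorted)])
    zeros_left = np.arange(len(labels) + 1) - ones_left
    # Accuracy when predicting class 0 left of the cut and class 1 right of it.
    acc = (zeros_left + labels.sum() - ones_left) / len(labels)
    # Allow the flipped labelling as well.
    return float(max(acc.max(), (1.0 - acc).max()))


# PCA via the SVD of the centered data; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z_pca = Xc @ Vt[0]  # 1-D coordinates along the top principal component

acc_2d = best_threshold_accuracy(X[:, 1], y)  # a horizontal line in 2-D
acc_pca = best_threshold_accuracy(z_pca, y)   # a cut point after PCA to 1-D
print(f"best horizontal line in 2-D: {acc_2d:.2f}")
print(f"best cut after PCA to 1-D:   {acc_pca:.2f}")
```

Because PCA only looks at variance and never at the labels, it picks the wide x direction here, which is exactly the direction that carries no class information in this made-up dataset.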

If we wanted to cut the dimension of our data in half before classifying, we would have actually been better off projecting the data onto a random subspace.  This is animated below using some random line rather than the one provided by PCA.

The PCA projection can be inferior to a random one, in terms of preserving separation in classification data.

Even this random projection provides better separation on the lower-dimensional version of the data. A random projection won’t always do the trick - sometimes we’ll find one that mixes up the two classes like the PCA subspace does (as shown below).  

A random projection doesn't always do the trick.

Nonetheless, at least sometimes we get a better result.  There are actually some interesting theoretical results that show the value of such random projections mathematically (see e.g., this review paper).
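The PCA-versus-random comparison can also be sketched numerically. The snippet below is only an illustration under assumptions: the toy data is the same kind of made-up two-blob dataset (wide along x, classes differing along y), the `separation` score is a simple Fisher-style ratio chosen here for convenience, and the cut-off of 1 for "well separated" is arbitrary. It scores the PCA line and many random lines by how far apart the projected class means are relative to the projected class spread.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # points per class (assumed)

# Assumed toy data: wide spread along x, classes differ only along y.
X0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.2], size=(n, 2))
X1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.2], size=(n, 2))
Xc = np.vstack([X0, X1])
Xc = Xc - Xc.mean(axis=0)
y = np.concatenate([np.zeros(n), np.ones(n)])


def separation(Z, labels):
    """Fisher-style score: squared gap between class means over summed class variances.

    Z may be 1-D (one projection) or 2-D with one column per projection.
    """
    Z0, Z1 = Z[labels == 0], Z[labels == 1]
    gap = Z1.mean(axis=0) - Z0.mean(axis=0)
    return gap ** 2 / (Z0.var(axis=0) + Z1.var(axis=0))


# Top principal direction from the SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_score = float(separation(Xc @ Vt[0], y))

# Many random unit directions in the plane, one per row.
k = 1000
V = rng.normal(size=(k, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
scores = separation(Xc @ V.T, y)  # one score per random line, shape (k,)

beats_pca = float((scores > pca_score).mean())
well_separated = float((scores > 1.0).mean())  # arbitrary cut-off
print(f"PCA line score:                   {pca_score:.4f}")
print(f"random lines scoring above PCA:   {beats_pca:.0%}")
print(f"random lines with score above 1:  {well_separated:.0%}")
```

On data like this, some random lines separate the classes far better than the PCA line, while plenty of others still mix them up, which matches the two animations: a random projection is not a fix, it just isn't systematically drawn toward the high-variance direction.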

So - in short - it can be quite dangerous to use PCA dimension reduction with classification. It seems like a good idea at first, but it's not worth the risk of destroying class separation.  It's like a 2-in-1 toilet plunger / brush.  It messes things up.

Now that’s not to say that you should never use PCA with classification - you should just be careful when using it to reduce the dimension of classification data.  

Using it as a pre-processing technique, typically referred to as PCA-sphering, is not a bad idea, provided you can compute the SVD of your high dimensional dataset. Otherwise, a better choice for really high dimensional data is standard normalization, a technique that can be used effectively regardless of input dimension. PCA-sphering doesn’t involve reducing the dimension of your data; it conditions the data so that optimization is easier to perform.  
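To make the distinction concrete, here is a small NumPy sketch of both pre-processing steps under one common definition: standard normalization rescales each input feature to zero mean and unit standard deviation, while PCA-sphering rotates the centered data onto its principal directions and then rescales each rotated coordinate to unit standard deviation. The random correlated dataset is an assumption made up for illustration, and the sketch assumes no singular value is zero (otherwise the division would need a small regularizer).

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed example data: 500 points in 5 correlated dimensions, shifted off-center.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5)) + 3.0
N = X.shape[0]
Xc = X - X.mean(axis=0)

# Standard normalization: per-feature zero mean and unit standard deviation.
X_std = Xc / X.std(axis=0)

# PCA-sphering: rotate onto the principal directions (rows of Vt), then
# divide each rotated coordinate by its standard deviation S / sqrt(N).
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
component_std = S / np.sqrt(N)
X_sphered = (Xc @ Vt.T) / component_std

cov_std = np.cov(X_std, rowvar=False, bias=True)
cov_sph = np.cov(X_sphered, rowvar=False, bias=True)

print("shapes:", X.shape, X_std.shape, X_sphered.shape)  # dimension is unchanged
print("standard-normalized covariance diagonal:", np.round(np.diag(cov_std), 3))
print("sphered covariance is the identity:", np.allclose(cov_sph, np.eye(5), atol=1e-6))
```

Both outputs keep all five dimensions. The difference is that standard normalization only fixes the per-feature scales, whereas sphering also removes the correlations between coordinates, at the cost of needing the SVD.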

Even so, the main hurdle when classifying high dimensional data isn’t how to reduce its dimension properly but how to carefully cross-validate a linear classifier over it using K-folds:  a story for another day.