In the last Uncertainty Wednesday post on Sample Variance, I wrote that “Inference from data without explanations is how people go deeply wrong about reality.” It occurred to me that the best way to illustrate this is by writing about spurious correlation. To do that I first have to introduce the concept of correlation though. It may seem surprising that I have gotten this far into the series without doing so, but we spent a fair bit of time on a related concept, namely independence.
If you don’t recall, you should go back and read the posts on independence. The opposite of independence of two (or more) random variables is dependence. Now this is where it gets confusing. Sometimes the word “correlation” is used as a synonym for “dependence.” But more commonly “correlation” refers to a measure of a specific type of dependence, namely linear dependence.
The Wikipedia entry on correlation and dependence has a wonderful graphic illustrating what the so-called Pearson correlation coefficient does and does not measure:
The top row shows how the correlation coefficient ranges between +1 (perfect positive linear correlation) and -1 (perfect negative linear correlation), with its magnitude shrinking as the linear relationship between the two random variables weakens. It becomes 0 in the middle, where they are independent.
The second row deals with a common misconception: the correlation coefficient does not in fact measure the slope of the relationship. It just measures the strength and direction of the linear relationship. So data that fall perfectly on a line yield a coefficient of +1 or -1, whatever the (nonzero) slope of that line.
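To make the slope point concrete, here is a minimal sketch in plain Python. The helper name `pearson` and the three made-up lines are my own illustration, not anything taken from the Wikipedia graphic; the helper simply computes the usual sample Pearson coefficient.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [float(i) for i in range(1, 11)]

# Three perfectly linear relationships with very different slopes
# (hypothetical numbers chosen only for illustration).
shallow = [0.1 * x + 2.0 for x in xs]
steep = [25.0 * x - 7.0 for x in xs]
falling = [-3.0 * x + 1.0 for x in xs]

r_shallow = pearson(xs, shallow)
r_steep = pearson(xs, steep)
r_falling = pearson(xs, falling)

print(r_shallow, r_steep, r_falling)
```

Up to floating-point rounding, the two rising lines both give a coefficient of +1 and the falling line gives -1, even though the slopes differ by a factor of 250.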
The third row in turn shows that there can be very clear cases of dependence, which are immediately visually evident, and yet the correlation coefficient, as a measure of linear dependence, is 0.
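A tiny example of this kind of dependence, again as a sketch in plain Python with a hypothetical `pearson` helper of my own: y is completely determined by x (it is just x squared), yet the correlation coefficient comes out as zero because the relationship has no linear component over a symmetric range of x.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# x runs symmetrically around zero; y depends on x exactly, but not linearly.
xs = [float(i) for i in range(-5, 6)]
ys = [x ** 2 for x in xs]

r_parabola = pearson(xs, ys)
print(r_parabola)
```

The positive and negative halves of the parabola cancel each other in the covariance sum, which is why a measure of linear dependence sees nothing here.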
All of this is to say that correlation, as commonly used, is a highly specific measure of dependence. And yet correlation turns out to be widely used. As we will see, much of that is in fact abuse.
Now you might have heard the expression “correlation does not mean causation.” We will get to that also, but what we are after first is “correlation does not even mean correlation.”
Huh? What do I mean? Well, as you have seen from the posts on sample mean and sample variance, whenever you are dealing with a sample, the observed values of a statistical measure have their own distribution. The same is of course true for correlation. So two random variables may be completely independent, but when you draw a sample, the sample happens to show a nonzero correlation. That is known as spurious correlation.
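Here is a rough simulation of that idea, as a sketch under my own assumptions: two independent standard normal variables, a sample size of 10, 2,000 repeated samples, and a cutoff of 0.5 for what counts as a "strong" sample correlation. None of these numbers come from real data; they are only there to show that the sample correlation has a distribution of its own.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)   # fixed seed so the run is repeatable
sample_size = 10         # assumed: a deliberately small sample
trials = 2000            # assumed: number of repeated samples

sample_rs = []
for _ in range(trials):
    # x and y are drawn separately, so they are independent by construction.
    xs = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    sample_rs.append(pearson(xs, ys))

largest = max(abs(r) for r in sample_rs)
share_strong = sum(abs(r) > 0.5 for r in sample_rs) / trials

print(f"largest |r| seen: {largest:.2f}")
print(f"share of samples with |r| > 0.5: {share_strong:.1%}")
```

Even though the true correlation is zero, a noticeable share of these small samples shows a sample correlation that looks substantial. Increasing `sample_size` narrows the spread, but it never removes it entirely.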
Next Uncertainty Wednesday, we will look at some concrete examples of that, which will really drive home the point about the need for explanations.