Last Uncertainty Wednesday, I introduced the topic of spurious correlation. Since then I have discovered a site that gives some fantastic examples of (potentially) spurious correlations. Here is one:
The correlation coefficient is 0.9926, i.e. almost 1, the value that would indicate perfect positive correlation.
Let’s remind ourselves what finding correlation in a sample of data means. Sample correlation is simply a numerical measure that can be computed for any paired data. The formula produces a result that has nothing to do with the labels on the data. This may seem like stating the obvious, but it is really important to keep in mind. The numerical result for correlation here is the same whether the labels read “Divorce rate in Maine” and “Per capita consumption of margarine” or simply “Series 1” and “Series 2.”
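To make the point about labels concrete, here is a minimal sketch of the sample correlation calculation in Python. The helper name `pearson_r` and the numbers are made up purely for illustration (they are not the data behind the chart), and I am assuming the coefficient in question is the usual Pearson one.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation for two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variances up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numbers, invented for this example only.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.1, 3.9, 6.2, 8.1, 9.8]

# The same numbers under two different sets of labels.
labeled = {
    "Divorce rate in Maine": first,
    "Per capita consumption of margarine": second,
}
anonymous = {"Series 1": first, "Series 2": second}

r_labeled = pearson_r(*labeled.values())
r_anonymous = pearson_r(*anonymous.values())

# The labels never enter the calculation, so the two results are identical.
print(r_labeled == r_anonymous)
```

The function only ever sees the numbers. Whatever story the labels suggest has to come from us, not from the formula.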
Why am I emphasizing this? Because whether or not we think sample correlation is indicative of real correlation is something we need to decide based on our explanations about the relationship between the two random variables. I don’t have an explanation relating “Per capita consumption of margarine” to the “Divorce rate in Maine.” Importantly, though, saying that I don’t have an explanation is not the same as saying that the two are definitely independent (you may recall that independence is actually quite a strong assumption). Margarine consumption and divorce rates are both household behaviors, so it is quite possible for them to be dependent!
Next Uncertainty Wednesday we will take a deeper look at how the strength of the signal of real correlation that we get from a sample depends on our prior beliefs (based on explanations) about the actual correlation.