Uncertainty Wednesday: Spurious Correlation (Part 3)

So the last two Uncertainty Wednesday posts have been about spurious correlation. Today, I want to give an example of how easy it is to observe spurious correlation. To that end I wrote a little Python program, which I will show below, that rolls two independent dice. Each is rolled 10 times to give us two data series of 10 points each. This mimics the 10 data point series from last week’s example.

The program runs some number of these and each time calculates and outputs the coefficient of correlation. I then use Google Sheets to produce a histogram.
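To make the procedure concrete, here is a minimal sketch of such a simulation using only the standard library. The names `roll_series`, `correlation` and `simulate`, the optional `seed`, and the choice to redraw when a series is constant (all ten rolls identical, which leaves the correlation undefined) are illustrative assumptions, not necessarily how the program behind the charts works.

```python
import math
import random


def roll_series(n, rng):
    """Roll one fair six-sided die n times."""
    return [rng.randint(1, 6) for _ in range(n)]


def correlation(xs, ys):
    """Pearson correlation coefficient, or None if either series is constant."""
    if len(set(xs)) == 1 or len(set(ys)) == 1:
        return None  # zero variance: the correlation is undefined
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(runs, points=10, seed=None):
    """Return `runs` sample correlations between two independent dice series."""
    rng = random.Random(seed)
    results = []
    while len(results) < runs:
        r = correlation(roll_series(points, rng), roll_series(points, rng))
        if r is not None:  # assumption: skip undefined cases and redraw
            results.append(r)
    return results


if __name__ == "__main__":
    # One correlation per line, ready to paste into a spreadsheet column.
    for r in simulate(1000):
        print(f"{r:.4f}")
```

The output is one number per line, so it can be pasted into a single spreadsheet column and turned into a histogram from there.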

Here is the result for 1,000 runs

And here are 10,000 runs

What we see is the distribution of the sample correlation. As we add more runs we once again see a bell-shaped, roughly normal distribution emerge (isn’t that fascinating?).

Looking at the charts we see that the center of the distribution is 0, which is reassuring as it suggests that the two sets of dice rolls were in fact independent. But we also see that the bulk of the histogram is for values of the correlation that are, well, not zero. Put differently, you are much more likely to observe a non-zero correlation than a zero correlation in these samples.
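One way to put a number on "not zero" is to count how often the absolute sample correlation reaches some threshold. The sketch below does that for independent dice series; the function names, the default of 10,000 runs, and the thresholds in the usage lines are assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation, or None when a series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def share_at_least(threshold, runs=10_000, points=10, seed=None):
    """Fraction of runs whose |correlation| is at least `threshold`."""
    rng = random.Random(seed)
    hits = total = 0
    while total < runs:
        a = [rng.randint(1, 6) for _ in range(points)]
        b = [rng.randint(1, 6) for _ in range(points)]
        r = pearson(a, b)
        if r is None:
            continue  # assumption: redraw when the correlation is undefined
        total += 1
        if abs(r) >= threshold:
            hits += 1
    return hits / total


if __name__ == "__main__":
    for t in (0.1, 0.3, 0.5):
        print(t, share_at_least(t))
```

With only 10 points per series, a sizeable share of runs clears even moderate thresholds, which is the point the histograms make visually.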

Now you might object: even with 10,000 runs we don’t see correlations above 0.75 or below -0.8, but the divorce rate example had a correlation of 0.99. So isn’t that extremely unlikely?

Keep in mind that the correlation in the example was between two basically arbitrary data series of 10 points each. There are, well, billions (actually infinitely many) such series, so 10,000 runs is really nothing. I therefore changed the program to 1 million runs. The maximum observed correlation in that run was 0.9727, much closer to the 0.99 in the example. In fact, the minimum observed correlation in that run was -0.9902!
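The effect of adding runs can be checked by tracking only the smallest and largest correlation seen. This is a sketch under the same assumptions as before (fair dice, 10 points per series, undefined correlations skipped); `extremes` is a hypothetical helper name, and the exact values it returns will differ from the figures quoted above because the rolls are random.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation, or None when a series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def extremes(runs, points=10, seed=None):
    """Return (minimum, maximum) sample correlation over `runs` runs."""
    rng = random.Random(seed)
    lowest, highest = 1.0, -1.0
    done = 0
    while done < runs:
        a = [rng.randint(1, 6) for _ in range(points)]
        b = [rng.randint(1, 6) for _ in range(points)]
        r = pearson(a, b)
        if r is None:
            continue  # assumption: redraw when the correlation is undefined
        done += 1
        lowest = min(lowest, r)
        highest = max(highest, r)
    return lowest, highest


if __name__ == "__main__":
    # Kept small so it finishes quickly; raise the counts to explore further.
    for runs in (1_000, 10_000):
        print(runs, extremes(runs))
```

Calling `extremes(1_000_000)` takes noticeably longer in pure Python, but it is the direct analogue of the 1 million run experiment.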

But this direction of analysis is the wrong direction. We are assuming independence (remember this is a strong assumption) and then checking how likely it is to observe a specific correlation. What we really want to figure out is what our updated belief about correlation should be, given the data we observed, when we start from a less restrictive prior.

Interestingly, this takes me to the edge of my own knowledge and so I have asked an expert in Bayesian estimation to help me. Stay tuned!

PS Here is the Python code: