Uncertainty Wednesday: Misunderstanding Sample Correlation (Fat Tails)

Today’s Uncertainty Wednesday revisits a favorite topic of mine: correlation. I first wrote about the importance of thinking about correlation in modeling over 10 years ago, long before starting the Uncertainty Wednesday series (the reference to Excel is a giveaway). I then wrote a three-part series about spurious correlation, which you can find here: part 1, part 2 and part 3. Here is the key introductory paragraph:

Well, as you have seen from the posts on sample mean and sample variance, whenever you are dealing with a sample the observed values of a statistical measure have their own distribution. The same is of course true for correlation. So two random variables may be completely independent, but when you draw a sample, the sample happens to show correlation. That is known as spurious correlation.

But the situation is way worse than that when one of the random variables involved has a fat tailed distribution, in which extremes occur with higher probability than under, say, a normal distribution. Why? Because many fat tailed distributions do not have a well-defined variance. Instead their variance explodes towards infinity. Yet any sample from a fat tailed distribution will have a finite variance (by construction). The sample variance in this situation is not an estimate of the actual variance, since the latter does not exist. By extension, a correlation in which at least one variable is fat tailed has the same problem: correlation is covariance divided by the product of the two standard deviations, so if one of the variances does not exist, there is no actual correlation for the sample correlation to estimate.
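To make the point about sample variance concrete, here is a minimal sketch using only Python’s standard library. The choice of a Pareto distribution with shape 1.5 is my assumption for illustration (any shape of 2 or less has infinite variance); the sample sizes and the seed are arbitrary.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only so that reruns give the same numbers


def sample_variances(draw, sizes):
    """Sample variance of the first n draws, for each n in sizes."""
    data = [draw() for _ in range(max(sizes))]
    return [statistics.variance(data[:n]) for n in sizes]


sizes = [100, 1_000, 10_000, 100_000]

# Thin tails: standard normal, whose true variance is 1.
normal_vars = sample_variances(lambda: random.gauss(0.0, 1.0), sizes)

# Fat tails: Pareto with shape alpha = 1.5 (an assumed value for illustration).
# For alpha <= 2 the true variance is infinite.
pareto_vars = sample_variances(lambda: random.paretovariate(1.5), sizes)

for n, nv, pv in zip(sizes, normal_vars, pareto_vars):
    print(f"n={n:>7}  normal: {nv:8.3f}  pareto: {pv:14.3f}")
```

The normal column should settle near 1 as the sample grows. The Pareto column is always a finite number, but there is nothing for it to settle towards, and it tends to jump whenever a new extreme value lands in the sample.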

This should be a complete “Duh” moment, and yet people cannot help but use sample correlation all the time in settings where fat tails are extremely likely. Again, the sample correlation will always exist (by construction), but it doesn’t have to mean a thing! This is incredibly hard for us to accept: we follow a recipe (how to calculate correlation), we get a number (sample correlation) and yet we are supposed to ignore it?

Here is a way of understanding why. Ask yourself what would happen to the correlation if you had more data points. This is an important mental exercise as it is a counter-factual (you only have your existing data points). Sometimes a deeper truth can only be arrived at by realizing that your sample is misleading you.

Let’s look at a concrete example: outcomes in venture investing are generally seen to be fat tailed. Suppose you have a sample of venture fund sizes and returns. The sample shows negative correlation: greater fund sizes appear correlated with lower returns. Can you actually conclude that this is the case? You have to ask yourself what would happen if just one large fund completely hit it out of the park. Or maybe two funds did. Would the sign on your correlation flip to positive?
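Here is a toy version of that question in Python. All of the fund sizes and return multiples are invented for illustration, and `pearson` is just a hand-rolled helper for the usual sample correlation formula.

```python
def pearson(xs, ys):
    """Sample (Pearson) correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical sample: fund size in $M and return multiple (made-up numbers).
sizes = [50, 100, 200, 400, 800]
multiples = [3.0, 2.5, 2.0, 1.5, 1.2]
r_before = pearson(sizes, multiples)

# Counterfactual: one more large fund that hit it out of the park.
r_after = pearson(sizes + [1000], multiples + [10.0])

print(f"before: {r_before:+.2f}   after: {r_after:+.2f}")
```

With these made-up numbers, a single additional data point is enough to turn a negative sample correlation into a positive one.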

If outcomes really are fat tailed you will find that your sample correlation is not really robust (you could actually simulate this by “drawing” from the distribution, adding new hypothetical points to your sample, and then recalculating the sample correlation). This also turns out to be a central argument that Nassim Taleb made in his recent criticism of IQ as a valid concept. As per usual his criticism leaves lots of room for further debate, but I have yet to see a response that at least attempts to address this problem of sample correlation under fat tails.
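The simulation mentioned above could be sketched roughly as follows. Everything about the setup is an assumption on my part: fund sizes are drawn from a lognormal distribution, return multiples from a Pareto distribution with shape 1.5, and the two are drawn independently, so any correlation that shows up is spurious by construction. The function names (`draw_fund`, `resampled_correlations`) are made up for this example.

```python
import random


def pearson(xs, ys):
    """Sample (Pearson) correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def draw_fund(rng, alpha=1.5):
    """One hypothetical fund; size and return are drawn independently."""
    size = rng.lognormvariate(5.0, 1.0)   # fund size, arbitrary units
    multiple = rng.paretovariate(alpha)   # fat tailed return multiple
    return size, multiple


def resampled_correlations(sizes, multiples, extra=5, trials=1000, seed=0):
    """Add `extra` new hypothetical funds to the sample and recompute the
    sample correlation, repeated `trials` times."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        new = [draw_fund(rng) for _ in range(extra)]
        xs = list(sizes) + [s for s, _ in new]
        ys = list(multiples) + [m for _, m in new]
        results.append(pearson(xs, ys))
    return results


# The "existing" sample of 30 funds, itself simulated here.
rng = random.Random(1)
base = [draw_fund(rng) for _ in range(30)]
sizes = [s for s, _ in base]
multiples = [m for _, m in base]

observed = pearson(sizes, multiples)
rs = sorted(resampled_correlations(sizes, multiples))

print(f"observed sample correlation: {observed:+.2f}")
print(f"after adding 5 funds: min {rs[0]:+.2f}, "
      f"median {rs[len(rs) // 2]:+.2f}, max {rs[-1]:+.2f}")
```

The thing to look at is the spread between the minimum and the maximum. If a handful of additional draws can move the recalculated correlation a long way from the observed value, then the observed value was not telling you much to begin with.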