Uncertainty Wednesday: Misunderstanding Sample Correlation (Fat Tails)

Today’s Uncertainty Wednesday revisits a favorite topic of mine: correlation. I first wrote about the importance of thinking about correlation in modeling over 10 years ago, long before starting the Uncertainty Wednesday series (the reference to Excel is a giveaway). I then wrote a three-part series about spurious correlation, which you can find here: part 1, part 2 and part 3. Here is the key introductory paragraph:

Well, as you have seen from the posts on sample mean and sample variance, whenever you are dealing with a sample the observed value of a statistical measure has its own distribution. The same is of course true for correlation. So two random variables may be completely independent, but when you draw a sample, the sample happens to have correlation. That is known as spurious correlation.
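To see spurious correlation in action, here is a minimal sketch. It assumes two independent standard normal variables and a small sample of 10 points per trial; the helper names (`pearson`, `spurious_correlations`) and all parameter values are my own choices for illustration.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spurious_correlations(n_points, n_trials, seed=0):
    """Sample correlations of two INDEPENDENT normal variables, one per trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]
        results.append(pearson(xs, ys))
    return results

# Assumed setup: 2,000 trials of 10 points each.
corrs = spurious_correlations(n_points=10, n_trials=2000)
share_large = sum(abs(r) > 0.5 for r in corrs) / len(corrs)
print(f"average sample correlation: {statistics.fmean(corrs):+.3f}")
print(f"share of trials with |r| > 0.5: {share_large:.1%}")
```

The true correlation is zero in every trial, and the sample correlations do average out close to zero, yet a noticeable share of individual samples show a correlation that looks substantial.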

But the situation is way worse than that when one of the random variables involved has a fat-tailed distribution, in which extremes occur with higher probability than under, say, a normal distribution. Why? Because many fat-tailed distributions do not have a finite variance: the integral that defines their variance diverges to infinity. Yet any sample from a fat-tailed distribution will have a finite variance (by construction). The sample variance in this situation is not an estimate of the actual variance, since the latter does not exist. By extension, a correlation in which at least one variable is fat tailed has the same problem: the correlation is the covariance divided by the product of the two standard deviations, so it is not defined when one of the variances is not.
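A rough way to see this is to watch the sample variance as the sample grows. The sketch below assumes a Pareto distribution with tail exponent 1.5 as the fat-tailed case (its variance is infinite for any exponent of 2 or less) and compares it with a standard normal. The function names and sample sizes are illustrative choices of mine.

```python
import math
import random

def sample_variance(data):
    """Unbiased sample variance (two-pass, for numerical stability)."""
    n = len(data)
    mean = math.fsum(data) / n
    return math.fsum((x - mean) ** 2 for x in data) / (n - 1)

def variance_by_sample_size(draw, sizes, seed=0):
    """Grow one sample step by step and record its variance at each size."""
    rng = random.Random(seed)
    data, result = [], {}
    for n in sorted(sizes):
        data.extend(draw(rng) for _ in range(n - len(data)))
        result[n] = sample_variance(data)
    return result

sizes = [100, 1_000, 10_000, 100_000]
# Thin tails: standard normal, true variance 1.
normal_var = variance_by_sample_size(lambda rng: rng.gauss(0, 1), sizes)
# Fat tails (assumed): Pareto with exponent 1.5, true variance infinite.
pareto_var = variance_by_sample_size(lambda rng: rng.paretovariate(1.5), sizes)

for n in sizes:
    print(f"n={n:>7}  normal: {normal_var[n]:8.3f}  pareto: {pareto_var[n]:12.3f}")
```

The normal column settles near 1. The Pareto column is always a finite number, but it has nothing to settle towards, and it tends to jump whenever a new extreme observation enters the sample.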

This should be a complete “Duh” moment and yet people cannot help themselves but use sample correlation all the time in settings where fat tails are extremely likely. Again, the sample correlation will always exist (by construction), but it doesn’t have to mean a thing! This is incredibly hard for us to accept: we follow a recipe (how to calculate correlation), we get a number (sample correlation) and yet we are supposed to ignore it?

Here is a way of understanding why. Ask yourself what would happen to the correlation if you had more data points. This is an important mental exercise as it is a counter-factual (you only have your existing data points). Sometimes a deeper truth can only be arrived at by realizing that your sample is misleading you.

Let’s look at a concrete example: outcomes in venture investing are generally seen to be fat tailed. You have a sample of venture fund sizes and returns. The sample shows negative correlation – greater fund sizes appear correlated with lower return. Can you actually conclude that this is the case? You have to ask yourself what would happen if just one large fund completely hit it out of the park? Or maybe two funds did. Would the sign on your correlation flip to positive?
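A small sketch makes the question concrete. The fund sizes and return multiples below are entirely made up for illustration, as is the one hypothetical outlier fund; none of this is real venture data.

```python
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical sample: fund size in $M and return as a multiple of capital.
fund_sizes = [50, 100, 150, 200, 300, 400, 500, 750, 1000]
multiples = [3.5, 2.8, 3.0, 2.2, 2.0, 1.8, 1.5, 1.4, 1.1]

r_before = pearson(fund_sizes, multiples)

# Counterfactual: one large fund "hits it out of the park" (hypothetical point).
r_after = pearson(fund_sizes + [900], multiples + [25.0])

print(f"correlation in the sample:       {r_before:+.2f}")
print(f"correlation with one more fund:  {r_after:+.2f}")
```

In this made-up sample the correlation is strongly negative, and a single additional large fund with an extreme return is enough to turn it positive.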

If outcomes really are fat tailed you will find that your sample correlation is not really robust (you could actually simulate this by “drawing” from the distribution and adding new hypothetical points to your sample and then recalculating the sample correlation). This also turns out to be a central argument that Nassim Taleb made in his recent criticism of IQ as a valid concept. As per usual his criticism leaves lots of room for further debate, but I have yet to see a response that at least attempts to address this problem of sample correlation under fat tails.
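The simulation mentioned above could be sketched roughly as follows. Everything specific here is an assumption for illustration: the sample is hypothetical, new fund sizes are resampled from the existing ones, new returns are drawn from a Pareto distribution with tail exponent 1.5, and two new funds are added per trial.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical sample: fund size in $M and return as a multiple of capital.
fund_sizes = [50, 100, 150, 200, 300, 400, 500, 750, 1000]
multiples = [3.5, 2.8, 3.0, 2.2, 2.0, 1.8, 1.5, 1.4, 1.1]

def correlations_with_new_funds(sizes, returns, n_new, n_trials, alpha, seed=0):
    """Add n_new hypothetical funds per trial and recompute the correlation."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        # Assumption: new sizes look like existing ones, independent of returns.
        new_sizes = [rng.choice(sizes) for _ in range(n_new)]
        # Assumption: returns are Pareto distributed with tail exponent alpha.
        new_returns = [rng.paretovariate(alpha) for _ in range(n_new)]
        results.append(pearson(sizes + new_sizes, returns + new_returns))
    return results

r_observed = pearson(fund_sizes, multiples)
simulated = correlations_with_new_funds(
    fund_sizes, multiples, n_new=2, n_trials=2000, alpha=1.5
)
share_flipped = sum(r > 0 for r in simulated) / len(simulated)

print(f"observed correlation: {r_observed:+.2f}")
print(f"range after adding two funds: {min(simulated):+.2f} to {max(simulated):+.2f}")
print(f"share of trials where the sign flips: {share_flipped:.1%}")
```

The number to look at is the range: if two additional data points can move the correlation across much of the interval from -1 to +1, the original figure was not telling you much. The exact numbers depend on the assumed tail exponent, so it is worth rerunning with a few different values of `alpha`.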
