Today’s Uncertainty Wednesday will be the concluding post in my mini-series on the problem with p-values. We have already seen that it is much easier than expected to reject a null hypothesis if you have incentives to do so. We also saw that the ability to work backwards and generate hypotheses from the data is a big issue. Today we will consider a more foundational, epistemological problem with p-values: what are we really learning when we reject a null hypothesis?
Let’s once again consider the original example of a coin toss where our null hypothesis is that the coin is fair (and independent). We have done everything by the book. We had our null hypothesis ahead of time (not generated from the data). We committed to exactly 6 tosses, rather than cheating on our data collection, and they all came up heads (or all tails for that matter). And so with great satisfaction we reject the null hypothesis at a p-value of 0.03125.
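For the record, here is a minimal sketch of that calculation, assuming a two-sided reading where all heads and all tails count as equally extreme outcomes under the fair-coin null:

```python
# Sketch: p-value for the coin example above, assuming a two-sided test
# where "6 heads in a row" and "6 tails in a row" are both counted as extreme.

n_tosses = 6
p_fair = 0.5

# Probability under the null (fair coin) of one specific streak, e.g. all heads
p_one_streak = p_fair ** n_tosses

# Two-sided: all heads OR all tails
p_value = 2 * p_one_streak

print(p_value)  # 0.03125
```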
But what does that actually mean? What have we really learned from doing so? Our null hypothesis here is incredibly narrow. It is that the coin is precisely fair. Rejecting that leaves open a ton of other possibilities. Is the coin just slightly unfair or is it extremely unfair? Which of these two possibilities is more likely given what we have observed? And why did we pick this narrow null hypothesis in the first place?
Let’s take a step back. Suppose I don’t tell you that we are dealing with a coin, just with a process that has two possible observable signals H and T. If you know nothing else about the process, that allows for anything from observing only Hs, to only Ts, to some random mix of the two. That makes it clear that having as your null hypothesis that the mix will be random at exactly 50% Hs and 50% Ts is an incredibly narrow assumption. It picks out a single real number, 0.5, from a continuous interval that runs from 0 (no Hs) to 1 (all Hs).
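As a rough illustration of why rejecting that single point tells us so little, here is a sketch comparing how likely 6 heads in a row is under a few possible biases (the specific bias values are mine, chosen only for illustration):

```python
# Sketch: likelihood of observing 6 heads in 6 tosses under different biases.
# The bias values below are arbitrary illustration points, not from the post.

n_heads = 6
biases = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

for b in biases:
    likelihood = b ** n_heads  # P(6 heads in 6 tosses | bias = b)
    print(f"bias {b:.1f}: likelihood of 6 heads = {likelihood:.4f}")

# A bias of 0.9 makes the data ~0.53 likely versus ~0.016 for a fair coin,
# but rejecting "exactly 0.5" by itself says nothing about which bias holds.
```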
This is related to the issue we encountered previously with spurious correlation. A null hypothesis of zero correlation between two variables is an incredibly narrow assumption, when the possible correlation can lie anywhere on the continuous interval from -1 to +1. So again, when we reject that narrow hypothesis what have we actually learned? Only that some very narrowly defined assumption is unlikely. That’s not a lot of learning.
This is a fundamental limitation of the p-values approach. Generally people tend to pick very narrow null hypotheses, and rejecting them doesn’t tell us much about the alternatives. Now this can be seen as a slightly unfair criticism: if you run a large number of tosses and get a p-value of 0.0000001, you do have the information that the coin is likely to be very unfair. But with the p-values approach that additional step tends to stay buried.
What is the alternative? The alternative is to take a Bayesian approach instead. We already saw, in the case of correlation, how that provides a lot more information than the rejection of a null hypothesis.
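For the coin example, here is a minimal sketch of what that looks like, assuming a uniform Beta(1, 1) prior over the coin’s bias (the choice of prior is mine, not something specified in this post):

```python
# Sketch of a Bayesian treatment of the coin example, assuming a uniform
# Beta(1, 1) prior over the bias. Observing 6 heads and 0 tails gives a
# Beta(1 + 6, 1 + 0) = Beta(7, 1) posterior via the conjugate update.

from scipy.stats import beta

prior_a, prior_b = 1, 1      # uniform prior over the bias in [0, 1]
heads, tails = 6, 0          # the observed tosses

posterior = beta(prior_a + heads, prior_b + tails)

print(posterior.mean())           # ~0.875: expected bias given the data
print(posterior.interval(0.95))   # 95% credible interval for the bias
print(1 - posterior.cdf(0.6))     # P(bias > 0.6 | data)
```

Instead of a yes/no verdict on the single point “exactly fair,” this gives us a full distribution over how unfair the coin might be, which directly addresses the “slightly unfair or extremely unfair” question from above.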