As promised for some time, today in Uncertainty Wednesday I will talk about p-values and what makes them so problematic. We will once again look at a super simple example by going back to the coin flip. As before, we will consider the highest uncertainty explanation, which is that the mechanism producing the coin flip produces heads (H) and tails (T) each with probability 0.5 and that each flip is independent.
Now the idea behind p-values is to attempt an argument of reductio ad absurdum but in a setting with uncertainty. We will assume that the explanation is true (this is also called the null hypothesis) and then see if our observations are so unlikely that they amount to a contradiction of our assumption. In the coin example: we will start by assuming that H and T are equally probable, then we will observe a sequence of Hs and Ts and if that sequence is really unlikely given our assumption, then we will reject that assumption.
Now the first question we have to ask ourselves is what does it mean for a sequence to be really unlikely given our assumption? This is an important question because we know that every sequence of a given length is actually equally probable given our assumption of equal probability and independence. What do I mean by that? Let’s take sequences of length 6 for example:
P(HTHHTT) = P(H)*P(T)*P(H)*P(H)*P(T)*P(T) = 0.5*0.5*0.5*0.5*0.5*0.5 = 0.01563
P(TTHHTH) = P(T)*P(T)*P(H)*P(H)*P(T)*P(H) = 0.5*0.5*0.5*0.5*0.5*0.5 = 0.01563
P(HHHHHH) = P(H)*P(H)*P(H)*P(H)*P(H)*P(H) = 0.5*0.5*0.5*0.5*0.5*0.5 = 0.01563
And we see that each of them is equally likely (or unlikely) given our assumptions. So this would not seem to help us much at all!
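If you want to verify this mechanically, here is a minimal Python sketch (the encoding of sequences as strings of H and T is just my choice for illustration) that multiplies out the per-flip probabilities and checks that all 64 length-6 sequences are equally likely:

```python
from itertools import product
from math import prod

# Per-flip probabilities under the null hypothesis of a fair coin.
p = {"H": 0.5, "T": 0.5}

def sequence_probability(seq):
    # Independence lets us multiply the per-flip probabilities.
    return prod(p[flip] for flip in seq)

for seq in ["HTHHTT", "TTHHTH", "HHHHHH"]:
    print(seq, sequence_probability(seq))  # each prints 0.015625

# Every one of the 2^6 = 64 possible length-6 sequences comes out the same.
all_seqs = ["".join(s) for s in product("HT", repeat=6)]
assert all(sequence_probability(s) == 0.5 ** 6 for s in all_seqs)
```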
So how could we distinguish between these sequences? The idea is to compute a statistic, i.e. a condensation of the data. The statistic we might be most interested in here is the sample mean. Let’s say we take H=1 and T=0, then the sample means are as follows:
Mean(HTHHTT) = (1+0+1+1+0+0) / 6 = 3/6 = 0.5
Mean(TTHHTH) = (0+0+1+1+0+1) / 6 = 3/6 = 0.5
Mean(HHHHHH) = (1+1+1+1+1+1) / 6 = 6/6 = 1
Now we are getting somewhere. There are many sequences that will give us a mean of 0.5 or close to it (20 of the 64 possible sequences have a mean of exactly 0.5). But only one sequence gives us a mean of 1, namely HHHHHH, and only one gives us a mean of 0, namely TTTTTT. These are the two most extreme possible deviations from 0.5.
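Continuing the sketch above (with H counted as 1 and T as 0), we can tally the sample mean of every one of the 64 equally likely sequences and see the full distribution of the statistic:

```python
from itertools import product
from collections import Counter

# Tally the sample mean (H = 1, T = 0) of each of the 64 sequences.
mean_counts = Counter(
    sum(flip == "H" for flip in seq) / 6 for seq in product("HT", repeat=6)
)

# Prints 20 sequences at mean 0.5, but only 1 each at mean 0 and mean 1.
for mean, count in sorted(mean_counts.items()):
    print(f"mean = {mean:.3f}: {count} sequences")
```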
The p-value then is defined as the probability, given our explanation (aka assumption, aka null hypothesis), of observing a value of the sample statistic at least as extreme as the one we actually observed. Here “extreme” means far from the value implied by the explanation, in either direction, so a mean of 0 counts as just as extreme as a mean of 1. So in our example the p-value of observing mean = 1 is
P(HHHHHH) + P(TTTTTT) = 0.01563 + 0.01563 = 0.03125
Finding a sample mean of 1 given our assumptions thus has a p-value of 0.03125. That is less than 0.05, the cutoff commonly used in studies across fields as diverse as medicine and education. Following that approach, we would reject our explanation that H and T are equally probable.
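We can compute this p-value by brute force over all 64 sequences, where “at least as extreme” means a mean at least as far from 0.5 as the observed one:

```python
from itertools import product

# Two-sided p-value for an observed mean of 1: the probability, under the
# null hypothesis, of a sample mean at least as far from 0.5 as observed.
observed_mean = 1.0
p_value = sum(
    0.5 ** 6
    for seq in product("HT", repeat=6)
    if abs(sum(f == "H" for f in seq) / 6 - 0.5) >= abs(observed_mean - 0.5)
)
print(p_value)         # 0.03125: only HHHHHH and TTTTTT are this extreme
print(p_value < 0.05)  # True, so we would reject the null hypothesis
```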
Now all of this sounds super logical. There doesn’t seem to be any obvious error of reasoning. And yet the use of p-values is wildly problematic. Over the next few posts we will explore why.
As “homework,” you might consider the following scenario: you are a researcher who gets paid only if you reject the explanation of equal probability with a p-value cutoff of 0.05. How much work do you have to do to come up with a sequence of observations that gets you the desired result?
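If you want to explore the homework empirically, here is a minimal simulation sketch (assuming the researcher simply repeats the six-flip experiment until one run rejects): it flips a genuinely fair coin six times, over and over, and counts how many tries it takes before a run clears the 0.05 cutoff.

```python
import random
from math import comb

def p_value(heads):
    # Two-sided p-value for observing `heads` heads in six fair flips.
    return sum(
        comb(6, k) * 0.5 ** 6
        for k in range(7)
        if abs(k - 3) >= abs(heads - 3)
    )

random.seed(42)  # arbitrary seed so the run is reproducible
attempts = 0
while True:
    attempts += 1
    heads = sum(random.random() < 0.5 for _ in range(6))
    if p_value(heads) < 0.05:
        break

# Even though the coin really is fair, a "significant" result shows up
# after a modest number of repeated six-flip experiments.
print(attempts)
```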