Uncertainty Wednesday: The Problem with P-Values (Generating Hypotheses)

Today’s Uncertainty Wednesday continues our exploration into p-values and why they are problematic. Last Wednesday we saw that if you have incentives to reject a null hypothesis, it takes less work than you would initially think to find data that gets you there. I ended that post suggesting that the problem is even bigger than that. How so?

We now live in the age of “big data” – researchers in many fields have access to massive data sets. This lends itself to an approach that has become known as “data dredging.” Instead of starting with the null hypothesis of a “fair and independent coin” we start with a large database of pre-recorded coin flips. Now we work backwards to find a hypothesis that we can reject with a p-value of 0.05 or maybe even 0.01 in our data set!

How would we do such a thing and what would such a hypothesis look like? Well, with a dataset containing just Hs and Ts we would have to be a bit creative. But we could generate hypotheses that take the form of a probabilistic finite state machine. For instance: the coin first has a probability of 20% H and 80% T; if H, it has a subsequent probability of 70% H again, but if T, then it only has a 10% chance of repeating T. You get the idea. You could write computer code that generates such hypotheses until you find one that you can reject with a really significant p-value in your dataset. Then you go and publish!
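To make the idea concrete, here is a minimal sketch (names and structure are my own) of the two-state coin hypothesis described above as a probabilistic finite state machine: 20% H on the first flip, 70% H after an H, and only a 10% chance of T repeating after a T.

```python
import random

def sample_fsm_coin(n_flips, seed=0):
    """Sample a sequence of flips from the two-state coin hypothesis."""
    rng = random.Random(seed)
    flips = []
    prev = None
    for _ in range(n_flips):
        if prev is None:
            p_heads = 0.20   # first flip: 20% H, 80% T
        elif prev == "H":
            p_heads = 0.70   # after an H, 70% chance of H again
        else:
            p_heads = 0.90   # after a T, only a 10% chance of repeating T
        prev = "H" if rng.random() < p_heads else "T"
        flips.append(prev)
    return flips

print("".join(sample_fsm_coin(20)))
```

A hypothesis generator would simply randomize the three probabilities (and possibly the number of states) to churn out endless candidate machines to test against the recorded flips.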

Now you might object: Albert, these are completely arbitrary hypotheses, why would anyone believe them? Well, they only come across as arbitrary because I purposely stayed within the domain of a coin flip. But most big datasets are really complex, containing many different variables. Just take the coin flip database and combine it with a database of stock price fluctuations. Now you can test tons of different hypotheses of the form: price movements for stock x are not correlated with the coin flips (where H might mean the stock price for x moves up and T that it moves down).

Again you can have your computer generate these hypotheses for you and test them until you find one you can reject with a p-value that’s deemed significant. These hypotheses are just as arbitrary as the coin state machines I suggested above, but they don’t look that way. They look really simple and thus credible.
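Here is a hedged sketch of what that dredging loop might look like: one fixed dataset of perfectly fair coin flips, and a stream of computer-generated "stock" series that are, by construction, unrelated to it. We just keep testing until one crosses the 0.05 threshold (the p-value uses the normal approximation to the binomial; all names here are my own).

```python
import math
import random

rng = random.Random(42)
n = 1000
# One fixed dataset of fair, independent coin flips (True = H).
coin = [rng.random() < 0.5 for _ in range(n)]

def p_value_of_match_count(matches, n):
    # Two-sided p-value for observing `matches` agreements out of n
    # under the null of no association (agreement probability = 0.5),
    # via the normal approximation to the binomial distribution.
    z = (matches - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

tried = 0
while True:
    tried += 1
    # Generate an unrelated candidate "stock" series and test it.
    stock = [rng.random() < 0.5 for _ in range(n)]
    matches = sum(c == s for c, s in zip(coin, stock))
    if p_value_of_match_count(matches, n) < 0.05:
        break

print(f"hypothesis #{tried} rejected at p < 0.05")
```

Since each unrelated series has roughly a 5% chance of clearing the threshold, the loop typically succeeds after only a few dozen tries, which is exactly the problem.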

But this approach completely violates the statistical reasoning behind p-values. That reasoning only applies if you start with the hypothesis and then apply the test. In any large dataset you will always be able to work backwards towards hypotheses that can be rejected *in that dataset*. Just recall the prior posts about spurious correlation.
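One way to see that the rejection only holds *in that dataset* is to take a dredged-up "significant" series and re-test the same hypothesis on fresh flips; it will usually fall apart. A self-contained sketch (my own illustrative code, not a formal procedure):

```python
import math
import random

rng = random.Random(7)
n = 1000

def flips():
    # A series of fair, independent coin flips (True = H).
    return [rng.random() < 0.5 for _ in range(n)]

def p_value(series_a, series_b):
    # Two-sided p-value for the number of positions where the two series
    # agree, under the null of no association (normal approximation).
    matches = sum(a == b for a, b in zip(series_a, series_b))
    z = (matches - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

coin = flips()  # the dataset we dredge
# Dredge: generate unrelated "stock" series until one looks significant.
while True:
    stock = flips()
    if p_value(coin, stock) < 0.05:
        break

# Now apply the *same* hypothesis to fresh coin flips.
fresh_coin = flips()
print(f"p in dredged dataset: {p_value(coin, stock):.3f}")
print(f"p in fresh dataset:   {p_value(fresh_coin, stock):.3f}")
```

The first p-value is below 0.05 by construction; the second one is just an ordinary draw, so about 95% of the time the "effect" vanishes on fresh data.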

OK, so that’s pretty bad, given that so many people have incentives to find hypotheses they can reject so that they can publish a paper or claim that a product is effective. But next Wednesday we will look into an even more profound problem with p-values.
