So I decided to match words with actions, and correct for multiple testing in the honesty paper. The experience left me feeling a bit ambivalent. Here's what I learned, and some conclusions.
- There are many ways to "correct" for multiple testing, and the P values mean different things.
Suppose now you have 20 P values for 20 null hypotheses. Even if all the nulls are true, you will quite probably get at least one P < 0.05 just by the luck of the draw (with independent tests, the chance is about 64%). But how do you want to correct for that?
One question is: "what's the chance of rejecting even one null hypothesis, if all the nulls are true?" This is the "family-wise error rate". So, if you reject a hypothesis when its P value is less than 0.05, you need to adjust those P values upwards somehow, so that, if your nulls are all true, there is no more than a 5% chance of any adjusted value falling below 0.05.
The simplest way to do this is the Bonferroni correction: multiply your P values by 20. This works because, for any events:
Prob(A or B happens) ≤ Prob(A happens) + Prob(B happens) (*)
Applying this:
Prob(any P value < 0.05 under null) ≤ Prob(first P value < 0.05 under null) + ... + Prob(20th P value < 0.05 under null)
So, if we multiply each P value by 20 and reject whenever a corrected P value is less than 0.05, we are really rejecting whenever an uncorrected P value is less than 0.0025. And assuming that the underlying tests are correct, i.e. that under the null the chance of each P value being 0.0025 or less is indeed 0.0025:
Prob(any corrected P value < 0.05 under null) ≤ 0.0025 + 0.0025 + ... + 0.0025 = 0.05
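If it helps, here is that arithmetic as a few lines of Python. The P values are randomly generated purely for illustration, and I'm assuming the statsmodels package:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=20)  # under the null, each P value is uniform on [0, 1]

# Bonferroni: multiply each P value by the number of tests (capping at 1)
reject, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

print(np.allclose(p_bonf, np.minimum(pvals * 20, 1.0)))       # True: same as multiplying by 20
print(bool(reject.any()), bool((pvals <= 0.05 / 20).any()))   # reject iff some raw P is at most 0.0025
```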
OK? Fine. But, two problems.
First, this test is very conservative. It relies on the inequality marked (*) above, which only holds with equality if the two events are mutually exclusive. For example, the probability, when rolling a die, of getting an even number or a roll of 4 or more is 4 in 6; the probability of either event on its own is 1 in 2. So, the Bonferroni correction only gives you exact P values if it is impossible to get more than one P value below the cutoff at the same time under the null.
Take an extreme case. Suppose you run the same test twice. Obviously you get the same P value. The chance of getting either P value below 0.05 is just 0.05. If you Bonferroni correct, you are arbitrarily doubling your P values and your chance of getting a corrected P value < 0.05 is 0.025.
Of course you wouldn't do that, but if you run two similar tests - say, tests on the same sample that might have the same kind of error - then you will have the same issue.
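A quick simulation makes the extreme case concrete: duplicate a single test, Bonferroni-correct, and the family-wise rejection rate under the null drops to about 2.5% instead of the 5% you asked for. This is just a sketch with simulated uniform P values:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(size=100_000)  # one test under the null, repeated across many simulations

# "Run the same test twice": both P values are identical, so the chance of
# any raw P < 0.05 is just 5%. Bonferroni doubles the P values, so a
# corrected P < 0.05 needs a raw P < 0.025.
print((p < 0.05).mean())      # ~0.05  : the true family-wise error rate
print((2 * p < 0.05).mean())  # ~0.025 : what Bonferroni actually delivers
```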
The second problem is that it doesn't always make sense to worry about making a single type I error. If I compare 15 countries on some score, I can make 105 possible pairwise comparisons. Do I really want to have less than a 5% chance of getting any star anywhere?
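(For the record, that 105 is just "15 choose 2":)

```python
from math import comb

print(comb(15, 2))  # 105 possible pairwise comparisons among 15 countries
```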
That suggests an alternative way of correcting P values: controlling the "false discovery rate". Correcting this way means: if you reject null hypotheses when they have a corrected P value of less than x%, then on average, no more than x% of your rejected nulls will actually be true.
But this has problems too. Standard corrections are still conservative. And while significance stars indicating, say, P < 0.05, make some sense, it is hard to make much sense of a specific corrected P value. For example, a corrected P value of 0.03 roughly means "if I rejected every null with a corrected P value this small, then on average no more than 3% of those rejections would be false". OK, but what does that tell me about this particular hypothesis?
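The standard correction of this kind is Benjamini-Hochberg. A minimal sketch with made-up P values, again assuming statsmodels:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up P values, purely for illustration
pvals = np.array([0.001, 0.008, 0.02, 0.03, 0.04, 0.20, 0.45, 0.60, 0.75, 0.90])

# Benjamini-Hochberg: controls the expected share of false discoveries
reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for raw, corrected, r in zip(pvals, p_fdr, reject):
    print(f"raw {raw:.3f} -> corrected {corrected:.3f}{' *' if r else ''}")
```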
As a result of these problems:
- Corrected P values are often hard to interpret.
P values are confusing already. Corrected ones can add a new layer of confusion. They need to be explained carefully.
- It might be more important to correct for 2 tests than for 20.
The problem is papers that run just 2 or 3 tests, each presented on its own, and get one or two significant results. But that's already enough to seriously screw up P values. Suppose these tests are independent: the chance of getting at least one P < 0.05 in 2 tests is almost 10%. In 3, it's about 14%. See this famous and funny paper.
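The arithmetic behind those numbers, if you want to check it:

```python
# Chance of at least one P < 0.05 among k independent tests when all nulls are true
for k in (1, 2, 3, 5, 10, 20):
    print(k, round(1 - 0.95 ** k, 3))
# 2 -> 0.098 ("almost 10%"), 3 -> 0.143, 20 -> 0.642
```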
- Authors need to think about what tests go together.
These different tests are doing different things. Some are my key hypotheses. Some are more like robustness checks. Others are put in because a reviewer wanted them. How many tests am I running? Offhand, I don't know, and I'm sure my readers won't either. So, just reporting corrected P values for the whole paper makes no sense. Who cares what proportion of my results would be significant under the null? Or whether any one of them would be? What matters is, for each group of key hypotheses – the things I really want to claim – how strong the evidence for it is. So, don't correct for everything in your paper together. Group things that belong together as conceptually "a single hypothesis", and correct within each group. I mostly did this, running several separate corrections for multiple testing.
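To give the flavour, here's a hypothetical sketch of correcting within groups rather than across the whole paper. The group names and P values are invented, and I'm again assuming statsmodels:

```python
from statsmodels.stats.multitest import multipletests

# Invented example: each group is treated as conceptually "a single hypothesis"
groups = {
    "key hypothesis A": [0.012, 0.030, 0.041],
    "key hypothesis B": [0.004, 0.220],
    "robustness checks": [0.060, 0.080, 0.150, 0.300],
}

for name, pvals in groups.items():
    reject, p_corr, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
    print(name, [round(p, 3) for p in p_corr], reject.tolist())
```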
- Bootstrapping has promise but can be hard to implement.
This is cute, and potentially gives exact rather than conservative P values, because resampling can take the correlations between tests into account. But it clearly requires a lot of work, especially if you haven't thought about it from the start.
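To give a sense of what's involved, here is a rough sketch of the closely related permutation version of the Westfall-Young "maxT" idea (not necessarily what I did in the paper; the data are simulated, and a real analysis needs the resampling to mirror the actual estimation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data: two groups of 40, five outcome variables, all simulated
n, n_tests = 40, 5
group0 = rng.normal(size=(n, n_tests))
group1 = rng.normal(size=(n, n_tests)) + 0.4  # small shift on every outcome
data = np.vstack([group0, group1])
labels = np.array([0] * n + [1] * n)

def abs_t(data, labels):
    """|t| statistic for each outcome, comparing the two labelled groups."""
    a, b = data[labels == 0], data[labels == 1]
    return np.abs(stats.ttest_ind(a, b, axis=0).statistic)

observed = abs_t(data, labels)

# Resample under the null by shuffling the group labels, and compare each
# observed |t| with the *maximum* |t| across all tests in each resample.
n_resamples = 2000
max_t = np.array([abs_t(data, rng.permutation(labels)).max()
                  for _ in range(n_resamples)])

adjusted_p = (max_t[:, None] >= observed[None, :]).mean(axis=0)
print(np.round(adjusted_p, 3))  # corrected P values that respect the correlations
```

Which brings me to the last point: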
- Write your analysis with multiple testing in mind.
Apologies for the length of this post. I'll try to be shorter in future (and write about more enjoyable topics...)