David Hugh-Jones: March 2016

So I decided to match words with actions, and correct for multiple testing in the honesty paper. The experience left me feeling a bit ambivalent. Here's what I learned, and some conclusions.

There are many ways to "correct" for multiple testing, and the P values mean different things.

A standard P value means: "suppose the null hypothesis holds. What would the chance be of getting a result like this?" where "like this" typically means "at least as different from the null". Misunderstanding of P values is widespread, and even the statement above isn't quite right. But, in standard setups, most of us have some intuition for what a P value is telling us: a measure of how far the mean of the data is from the null, in terms of the variance of the data and the sample size, all squished into one figure, so we know how (un)likely this result is under the null.

Suppose now you have 20 P values for 20 null hypotheses. If all the nulls are true, you will probably get a P < 0.05 just by the luck of the draw. But how do you want to correct for that?
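To put a number on "probably", here is a minimal sketch. It assumes the tests are independent, which the text above does not require, and the function name is just an illustrative label. Under that assumption the chance of at least one P < 0.05 among 20 true nulls comes out at roughly 64%.

```python
def chance_of_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_tests independent tests gives P < alpha
    when every null hypothesis is true."""
    # Each test stays above alpha with probability (1 - alpha);
    # independence lets us multiply those probabilities together.
    return 1 - (1 - alpha) ** n_tests


for n in (1, 2, 3, 20):
    print(n, "tests:", round(chance_of_any_false_positive(n), 3))
```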

One question is: "what's the chance of rejecting even one null hypothesis, if all the nulls are true?" So, if you reject a hypothesis when its P value is less than 0.05, you need to adjust those P values upwards somehow, so that, if your nulls are all true, there is no more than a 5% chance of getting any single value below 0.05.

The simplest way to do this is the Bonferroni correction: multiply your P values by 20. This works because, for any events:

Prob(A or B happens) ≤ Prob(A happens) + Prob(B happens) (*)

Applying this:

Prob(any P value < 0.05 under null) ≤ Prob(first P value < 0.05 under null) + ... + Prob(20th P value < 0.05 under null)

So now, if we multiply our P values by 20, and reject if any corrected P value is less than 0.05, we will be rejecting if any uncorrected P value is less than 0.05/20 = 0.0025. And assuming that the basic tests are correct, i.e. that the chance of the first P value being 0.0025 or less is indeed 0.0025 under the null:

Prob(any corrected P value < 0.05 under null) ≤ 0.0025 + 0.0025 + ... + 0.0025 = 20 × 0.0025 = 0.05
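In code, the correction is a one-liner. This is a minimal sketch: the P values are made up for illustration, and capping the result at 1 is a common convention rather than something the argument above depends on.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-corrected P values: multiply by the number of tests, cap at 1.

    n_tests defaults to the number of P values passed in. Pass it explicitly
    if you are only showing some of the tests you ran.
    """
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Three made-up P values out of a family of 20 tests.
raw = [0.001, 0.004, 0.03]
corrected = bonferroni(raw, n_tests=20)
for p, q in zip(raw, corrected):
    print(f"raw {p:.4f} -> corrected {q:.2f}, reject at 5%: {q < 0.05}")
```

Only the first of these survives: 0.001 is below 0.0025, the other two are not.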

OK? Fine. But, two problems.

First, this test is very conservative. It is using that inequality marked (*) above. That inequality only holds with equality if the two events are mutually exclusive. For example, the probability, when rolling a die, of getting an even number or a roll of 4 or more is 4 in 6, but each event on its own has probability 1 in 2, so the right-hand side of (*) adds up to 1. So, the Bonferroni correction only gives you exact P values if it is impossible to get more than one corrected P value less than 0.05 under the null.

Take an extreme case. Suppose you run the same test twice. Obviously you get the same P value. The chance of getting either P value below 0.05 is just 0.05. If you Bonferroni correct, you are arbitrarily doubling your P values and your chance of getting a corrected P value < 0.05 is 0.025.

Of course you wouldn't do that, but if you run two similar tests - say, tests on the same sample that might have the same kind of error - then you will have the same issue.
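A small simulation makes the point. This is a sketch under assumptions of my own choosing: two z tests whose statistics are standard normal under the null, with correlation rho between them. With rho = 1 (the same test run twice) the Bonferroni-corrected rejection rate should sit near 0.025, and with rho = 0 it should be close to the nominal 0.05.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_rejection_rate(rho, n_sims=100_000, alpha=0.05, seed=1):
    """Share of simulations, with both nulls true, in which at least one
    Bonferroni-corrected P value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0, 1)
        # z2 is standard normal too, with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Two tests, so the Bonferroni correction doubles each P value.
        if min(2 * two_sided_p(z1), 2 * two_sided_p(z2)) < alpha:
            hits += 1
    return hits / n_sims


for rho in (0.0, 0.5, 0.9, 1.0):
    print("rho =", rho, "rejection rate:", bonferroni_rejection_rate(rho))
```

The more similar the two tests, the further the true error rate falls below the 5% you thought you were allowing yourself.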

The second problem is that it doesn't always make sense to worry about making a single type I error. If I compare 15 countries on some score, I can make 105 possible pairwise comparisons. Do I really want to have less than a 5% chance of getting any star anywhere?

That suggests an alternative way of correcting for P values: to control the "false discovery rate". Correcting this way means: if you reject null hypotheses when they have a corrected P value of less than x%, then on average, no more than x% of your rejected nulls will be true.

But this has problems too. Standard corrections are still conservative. And while significance stars indicating, say, P<0.05, make some sense, it is hard to make much sense of a specific P value. For example, a corrected P value of 0.03 would mean roughly "if I reject every null in this set with a corrected P value of 0.03 or less, then on average not more than 3% of the nulls I reject will be true". OK, but what do I know about this hypothesis?
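For concreteness, here is a sketch of one standard false discovery rate correction, the Benjamini-Hochberg procedure. Picking this procedure is my assumption for illustration; the discussion above applies to false discovery rate corrections generally, and the input P values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the original order."""
    m = len(p_values)
    # Indices of the P values, smallest P first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down. Each raw P is scaled by m / rank,
    # and the running minimum keeps the adjusted values in the same order
    # as the raw ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # made-up P values
print(benjamini_hochberg(raw))
```

Notice that an adjusted value depends on the whole set of P values, including ones larger than itself, which is part of why a single corrected number is hard to read.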

As a result of these problems,

Corrected P values are often hard to interpret.

Most scientists can translate roughly between P values and t statistics in their head, and get a sense of what the data looks like. Now imagine a P value corrected for false discovery rate. And bear in mind that the P value is an upper bound. Do you know what it means in terms of the data? I would struggle.

P values are confusing already. Corrected ones can add a new layer of confusion. They need to be explained carefully.

It might be more important to correct for 2 tests than for 20.

When a naive researcher presents a table of 20 results and some significance stars, I know the whole audience is thinking "yeah, right! One out of 20! Big deal!" We know how to deal with that.

The problem is the papers which do just 2 or 3 tests, each presented on its own, and get one or two significant results. That's already enough to seriously screw up p values. Suppose these tests are independent and all the nulls are true: the chance of getting at least one p<0.05 result in 2 tests is almost 10% (1 − 0.95² ≈ 0.098). In 3, 14%. See this famous and funny paper.

Authors need to think about what tests go together.

My paper includes: 2 dependent variables (alternative measures of the same concept). Some tests of whether those two variables correlate at individual level, and if they correlate with self-reports of ethically dubious behaviour (they do) and of own ethical standards (they don't). Some tests for differences between 15 countries. A footnote with tests just for the first eight countries I looked at. (So I can't be accused of collecting data till I got significance, which is another problem!) Some more tests of the differences, adding individual-level controls. A quick check of whether dishonesty correlated with distance from Britain (to check if anti-UK preferences might be driving the result; there was a correlation for one dependent variable only). A bunch of country-level correlations with GDP, trust and corruption. Then, a whole section on beliefs about dishonesty, with some more regressions....

These different tests are doing different things. Some are my key hypotheses. Some are more like robustness checks. Others are put in because a reviewer wanted them. How many tests am I running? Offhand, I don't know, and I'm sure my readers won't either. So, just reporting corrected P values for the whole paper makes no sense. Who cares what proportion of my results would be significant under the null? Or whether any one of them would be significant? What matters is, for each group of key hypotheses – things I really want to claim – how strong is the evidence for that. So, you don't want to correct for everything in your paper together. Group things that belong together as conceptually "a single hypothesis". I mostly did this, doing several different corrections for multiple testing.

Bootstrapping has promise but can be hard to implement.

The List paper I cited last time is an example of a nice method for multiple hypothesis testing which allows your tests to be non-independent (see the example above). A very rough idea is: in your data, reassign treatment/control groups (or dependent variable values) so that the null hypothesis is true. (Like a permutation test.) Run your analyses and look at the p values you get. Do this many times. You will get an idea of the distribution of p values under the null - in particular, how often the lowest p value is less than 0.05. From this you can create a map from "observed p value" to real p value.
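To make the rough idea concrete, here is a sketch of a simpler relative of that approach: a single-step "max-T" permutation adjustment. It is not the procedure from the List paper, and it works with the largest test statistic rather than the lowest p value, but the logic is the same. The function names, the Welch-style t statistic and the toy data are all my own illustrative choices.

```python
import random
from statistics import mean, variance


def abs_t(values, labels):
    """Absolute Welch t statistic comparing units labelled 1 with units labelled 0."""
    a = [v for v, g in zip(values, labels) if g == 1]
    b = [v for v, g in zip(values, labels) if g == 0]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se


def max_t_adjusted_p(outcomes, labels, n_perm=2000, seed=1):
    """Single-step max-T adjusted P values, one per outcome variable."""
    rng = random.Random(seed)
    observed = [abs_t(y, labels) for y in outcomes]
    shuffled = list(labels)
    exceed = [0] * len(outcomes)
    for _ in range(n_perm):
        # Reassign treatment and control at random, so every null is true.
        # The same shuffle is used for all outcomes, which preserves
        # whatever dependence there is between the tests.
        rng.shuffle(shuffled)
        biggest = max(abs_t(y, shuffled) for y in outcomes)
        for j, t in enumerate(observed):
            if biggest >= t:
                exceed[j] += 1
    # Add one to top and bottom so no adjusted P value is exactly zero.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Toy data: 20 control and 20 treated units, one real effect and one null.
data_rng = random.Random(0)
labels = [0] * 20 + [1] * 20
with_effect = [data_rng.gauss(0, 1) + 3 * g for g in labels]
no_effect = [data_rng.gauss(0, 1) for _ in labels]
print(max_t_adjusted_p([with_effect, no_effect], labels))
```

Each adjusted value answers: "if all the nulls were true, how often would the most extreme of my tests look at least this extreme?"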

This is cute, and potentially gives non-conservative p values, but it clearly requires a lot of work to do, especially if you haven't thought about it from the start. Which brings me to the last point:

Write your analysis with multiple testing in mind.

If you have sense, you're already using something like knitr to get your statistics into your paper. But if you then want to go back and correct your p values, you need to collect all the p values in one place and then correct them. Doing this could entail a lot of rewriting of your code. Bear this in mind at the start of the analysis, so that you produce your p values in one place.
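One way to do that is to record every p value in a single registry as you produce it, tagged with the group of tests it belongs to. The sketch below is in Python rather than R, and the class, the family names and the numbers are all hypothetical. It uses a Bonferroni correction within each family, but any correction could be swapped in.

```python
from collections import defaultdict


class PValueRegistry:
    """Collects p values as they are produced, grouped into families of related tests."""

    def __init__(self):
        self._families = defaultdict(list)

    def record(self, family, label, p):
        """Store a p value and hand it back, so this can wrap an existing expression."""
        self._families[family].append((label, p))
        return p

    def corrected(self, family):
        """Bonferroni-corrected p values for one family, keyed by label."""
        tests = self._families[family]
        m = len(tests)
        return {label: min(1.0, p * m) for label, p in tests}


registry = PValueRegistry()
# Hypothetical numbers, recorded wherever each test happens to be run.
registry.record("country differences", "measure 1", 0.004)
registry.record("country differences", "measure 2", 0.03)
registry.record("robustness checks", "distance from Britain", 0.04)

print(registry.corrected("country differences"))
print(registry.corrected("robustness checks"))
```

Because each family is corrected separately, adding a robustness check for a reviewer does not change the corrected p values for your key hypotheses.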

Apologies for the length of this post. I'll try to be shorter in future (and write about more enjoyable topics....)

Thursday 10 March 2016

I corrected for multiple testing and lived