David Hugh-Jones: September 2017

Wednesday, 20 September 2017

Area genetics

Here is an interesting map of the UK.

The colours relate to the genetics of people born in each county. Specifically, they show you the average Educational Attainment Polygenic Score (EA PS) of residents from within our sample. EA PS is a DNA measure that can be used to predict a person's level of education (e.g. do they leave school at 16, or get a university degree). Red is the worst, pale yellow is the best.

The black outline shows areas of former coalmining. Coal employment has been declining since the 1920s, and by the 1970s, these areas were often socially deprived.
I won't say much more for now!

Tuesday, 19 September 2017

We preregistered an experiment and lived

For my school experiment with Jinnie, we decided to pre-register our analyses. That seemed like the modern and scientifically rigorous thing to do.

There are different preregistration venues for economists. osf.io is very complete and allows you to upload many kinds of resources, then "freeze" them as a public preregistration. aspredicted.org is at the other extreme: it just asks you 9 questions about your project. The AEA also runs a registry for randomized controlled trials at www.socialscienceregistry.org.

For this project, we decided to use osf.io. We were pretty serious. We uploaded not just a description of our plans, but exact computer code for what we wanted to do. Here's our preregistration on osf.io.

This was the first time I had preregistered a project. We ran into a few hurdles:

We preregistered too late, after we'd already collected data.

This was pure procrastination and lack of planning on our part. Of course it means that we could have run 100 analyses, then preregistered the analysis that worked.

Our preregistered code had bugs.

This was true even though it worked on the fake data we'd used to test it. Luckily we were able to upload a corrected version, but if you had already frozen the files you uploaded, this would be a problem.

Our analysis was not the right one.

The data looked odd and our results weren't significant! Now we faced a dilemma. The correct thing to do would be to admit defeat. You preregister, your results are insignificant... go home. However, it was also reasonably clear that we had assumed our dependent variable would look one way (a nice, normally-ish distributed variable), and in fact it looked completely different (huge spikes at certain values, some weird and very influential outliers).

We were sure that statistically, we should do a different analysis. But of course, then we were in the famous garden of forking paths. So we compromised: we changed the approach, but added an appendix with our initial analysis, plus a rerun of it with some fairly minimal changes (e.g. removing outliers). In fact, even just clustering our standard errors appropriately would give us a significant result, though again, that wasn't in the original plan.
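To put a rough number on the forking-paths worry, here is a minimal sketch in Python. It assumes every analysis tests a true null hypothesis at the 5% level and that the analyses are independent of each other. Neither assumption comes from our actual project: real forking paths reuse the same data, so they are correlated and the true figure would be lower. The function name `prob_any_false_positive` is made up for this illustration.

```python
def prob_any_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_analyses tests comes out 'significant'.

    Assumes (hypothetically) that every null is true and that the tests
    are independent, so each one has probability alpha of a false positive.
    """
    if n_analyses < 0:
        raise ValueError("n_analyses must be non-negative")
    # P(no false positives) = (1 - alpha) ** n, so take the complement.
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(f"{n:>3} analyses: {prob_any_false_positive(n):.3f}")
```

Under those assumptions, 20 tries already give roughly a 64% chance of at least one spurious star, and 100 tries push it above 99%. That is why an analysis chosen after seeing the data deserves more suspicion than one fixed in advance.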

Bottom line: you are an imperfect researcher. Your initial plan may just be mistaken, and as you think about your project, you may improve on it. Your code may fail. And the data may reveal that your assumptions were wrong. These can raise awkward choices. It is easy to convince yourself that your new analysis, which just happens to get that coveted significance star, is better than your original plan.

Despite these problems, I'm glad we preregistered. This did discipline our analysis. We've tried to keep a clear separation between questions in our analysis plan and exploratory questions which we thought of later, or which seminar participants suggested to us. For example, we have a result where children are more influential on each other if they have many shared friends. Interesting, and it kind of makes sense among our adolescent subjects, but it is exploratory. So, I'd want to see it replicated elsewhere before being fully persuaded this was a real result. By contrast, I am quite confident in our main result, which follows the spirit though not the letter of our plan.

In many cases, preregistering one's code may be over the top. It's better to state clearly and accurately the specific hypotheses you're going to test. There's no way you can be fully specific, but that's fine – the goal is to reduce your degrees of freedom by a reasonable amount. So, I would probably favour the quick aspredicted.org style, over the more complex osf.io style, unless I was running a really involved project.

I've just preregistered an observational analysis of some genetic data. It's over at aspredicted.org, number 5584. Just waiting for my authors to approve...