Learning to Replicate

Right now the internet is blowing up with more news on the retracted LaCour and Green study.

But this post is mostly about a conference I attended today on replication in the social sciences, hosted by the International Initiative for Impact Evaluation (3ie). The conference followed the Chatham House Rule, essentially asking us not to attribute any of the comments to individuals.

The conference was mostly focused on development economics. 3ie has an ambitious replication initiative focused on a handful of influential papers. What I found really interesting was the tone of the discussion in this audience versus most of my interactions with political scientists. Numerous commentators made direct or indirect claims that the editors of the (econ) journals were the problem: they either had no replication policy or didn’t enforce their replication standards. One overview of the status of replication in economics is here.

As a political scientist, I felt like I was going back in time. I haven’t checked every journal, but I was under the impression that just about every major political science journal has a replication policy in place.

Now I’ve checked. There is a damning study claiming that only 18 of 180 political science journals have replication policies. The good news is that the American Political Science Review, the American Journal of Political Science, and the Journal of Politics all have replication policies, and that impact factor is one of the best predictors of having such a policy. But a few excellent journals do not.

Many journals now require archived data as a condition for publication, and a few journals, like Political Science Research and Methods and the American Journal of Political Science, require a technical replication before publishing. My prediction is that within the next five years most of the major political science journals will require replication as a condition for publication.

I don’t have too much more to say about this, other than that there seem to be major cultural differences in views on replication across fields, and those differences affect data sharing.

A number of presenters discussed the different types (and terms) of replication. Some of this seemed like inside baseball until we began discussing Michael Clemens’s paper on defining replication. I think the most important idea here is that the term “replication” carries a lot of baggage for the original researcher. Individuals (and journals) should encourage replication but be very careful in labeling a study as “failing to replicate.” For example, if a scholar made up the data, that is clearly a failure to replicate. But what if a study of land titling in country X doesn’t yield the same results as an identical land titling study in country Y? Labeling the original study as a failure to replicate suggests wrongdoing by the original author.

For those of us who have followed some of the nasty exchanges between original authors and authors of replication studies (including some of these 3ie studies), it is easy to see how the threat of a “failed replication” label could lead to defensiveness. Many of the discussants highlighted the importance of engaging the original authors in a dialogue. Our goal shouldn’t be to catch mistakes by authors; it should be to correct them so we can learn more about the world.

One of the most thought-provoking presentations was by Brian Nosek on the work being done through the Open Science Framework. Brian presented a number of replication studies, including:

  • The Reproducibility Project: Using a single year (2008) and three top psychology journals, 270 authors attempted to replicate 100 studies. There is a lot of information here, but the quick overview is that a very large percentage of the attempts that collected new data and conducted the same analysis produced insignificant results and smaller effect sizes. Given Clemens’s point on the definition of replication, let’s not call these failed replications. At least one person in the room labeled this publication bias. Either way, it is problematic.
  • The Many Labs Project assigned 27 teams to attempt to answer the same substantive questions with the same data, but the teams had discretion over the coding, method, covariates, and any other specification decisions. There was huge variation in the findings across teams.
  • Nosek presented some evidence on which types of studies failed to replicate. He went through this very quickly, and I can’t find any supporting materials online. But what was especially interesting is that an elite survey of experts in the field didn’t do an especially great job of predicting which studies would replicate, but a prediction market did. The prediction market under-predicted failed replications, but overall did a pretty decent job. Basically, given the right (monetary) incentives, experts could sniff out the studies that weren’t going to replicate.

Within all of the presentations there was an acknowledgement of the incentive problems facing original authors (to have a finding and not be proven wrong) and replication authors (to find something new or wrong to report). But I didn’t hear any clear solutions.

Finally, there was very little discussion of the LaCour and Green controversy. One commentator noted that this is a case of replication success, not failure: by making the data available, science/Science was able to correct itself.

I’m not sure I am as optimistic. But it was nice to see that our discussions weren’t completely derailed by this crazy situation.

A few more sources:

Great resources from Gary King on replication here

Another study that attempts to define replication here

I almost forgot about Andrew Gelman’s Garden of Forking Paths