Module 3 The reproducibility crisis

There is a crisis in science (and data science): the reproducibility crisis (also known as the replication crisis). This refers to the fact that many scientific studies have proved impossible to reproduce, calling into question the validity of those studies’ findings. This is a big deal: if a significant part of science is wrong, then what do we know, how can we be sure what we know is right, and how can we separate the wheat from the chaff?

Because of this crisis, a much-needed movement has emerged to make all science “reproducible”. This means making sure that someone else can copy what you did and get the same results. This is important for identifying scientific fraud, of course, but also for helping us to overcome human bias, mistakes, wishful thinking, etc. Reproducibility is not just a “nice-to-have”; in modern science (and data science), it’s a “must”.

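Good data science must be reproducible. And reproducible science means using tools that others can easily use, and methods that others can easily copy. Programming languages like R and Python are well suited to this: both are free and open source, and an analysis written as a script can be rerun step by step by anyone.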

In this course, we’ll focus on reproducible research, literate programming, documentation, and other components of data science (and research in general) which ensure that (a) our methods and findings can be easily sanity-tested by others, and (b) we set ourselves and our projects up for future collaborations, hand-offs, and expansion.

What is ‘reproducible research’?

Reproducible research is the idea that work done by scientist A is “reproducible” by scientist B. In other words, if the findings of the research are of any generalizable value, then the results of two scientists working on the same problem should be identical (or in very close agreement).

In practice, this means using data and code in a structured, well-documented, accessible, clear way, and ensuring that others can do the same.
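As a minimal sketch of what "identical results" can look like in code, the example below runs the same small analysis twice, standing in for scientist A and scientist B. The function name `bootstrap_mean`, the seed value, and the measurements are hypothetical and chosen only for illustration; the point is that when the data, the code, and the random seed are all shared, the two runs agree exactly.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=42):
    """Estimate the mean of `values` by bootstrap resampling with a fixed seed."""
    # A local generator keeps the result independent of any global random state.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)


# Hypothetical measurements, for illustration only.
measurements = [2.1, 2.5, 1.9, 2.8, 2.4, 2.2]

result_a = bootstrap_mean(measurements)  # "scientist A"
result_b = bootstrap_mean(measurements)  # "scientist B": same data, code, and seed

print(result_a == result_b)  # prints True
```

Had the seed been left unset, the two runs would typically differ slightly, which is why recording choices like this one is part of making work reproducible.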

Why does reproducible research matter?

Reproducible research matters for lots of reasons:

  1. Because making your work reproducible means that you will have fewer problems returning to that work at a later time.
  2. Because making your work reproducible means that others can collaborate with you, help you, error-check you, and build on your work.
  3. Because making your work reproducible means you are fighting the plague of irreproducible results that has characterized the replication crisis.

How to carry out reproducible research

  • Make your code open source
  • Put your code (and, where you can, your data) on GitHub
  • Use open source tools
  • Use free tools
  • Document everything you do
  • Collaborate with others
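One way to act on "document everything you do" is to save a small record of each run alongside its results. The sketch below is only an illustration under stated assumptions: the function `describe_run`, the field names, and the sample data are hypothetical, not a standard format. It uses only the Python standard library to note the Python version, the operating system, a checksum of the input data, and the parameters used.

```python
import hashlib
import json
import platform


def describe_run(data_bytes, parameters):
    """Return a small record of what an analysis was run with."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.system(),
        # The checksum lets someone else confirm they have the same input data.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "parameters": parameters,
    }


# Hypothetical input data and settings, for illustration only.
data = b"id,value\n1,2.1\n2,2.5\n"
record = describe_run(data, {"seed": 42, "n_resamples": 1000})

# In a real project this text would be written to a file next to the results.
print(json.dumps(record, indent=2, sort_keys=True))
```

Committing a record like this to the same repository as the code gives a collaborator (or your future self) a starting point for checking that they are rerunning the same analysis on the same data.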

Final thoughts

An article about RESULTS is advertising, not scholarship. An article with transparent, reproducible methods is scholarship.