Science is reportedly in the middle of a reproducibility crisis. Reproducibility seems laudable and is frequently called for (e.g., in Nature and Science). In general the argument is that research that can be independently reproduced is more reliable than research that cannot. It is also worth noting that reproducing research is not solely a checking process: it can provide useful jumping-off points for future research questions. It is difficult to find a counter-argument to these claims, but arguing that reproducibility is laudable in general glosses over the fact that it is a significant amount of work for each research group to make their research (easily) reproducible for independent scientists. While much of the attention has focused on entirely repeating laboratory experiments, there are many simpler forms of reproducibility including, for example, the ability to recompute analyses on known sets of data.
Different types of scientific research are inherently easier or harder to reproduce. At one extreme is analytic mathematical research, which should in many cases allow for straightforward reproduction based on the equations in the manuscript. At the other extreme are field-based studies, which may depend upon factors that are not under the control of the scientist. To use an extreme example, it will always be effectively impossible to entirely reproduce a before and after study of the effects of a hurricane.
The current frontier of reproducibility is somewhere between these two extreme examples, and the location of this frontier at any given time depends upon the set of tools available to researchers. Open source software, cloud computing, data archiving, standardised biological materials, and widely available computing resources have all pushed this frontier to allow for the reproduction of more types of research than was previously the case. However, the "reproducibility crisis" rhetoric suggests that the current set of tools, while substantial, has not completely solved the problem.
We recently worked on a project -- a moderately complex analysis of a moderately sized database (49061 rows) -- that we treated as an experiment to determine what it would take to make it fully reproducible. (This was to answer a very simple research question: what proportion of the world's plant species are woody?) Our specific experiences in trying to make this research reproducible may be useful for the ongoing discussion of how to allow scientists with less time and fewer technical skills than we had available to make their research reproducible. In other words, how do we usefully move the "frontier of reproducibility" to include more types of studies and in doing so make more science more reliable?
In the end, our analysis and paper have been reproduced independently, and it is relatively easy for anyone who wants to do so, but implementing this level of reproducibility took considerable effort. For those interested, the entirety of our code and documentation is available here.
There are two parts to the reproducibility of a project such as this: the data and the analysis. We should note that this project was only possible because of recent developments in data archiving. It was relatively straightforward to write a script that downloads the main data from Dryad and prepares it for analysis. However, this proved to be only the beginning: the analysis portion turned out to be much more challenging. The following is essentially a list of lessons learned from that experience. Each point below details one of the challenges we faced in making our research reproducible and the tool we chose to address that challenge.
Challenges and tools for reproducibility
Using canonical data sources
We downloaded data from canonical sources (Dryad and The Plant List) and only modified them programmatically, so that the chain of modification was preserved. The benefits of open data will only be realised if we preserve the identity of data sets and do not end up re-archiving hundreds of slightly modified versions. This also helps ensure credit for data contributors. However, issues such as taxonomic standardisation remain a real stumbling block for ecological data reuse.
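As a rough illustration of the idea of leaving the canonical file untouched and applying every change in code, here is a minimal Python sketch. It is not the code used in our project (which was written in R); the column names `species` and `woody`, the helper names, and the cleaning rules are all hypothetical. The checksum step is an assumption about how one might confirm that a downloaded file is still the archived original.

```python
import csv
import hashlib
import io


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_raw(data: bytes, expected_sha256: str) -> bytes:
    """Refuse to continue if the raw file differs from the archived copy."""
    actual = sha256_of(data)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    return data


def clean_rows(raw: bytes) -> list[dict]:
    """Apply all modifications in code, leaving the raw bytes unchanged.

    Hypothetical rules: trim whitespace, lower-case the 'woody' code,
    and drop rows with an empty species name.
    """
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    cleaned = []
    for row in reader:
        species = row["species"].strip()
        if not species:
            continue
        cleaned.append({"species": species, "woody": row["woody"].strip().lower()})
    return cleaned
```

Because the cleaning lives in a function rather than in a hand-edited copy of the file, anyone holding the same archived file and the same script should end up with the same table.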
Combining thoughts and code
We used knitr to implement the analysis in a literate programming style. The entire analysis, including justification of the core functions, is available to interested people. However, working with blocks of ugly data-wrangling code, or with long-running calculations, remains a challenge.
Dynamic generation of figures
All of our data manipulation was handled with scripts, and we could delete all figures/outputs and recreate them at will.
Automated caching of dependencies
We used make to document dependencies between components of the project, rebuilding only the sections that required changing. This also makes the build process somewhat self-documenting.
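The rule that make applies is simple: a target is rebuilt if it is missing or if any of its dependencies has been modified more recently than the target. The Python sketch below imitates that rule to show the idea; it is only an illustration, not a replacement for make, and the function names `is_stale` and `build` are made up for this example.

```python
import os


def is_stale(target: str, dependencies: list[str]) -> bool:
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(target: str, dependencies: list[str], recipe) -> bool:
    """Run the recipe only when the target is stale.

    Returns True if the recipe was run, False if the cached target was kept.
    """
    if is_stale(target, dependencies):
        recipe()
        return True
    return False
```

A figure that depends on a cleaned data file and a plotting script would then be regenerated only after one of those two files changes, which is what keeps long analyses tolerable to rerun.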
All of our scripts were under version control using git from the beginning, enabling us to dig back through old versions. This was central to everything we did! See this article for much more on how version control facilitates research.
Automated checking that modifications don't break things
We used the "continuous integration" environment Travis CI to guard against changes in the analysis causing it to fail. Every time we make a change, this system downloads the source code and all relevant data, then runs the analysis, sending us an email if anything fails. It even uploads the compiled versions of the analysis and manuscript each time it runs.
We used packrat for managing and archiving R package dependencies to ensure future repeatability. In theory, this means that if software versions change enough to break our scripts, we have an archived set of packages that can be used. This is a very new tool; only time will tell if this will work.
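packrat goes further than this, since it archives the package sources themselves, but the core bookkeeping can be shown in a few lines. The Python sketch below records the installed version of each package and reports any that later differ from the record. It is an analogue written for illustration, not how packrat works internally, and the function names are hypothetical.

```python
from importlib import metadata


def snapshot_versions(packages: list[str]) -> dict:
    """Record the installed version of each package (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(recorded: dict, current: dict) -> dict:
    """Return {package: (recorded, current)} for every version that changed."""
    return {
        name: (recorded[name], current.get(name))
        for name in recorded
        if recorded[name] != current.get(name)
    }
```

A record like this would also have helped with the incomplete documentation of required package versions described below, although it does nothing to guarantee that the old versions can still be installed.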
We found that moving from running analyses on one person's computer (with their particular constellation of software locations) to another was difficult. For example, see this issue for the trouble that we had running the analyses on our own computers, knowing the scope of the project. It's hard to anticipate all possible causes for confusion: one initial try at replication by Carl Boettiger had trouble due to incomplete documentation of required package versions.
The set of scripts that manages the above jobs is comparable in size to the actual analysis; this is a large overhead to place on researchers. There are also many different languages and frameworks involved, increasing both the technical knowledge required and the chance that something will break. Automating as much of this process as possible is essential for reproducibility to become standard practice.
The continuous integration approach has huge potential to save headaches in managing computational research projects. However, while our analysis acts as a proof-of-concept, it will be of limited general use: it requires that the project is open source (in a public GitHub repository), and that the analysis is relatively quick to run (under an hour). These limitations are reasonable given that it is a free service, but they don't match well with many research projects where development does not occur "in the open", and where computation can take many hours or days.
We found our reproducibility goals for this paper to be a useful exercise, and it forms the basis of ongoing research projects. However, the process is far too complicated at the moment. It is not going to be enough to simply tell people to make their projects reproducible. We need to develop tools that are at least as easy to use as version control before we can expect project reproducibility to become mainstream.
We don't disagree with Titus Brown that partial reproducibility is better than nothing (50% of people making their work 50% reproducible would be better than 5% of people making their work 100% reproducible!). However, we disagree with Titus in his contention that new tools are not needed. The current tools are very raw and too numerous to expect widespread adoption from scientists whose main aim is not reproducibility. Given that reproducibility isn't compelling, we can't expect people to pour their time into it just for some public greater good, especially if it comes with a large time cost.
Other efforts towards this simple goal of recomputability are not much more encouraging than ours. A study by the UBC Reproducibility Group found that they could not reproduce the results in 30% of published analyses using the population genetic package STRUCTURE, even with the same data as provided by the authors. In an even more trivial case, a research group at the University of Arizona found that they could only build about half of the published software that they could download, without even testing that the software did what it was intended to do (note that this study is currently being reproduced!).
The process of making our study reproducible reveals that we are only part of the way to making reproducible research broadly accessible to practicing scientists.