A fundamental aspect of the reproducible research framework is that (statistical) analysis can be reproduced; that is, given a set of instructions (or a script file) the exact results can be achieved by another analyst with the same raw data. This idea may seem intuitive, but in practice it can be difficult to achieve in an analytical environment that is always evolving and changing.
For example, Tidyverse users will have recently seen the following warning in R:
'data_frame()' is depreciated, use 'tibble()'.
data_frame(a = 1:3)
This is because
dplyr has changed, hopefully for the better, and
tibble() is now the preferred route. It is conceivable that in a future release the
data_frame() function will no longer be a part of
dplyr at all.
So, what does this have to do with reproducible research? Say you want to come back to your analysis in 5 years and re-run your old code. Your future installation of R with the latest Tidyverse installation may not be backwards compatible with your now ancient old code and the analysis will crash.
There are a couple of strategies that we could use to deal with updating dependencies. The first, and possibly the easiest, is to use
devtools to install older package versions (see here for further details). But what if the version of the package you used is not archived on CRAN? For example, the analysis could have used a package from GitHub that has since changed and the maintainer hasn’t used systematic releases for you to pull from. Thus, this strategy is quite likely to fail.
Another solution is to put your entire project inside a static virtual environment that is isolated from the rest of your machine. The benefit of project isolation is that any dependencies that are installed within the environment can be made persistent, but are also separated from any dependency upgrades that might be applied to your host machine or to other projects.
Docker isn’t a new topic for regular R-Bloggers readers, but for those of you that are unfamiliar: Docker is a program that uses virtual containers, which isolate and bundle applications. The containers are defined by a set of instructions detailed in a
Dockerfile and can effectively be stacked to call upon the capabilities of base images. Furthermore, one of the fundamental tenets of Docker is portability – if a container will work on your local machine then it will work anywhere. And because base images are versioned, they will work for all time. This is also great for scalability onto servers and distribution to colleagues and collaborators who will see exactly what you have prepared regardless of their host operating system.
Our group at the Telethon Kids Institute has prepared source code to launch a dockerised RStudio Server instance via a NGINX reverse proxy enabled for HTTPS connections. Our GitHub repository provides everything you need to quickly establish a new project (just clone the repository) with the features of RStudio Web Server with the added benefit of SSL encryption (this will need some local configuration); even though RStudio Server is password protected, web encryption is still important for us as we routinely deal with individual level health data.
(Important note, even with data security measures in place, our policy is for patient data to be de-identified prior to analysis as per good clinical practice norms and, except for exceptional circumstances, we would use our intranet to further restrict access).
docker-compose.yml that we use can be found at out our GitHub site via this link: https://github.com/TelethonKids/rstudio. We used an RStudio base image with Tidyverse pre-installed with R 3.5.3, which is maintained by rocker at https://hub.docker.com/r/rocker/tidyverse. As updates are made to the base-image, the repository will be updated with new releases that provide an opportunity to re-install a specific version of the virtual environment. It has also been set up with persistent volumes for the projects, rstudio, and lib/R directories to keep any changes made to the virtual environment for back-up and further version control.
By combining tools like Docker and Git we believe we can refine and make common place, within our institute and those we collaborate with, a culture of reproducible research as we conduct world class research to improve the lives of sick children.