By Doug Ashton – Data Scientist, UK
Just like you, I like to try out all the latest tech. If there’s a new feature in Shiny then I’ll download the latest version without thinking. I’ve currently got 4 versions of R on my laptop, 270 packages, 2 versions of Java, and a number of other open source tools. While being on the cutting edge is part of my job, it conflicts with the strict audit and reproducibility requirements that we have for project work.
One problem with R is that, because CRAN changes so quickly, it can be difficult to maintain a consistent combination of packages across your team and production servers. The R community has responded to this problem with a number of noteworthy packages for managing package libraries, such as packrat, checkpoint, switchr and our own pkgsnap. Another approach is to use the MRAN mirror to freeze CRAN to a particular date.
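To make the date-freezing idea concrete, here is a minimal sketch in Python that builds a dated snapshot URL and the line of R code that would point a session at it. The base URL and its `<base>/<YYYY-MM-DD>` layout are assumptions about how the mirror exposes its snapshots, and `snapshot_repo_url` and `r_repos_option` are hypothetical helper names, so check your mirror's documentation before relying on the pattern.

```python
from datetime import date

# Assumed base URL for dated CRAN snapshots; substitute whatever your mirror documents.
SNAPSHOT_BASE = "https://mran.microsoft.com/snapshot"


def snapshot_repo_url(snapshot_date: date, base: str = SNAPSHOT_BASE) -> str:
    """Return the repository URL for the CRAN snapshot taken on snapshot_date."""
    return f"{base.rstrip('/')}/{snapshot_date.isoformat()}"


def r_repos_option(snapshot_date: date, base: str = SNAPSHOT_BASE) -> str:
    """Return a line of R code that points the CRAN repo at the frozen snapshot."""
    url = snapshot_repo_url(snapshot_date, base)
    return f'options(repos = c(CRAN = "{url}"))'


if __name__ == "__main__":
    # Everyone on the team who uses the same date installs from the same package set.
    print(r_repos_option(date(2016, 1, 15)))
```

Putting the generated `options()` line at the top of a project's setup script means every `install.packages()` call in that session resolves against the same dated snapshot.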
A bigger problem is how R interacts with the various system dependencies you have installed. This is why, at Mango, we use continuous integration and unit testing to make sure our results are reproducible on dedicated build servers. Even this can leave you scratching your head when tests don’t match.
All this led us to look for a better way of working. We needed an environment that was easily reproducible, and more in line with the production environment we are deploying to. We had already been using Docker for some time, so it was the natural choice.
As described in a previous post, Docker is designed to provide an isolated, portable and repeatable wrapper around your applications. We use this in a number of ways:
1. Reproducible environments
Each project can run inside its own container, completely sandboxed from the rest of your system. We have a number of base images, each built on a specific R version and provisioned with a standard set of packages (using our pkgsnap package) and RStudio Server. Each project can build on one of these images, adding any project-specific package dependencies. The recipe to build this image is stored in a Dockerfile, which can be saved in the project directory. An example project Dockerfile is shown in this demonstration.
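As a rough illustration of what such a recipe contains, the Python sketch below renders a two-line project Dockerfile: a `FROM` line naming a base image and a `RUN` line installing extra R packages. The image name `example/r-rstudio:3.2.3` and the helper `render_dockerfile` are hypothetical, and the sketch uses plain `install.packages()` against a public CRAN mirror rather than pkgsnap, so treat it as a stand-in and not the recipe we actually use.

```python
from typing import Sequence

# Public CRAN mirror used for illustration; a dated snapshot URL would work here too.
DEFAULT_REPO = "https://cloud.r-project.org"


def render_dockerfile(base_image: str, r_packages: Sequence[str],
                      repo: str = DEFAULT_REPO) -> str:
    """Render a small project Dockerfile that extends a base R image."""
    lines = [f"FROM {base_image}"]
    if r_packages:
        # Build an R character vector such as c('data.table', 'ggplot2').
        pkg_vector = ", ".join(f"'{pkg}'" for pkg in r_packages)
        lines.append(
            f"RUN R -e \"install.packages(c({pkg_vector}), repos='{repo}')\""
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical base image name and package list.
    print(render_dockerfile("example/r-rstudio:3.2.3", ["data.table", "ggplot2"]))
```

Saving the rendered text as `Dockerfile` in the project directory and running `docker build` there would produce the project image, assuming the base image exists and has R on its path.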
2. System dependencies
If there are system dependencies such as database connections or external libraries, then building an image with these installed makes it much easier to distribute the project to others. This also makes Docker a great way of trying a new technology without the pain of installing it on your system. For example, the excellent Jupyter/all-spark-notebook image has everything you need to get started with Spark from R, Python or Scala.
Once you’re used to working in containers, the barrier to scaling up compute power when needed becomes significantly lower. Your container will work just the same on your laptop and on a 32-core EC2 instance. You just spin up a node, pull the image and deploy your application. Multiple containers from the same image can be spawned across a grid in seconds, and a small-scale Spark cluster can be swapped out for a much larger one.
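The pull-and-run step can be sketched as a short Python helper that builds the Docker command lines for one node. It is a single-host simplification of the grid scenario: the image name, the `worker-N` container names and the helper `scale_out_commands` are hypothetical, and the default port 8787 assumes the image serves RStudio Server on its usual port.

```python
from typing import List


def scale_out_commands(image: str, replicas: int,
                       first_host_port: int = 8787,
                       container_port: int = 8787) -> List[List[str]]:
    """Build argv lists that pull an image once and start `replicas` containers."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    commands = [["docker", "pull", image]]
    for i in range(replicas):
        commands.append([
            "docker", "run", "-d",            # run detached in the background
            "--name", f"worker-{i}",          # hypothetical naming scheme
            "-p", f"{first_host_port + i}:{container_port}",  # unique host port each
            image,
        ])
    return commands


if __name__ == "__main__":
    for cmd in scale_out_commands("example/r-rstudio:3.2.3", replicas=3):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed. Because every container starts from the same image, the only things that vary between them are the name and the host port.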
For larger software development projects we also use Vagrant as a tool for reproducible development environments. As described in an earlier post, Vagrant is a set of command line tools for managing virtual machines (VMs). It creates a dedicated VM for each project that is consistent across the development team, while adding only a small file to version control.