Site icon R-bloggers

Setting up RStudio Server on a Cloud for Collaboration and Reproducibility

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.

When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an “opinionated” solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported. We will keep the motivation for good versioning and reproducibility short: R projects evolve over time, as do the packages that they rely on. R projects that do not control package versions will eventually break and/or not be shareable or reproducible1.

Since R is a slowly evolving language, it might be reasonable to require that a particular Linux instance have only one version of R installed. However, requiring all R users to use the same versions of all packages to facilitate collaboration is clearly out of the question. The solution is to control package versions at the project level.

We use packrat to control package versions. Already integrated with RStudio Server, packrat ensures that all installed packages are stored with the project2, and that these packages are available when a project is opened. With packrat, we know that project A will always be able to use ggplot2 2.5.0 and project B will always be able to use ggplot2 3.1.0. This is important if we want to be able to reproduce results in the future.

On Linux, packrat stores compiled packages in packrat/lib/<LINUX_FLAVOR>/<R_VERSION>, an R-version-specific path, relative to the project’s base directory. An issue arises if we are using R version 3.5.0 one week and then upgrade to R 3.5.1 the next week: a packrat project will not find the 3.5.0 libraries anymore, and we will need to rebuild all the packages to install them in the 3.5.1 path. packrat will automatically build all packages from source (sources are stored in packrat/src) if it notices they are missing. However, this process can take tens of minutes, depending on the number of packages being built. Since this can be cumbersome when collaborating, we also opt to include the packrat/lib path in version control, thereby committing the compiled libraries as well.

Our solution is to bind one fixed R version to an instance3 and release fixed-R instance images periodically. We prefer limited, consistent R-versions over continually upgrading to the most recent version of R. This approach helps to ensure reproducibility and make collaboration easier, avoids having to use docker containers4. While binding a fixed version of R to an instance may seem restrictive, we have found that it is in fact quite liberating. Since we only update the existing R version infrequently (think once a year), the barrier of agreeing on an R-version is removed and with it any need to agree on package versions at the user level. Instead, packages are distributed with the project via git. The benefits of fixing the R version for a particular instance are:

What we lose by not being on the bleeding edge of (thankfully relatively non-critical) bug fixes we gain in ease of collaboration. Here’s what we’ve done to accomplish this:

Here is a quick example script showing the workflow:

git clone git@github.com:ras44/rstudio-instance.git
cd rstudio-instance
git checkout centos7_R3.5.0_RSS1.1.453
./install.sh
sudo passwd <USERNAME> # set user password for RStudio Server login
cd
git clone git@github.com:ras44/rstudio-project.git new-project
cd new-project
git checkout dev-linux-centos7-R3.5.0
rm -rf .git
git init

Finally, here are some issues with packrat that we have run into along with our solutions. Note that RStudio support has been very helpful in addressing issues while monitoring and providing solutions via their github issue tracker.


  1. Unless you somehow exclusively use packages that are never updated, never implement version-breaking/major version updates, or always provide backwards-compatible version upgrades. Many R packages are in major version 0, meaning there is no guarantee that a future release will maintain the same API.
  2. In the packrat/ directory
  3. It is possible to have multiple R versions installed on a system. I have avoided that for simplicity.
  4. docker containers may be a good alternate solution, but in this case we are not using them.
  5. rstudio-project contains all packages in the anaconda distribution and more.

To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.