Setting up RStudio Server on a Cloud for Collaboration and Reproducibility

April 16, 2019

(This article was first published on R Views, and kindly contributed to R-bloggers)

Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn.

When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an “opinionated” solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported. We will keep the motivation for good versioning and reproducibility short: R projects evolve over time, as do the packages that they rely on. R projects that do not control package versions will eventually break, become impossible to share, or stop being reproducible[1].

Since R is a slowly evolving language, it might be reasonable to require that a particular Linux instance have only one version of R installed. However, requiring all R users to use the same versions of all packages to facilitate collaboration is clearly out of the question. The solution is to control package versions at the project level.

[Figure: R system, user, and Packrat library locations in Linux CentOS 7]

We use packrat to control package versions. Already integrated with RStudio Server, packrat ensures that all installed packages are stored with the project[2], and that these packages are available when a project is opened. With packrat, we know that project A will always be able to use ggplot2 2.2.1 and project B will always be able to use ggplot2 3.1.0. This is important if we want to be able to reproduce results in the future.

On Linux, packrat stores compiled packages in packrat/lib//, an R-version-specific path, relative to the project’s base directory. An issue arises if we are using R version 3.5.0 one week and then upgrade to R 3.5.1 the next week: a packrat project will not find the 3.5.0 libraries anymore, and we will need to rebuild all the packages to install them in the 3.5.1 path. packrat will automatically build all packages from source (sources are stored in packrat/src) if it notices they are missing. However, this process can take tens of minutes, depending on the number of packages being built. Since this can be cumbersome when collaborating, we also opt to include the packrat/lib path in version control, thereby committing the compiled libraries as well.
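
To make the R-version-specific path concrete, here is a minimal sketch of a check for whether a project already carries a compiled library for a given R version. It assumes the layout packrat/lib/<platform>/<R version>/ under the project directory; the function name has_packrat_lib is hypothetical and not part of packrat.

```shell
# Succeed if the project holds a compiled packrat library for the given R version.
# Assumption: compiled packages live in packrat/lib/<platform>/<R version>/.
has_packrat_lib() {
  local project_dir="$1" r_version="$2" dir
  # The glob stands in for the platform directory, whatever it is called.
  for dir in "$project_dir"/packrat/lib/*/"$r_version"; do
    [ -d "$dir" ] && return 0
  done
  return 1
}
```

For example, `has_packrat_lib . 3.5.1 || echo "libraries will be rebuilt from packrat/src"` would warn before opening a project on an upgraded instance.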

Our solution is to bind one fixed R version to an instance[3] and release fixed-R instance images periodically. We prefer limited, consistent R versions over continually upgrading to the most recent version of R. This approach helps to ensure reproducibility, makes collaboration easier, and avoids having to use Docker containers[4]. While binding a fixed version of R to an instance may seem restrictive, we have found that it is in fact quite liberating. Since we only update the existing R version infrequently (think once a year), the barrier of agreeing on an R version is removed and with it any need to agree on package versions at the user level. Instead, packages are distributed with the project via git. The benefits of fixing the R version for a particular instance are:

  • Sharing packrat projects and reproducing results are both made easier, since pre-compiled libraries are included with the projects.
  • Fixing the R-version on an instance doesn’t keep us from upgrading R for a project, as packrat will automatically build and install libraries if an upgraded version is detected. In this way, a project can be opened on an instance with an upgraded R version and have its libraries compiled. Our limited instance image release schedule means the overhead to handle this only occurs at a maximum of once each year.
  • Results are very unlikely to differ across R versions. Still, being able to tie project results to one R version allows us to upgrade R for a project while verifying that results remain as expected.
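
Tying results to one R version depends on knowing which version a project's library was built with. As a minimal sketch, assuming the lockfile at packrat/packrat.lock carries a header line of the form "RVersion: 3.5.0", a shell check might look like the following. The function names are hypothetical and are not part of packrat or of the repositories described in this post.

```shell
# Print the R version recorded in a packrat lockfile.
# Assumption: the lockfile header contains a line such as "RVersion: 3.5.0".
lock_r_version() {
  awk -F': *' '$1 == "RVersion" { print $2; exit }' "$1"
}

# Succeed only if the lockfile's R version equals the version passed in.
r_version_matches() {
  [ "$(lock_r_version "$1")" = "$2" ]
}
```

On an instance, the second argument could come from `Rscript -e 'cat(as.character(getRversion()))'`, so that a mismatch is noticed before packrat starts rebuilding libraries.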

What we lose by not being on the bleeding edge of (thankfully relatively non-critical) bug fixes we gain in ease of collaboration. Here’s what we’ve done to accomplish this:

  • rstudio-instance contains branches with scripts to set up a Linux instance with fixed R and RStudio versions. We git clone the repo and git checkout the branch suitable for the Linux flavor, R-version, and RStudio version we want. The scripts also ensure R is not auto-updated in the future.
  • We then run the install script to set up the instance and archive an image of it for future use.
  • Once the fixed-R instance is set up, we turn to rstudio-project, which contains an R-version-specific base project with pre-built, packrat-managed, fixed versions of many popular data-science packages[5].
  • We git clone rstudio-project to a new project directory locally and remove the existing .git directory so that it can be turned into a new git repo with git init.
  • We open the project in RStudio and begin work. All packages are pre-built, so we don’t have to go through lengthy installs. We can upgrade packages in the packrat library from the “Packages” tab, and then run packrat::snapshot() to save any libraries and upgrades into the project’s packrat/ directory. We can then git add packrat to add any packrat updates to the project’s git repo.
  • If we ever need to duplicate results, we can always build the same fixed-R instance (or clone the image we stored earlier), clone the project on the instance, and know that it will work exactly the same as when we previously worked on it… sometimes years earlier.

Here is a quick example script showing the workflow:

git clone git@github.com:ras44/rstudio-instance.git
cd rstudio-instance
git checkout centos7_R3.5.0_RSS1.1.453
./install.sh
sudo passwd "$USER"  # set the current user's password for RStudio Server login
cd
git clone git@github.com:ras44/rstudio-project.git new-project
cd new-project
git checkout dev-linux-centos7-R3.5.0
rm -rf .git
git init
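
The script above stops at git init. For the later step of saving packrat updates into the project's repo, a minimal sketch might look like the following. It assumes packrat::snapshot() has already been run in R and that the function is called from the project root; the name commit_packrat and the default commit message are hypothetical.

```shell
# Stage the project's packrat/ directory and commit it if anything changed.
# Assumption: run from the project root after packrat::snapshot().
commit_packrat() {
  local message="${1:-Update packrat library}"
  git add packrat || return 1
  # Nothing staged under packrat/ means there is nothing new to record.
  if git diff --cached --quiet -- packrat; then
    echo "packrat/ unchanged"
    return 0
  fi
  # Commits everything currently staged, which includes packrat/.
  git commit -m "$message"
}
```

Since compiled libraries are committed too, each such commit can be large, which is the price paid for not rebuilding packages on every clone.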

Finally, here are some issues with packrat that we have run into, along with our solutions. Note that RStudio support has been very helpful in monitoring these issues and providing solutions via their GitHub issue tracker.

  • If R crashes and the packrat libraries are not accessible after RStudio restarts the session, the project might need to be re-opened. Run .libPaths() to ensure the project library paths are correct. Verify that libraries are accessible by looking at the “Packages” tab in RStudio Server and ensuring that a “Project Library” header exists with all packages (see above image). Follow issue discussion.

  • An issue can arise when some packages are updated but others aren’t. This can be challenging to troubleshoot and raises the question of what to do when package versions become incompatible with each other. This is not a packrat problem but a version-compatibility problem.

  • Support for installing packages directly from a private/internal GitHub is still evolving. An easy workaround exists: clone the package to a local directory such as ~/local_repos/, then use install_local() to install it from there. See issue for details.

  • packrat can occasionally have very slow snapshots, particularly with projects that contain many R Markdown files and packages. This is likely due to packrat’s dependency searches. As discussed in the issue, we resolve it by ignoring all of our source directories with packrat::set_opts(ignored.directories=c("all","my","R","src","directories")) and then running packrat::snapshot(ignore.stale=TRUE, infer.dependencies=FALSE).
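
The snapshot workaround in the last item can also be run from the command line. The sketch below is one way to do that, assuming Rscript is on the PATH and the command is run from the project root so that packrat is active for the session; the helper names snapshot_expr and fast_snapshot are hypothetical.

```shell
# Build the R expression that tells packrat to skip the given directories
# during dependency inference and then takes a snapshot.
snapshot_expr() {
  local dirs="" d
  for d in "$@"; do
    # Quote each directory name and join the names with ", ".
    dirs="${dirs:+$dirs, }\"$d\""
  done
  printf 'packrat::set_opts(ignored.directories = c(%s)); packrat::snapshot(ignore.stale = TRUE, infer.dependencies = FALSE)\n' "$dirs"
}

# Run the snapshot non-interactively (requires R and packrat).
fast_snapshot() {
  Rscript -e "$(snapshot_expr "$@")"
}
```

Calling `snapshot_expr R src` only prints the expression, which is a convenient way to inspect it before handing it to `fast_snapshot R src`.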


  1. Unless you somehow exclusively use packages that are never updated, never implement version-breaking/major version updates, or always provide backwards-compatible version upgrades. Many R packages are in major version 0, meaning there is no guarantee that a future release will maintain the same API.
  2. In the packrat/ directory.
  3. It is possible to have multiple R versions installed on a system. I have avoided that for simplicity.
  4. Docker containers may be a good alternative solution, but in this case we are not using them.
  5. rstudio-project contains all packages in the anaconda distribution and more.
