Setting up RStudio Server on a Cloud for Collaboration and Reproducibility
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.
When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an “opinionated” solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported. We will keep the motivation for good versioning and reproducibility short: R projects evolve over time, as do the packages that they rely on. R projects that do not control package versions will eventually break and/or not be shareable or reproducible1.
Since R is a slowly evolving language, it might be reasonable to require that a particular Linux instance have only one version of R installed. However, requiring all R users to use the same versions of all packages to facilitate collaboration is clearly out of the question. The solution is to control package versions at the project level.
We use packrat
to control package versions. Already integrated with RStudio Server, packrat
ensures that all installed packages are stored with the project2, and that these packages are available when a project is opened. With packrat
, we know that project A will always be able to use ggplot2 2.5.0 and project B will always be able to use ggplot2 3.1.0. This is important if we want to be able to reproduce results in the future.
On Linux, packrat
stores compiled packages in packrat/lib/<LINUX_FLAVOR>/<R_VERSION>
, an R-version-specific path, relative to the project’s base directory. An issue arises if we are using R version 3.5.0 one week and then upgrade to R 3.5.1 the next week: a packrat
project will not find the 3.5.0 libraries anymore, and we will need to rebuild all the packages to install them in the 3.5.1 path. packrat
will automatically build all packages from source (sources are stored in packrat/src
) if it notices they are missing. However, this process can take tens of minutes, depending on the number of packages being built. Since this can be cumbersome when collaborating, we also opt to include the packrat/lib
path in version control, thereby committing the compiled libraries as well.
Our solution is to bind one fixed R version to an instance3 and release fixed-R instance images periodically. We prefer limited, consistent R-versions over continually upgrading to the most recent version of R. This approach helps to ensure reproducibility and make collaboration easier, avoids having to use docker containers4. While binding a fixed version of R to an instance may seem restrictive, we have found that it is in fact quite liberating. Since we only update the existing R version infrequently (think once a year), the barrier of agreeing on an R-version is removed and with it any need to agree on package versions at the user level. Instead, packages are distributed with the project via git. The benefits of fixing the R version for a particular instance are:
- Sharing
packrat
projects and reproducing results are both made easier, since pre-compiled libraries are included with the projects. - Fixing the R-version on an instance doesn’t keep us from upgrading R for a project, as
packrat
will automatically build and install libraries if an upgraded version is detected. In this way, a project can be opened on an instance with an upgraded R version and have its libraries compiled. Our limited instance image release schedule means the overhead to handle this only occurs at a maximum of once each year. - It is very unlikely that results will be different across R-versions, however being able to tie project results to one R-version allows us to upgrade R for a project while ensuring that results remain as expected.
What we lose by not being on the bleeding edge of (thankfully relatively non-critical) bug fixes we gain in ease of collaboration. Here’s what we’ve done to accomplish this:
- rstudio-instance contains branches with scripts to set up a Linux instance with fixed R and RStudio versions. We
git clone
the repo andgit checkout
the branch suitable for the Linux flavor, R-version, and RStudio version we want. The scripts also ensure R is not auto-updated in the future. - We then run the install script to set up the instance and archive an image of it for future use.
- Once the fixed-R instance is set up, rstudio-project contains an R-version specific base project with pre-built,
packrat
-managed, fixed-versions of many popular data-science packages5. - We
git clone
rstudio-project to a new project directory locally and remove the existing.git
directory so that it can be turned it into a new git repo withgit init
. - We open the project in RStudio and begin work. All packages are pre-built, so we don’t have to go through lengthy installs. We can upgrade packages in the
packrat
library of the “Packages” tab, and then runpackrat::snapshot()
to save any libraries and ugrades into the project’spackrat/
directory. We can thengit add packrat
to add anypackrat
updates to the project’s git repo. - If we ever need to duplicate results, we can always build the same fixed-R instance (or clone the image we stored earlier), clone the project on the instance, and know that it will work exactly the same as when we previously worked on it… sometimes years earlier.
Here is a quick example script showing the workflow:
git clone [email protected]:ras44/rstudio-instance.git cd rstudio-instance git checkout centos7_R3.5.0_RSS1.1.453 ./install.sh sudo passwd <USERNAME> # set user password for RStudio Server login cd git clone [email protected]:ras44/rstudio-project.git new-project cd new-project git checkout dev-linux-centos7-R3.5.0 rm -rf .git git init
Finally, here are some issues with packrat
that we have run into along with our solutions. Note that RStudio support has been very helpful in addressing issues while monitoring and providing solutions via their github issue tracker.
If R crashes and the
packrat
libraries are not accessible after the RStudio restarts the session, the project might need to be re-opened. Run.libPaths()
to ensure the project library paths are correct. Verify libraries are accessible by looking at the “packages” tab in RStudio Server and ensuring a “Project Library” header exists with all packages(see above image). Follow issue discussion.An issue can arise when some packages are updated but others aren’t. This can be challenging to troubleshoot and raises the question of what to do when package versions become incompatible with each other. This is not
packrat
, but version compatibility.Installing packages directly from a private/internal github is evolving. An easy solution exists: simply clone the package to a local directory such as
~/local_repos/
. Then useinstall_local()
to install from thelocal_repos
directory. See issue for details.packrat
can occasionally have very slow snapshots, particularly with projects that contains many R-Markdown files and packages. This is likely due topackrat
dependency searches. As discussed in the issue, we resolve it by ignoring all of our source directories withpackrat::set_opts(ignored.directories=c("all","my","R","src","directories")
and then runningpackrat::snapshot(ignore.stale=TRUE, infer.dependencies=FALSE)
.
- Unless you somehow exclusively use packages that are never updated, never implement version-breaking/major version updates, or always provide backwards-compatible version upgrades. Many R packages are in major version 0, meaning there is no guarantee that a future release will maintain the same API. ↩
- In the
packrat/
directory ↩ - It is possible to have multiple R versions installed on a system. I have avoided that for simplicity. ↩
- docker containers may be a good alternate solution, but in this case we are not using them. ↩
- rstudio-project contains all packages in the anaconda distribution and more. ↩
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.