Reproducible Environments

April 21, 2019

(This article was first published on R Views, and kindly contributed to R-bloggers)




Great data science work should be reproducible. The ability to repeat
experiments is part of the foundation for all science, and reproducible work is
also critical for business applications. Team collaboration, project validation,
and sustainable products presuppose the ability to reproduce work over time.

In my opinion, mastering just a handful of important tools will make
reproducible work in R much easier for data scientists. R users should be
familiar with version control, RStudio projects, and literate programming
through R Markdown. Once these tools are mastered, the major remaining challenge
is creating a reproducible environment.

An environment consists of all the dependencies required to enable your code to
run correctly. This includes R itself, R packages, and system dependencies. As
with many programming languages, it can be challenging to manage reproducible R
environments. Common issues include:

  • Code that used to run no longer runs, even though the code has not changed.
  • Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
  • Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.

These challenges can be addressed through a careful combination of tools and
strategies. This post describes two use cases for reproducible environments:

  1. Safely upgrading packages
  2. Collaborating on a team

Each section below covers a strategy for one use case, along with the tools
needed to implement it. Additional use cases, strategies, and tools are
presented at https://environments.rstudio.com. This website is a work in
progress, but we look forward to your feedback.

Safely Upgrading Packages

Upgrading packages can be a risky affair. It is not difficult to find serious R
users who have been in a situation where upgrading a package had unintended
consequences. For example, the upgrade may have broken parts of their current code, or upgrading a
package for one project accidentally broke the code in another project. A
strategy for safely upgrading packages consists of three steps:

  1. Isolate a project
  2. Record the current dependencies
  3. Upgrade packages

The first step in this strategy ensures one project’s packages and upgrades
won’t interfere with any other projects. Isolating projects is accomplished by
creating per-project libraries. A tool that makes this easy is the new renv
package. Inside of your R project, simply use:

# inside the project directory
renv::init()
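
One way to confirm that a project is isolated is to look at the library search
path. The minimal sketch below uses only base R; it assumes it is run from
within the project after initialization, in which case the first entry would
normally sit under the project's renv/library directory. The variable name
project_library is illustrative only.

```r
# List the library paths R searches, in order.
# The first entry is where install.packages() installs by default.
lib_paths <- .libPaths()
project_library <- lib_paths[1]

print(project_library)
```

If the first path points at a shared user or system library instead, the
project is probably not using its own library yet.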

The second step is to record the current dependencies. This step is critical
because it creates a safety net. If the package upgrade goes poorly, you’ll be
able to revert the changes and return to the record of the working state. Again,
the renv package makes this process easy.

# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()

With an isolated project and a safety net in place, you can now proceed to
upgrade or add new packages, while remaining certain the current functional
environment is still reproducible. The pak package can be used to install and
upgrade packages in an interactive environment:

# upgrade packages quickly and safely
pak::pkg_install("ggplot2")
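
The three steps can also be chained together. The following is a rough sketch,
not a feature of renv or pak: safe_upgrade() and its check argument are
hypothetical names introduced here for illustration. It assumes the project
has already been initialized with renv, that both packages are installed, and
that check is a function you supply which returns TRUE when your code still
works (for example, by running your tests).

```r
# Hypothetical helper: upgrade one package, keep it only if `check` passes.
safe_upgrade <- function(pkg, check) {
  # 1. Record the current, working state in renv.lock
  renv::snapshot(prompt = FALSE)

  # 2. Install or upgrade the package into the project library
  pak::pkg_install(pkg)

  # 3. Run the user-supplied check; treat any error as a failure
  ok <- isTRUE(tryCatch(check(), error = function(e) FALSE))

  if (ok) {
    # Keep the upgrade: record the new state
    renv::snapshot(prompt = FALSE)
  } else {
    # Roll back: reinstall the versions recorded in renv.lock
    renv::restore(prompt = FALSE)
  }
  invisible(ok)
}

# Example call (not run):
# safe_upgrade("ggplot2", check = function() file.exists("analysis.R"))
```

The rollback here only goes back to the snapshot taken at the start of the
call; returning to an older lockfile still requires the version control steps
shown earlier.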

The safety net provided by the renv package relies on access to older versions
of R packages. For public packages, CRAN provides these older versions in the
CRAN archive. Organizations can use tools like RStudio Package Manager to make
multiple versions of private packages available. The “snapshot and restore”
approach can also be used to promote content to production. In fact, this
approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R
applications to production each day!
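
To make the role of the CRAN archive concrete, the sketch below builds the
download address for one archived source package. The helper name
cran_archive_url() is hypothetical, and ggplot2 3.1.0 is used only as an
example of a version that has since been superseded. The address layout is the
one used by the main CRAN site for archived versions; the current release of a
package lives elsewhere.

```r
# Hypothetical helper: build the CRAN archive URL for a given package version.
cran_archive_url <- function(pkg, version) {
  sprintf(
    "https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
    pkg, pkg, version
  )
}

url <- cran_archive_url("ggplot2", "3.1.0")
print(url)

# Not run: installing from a source tarball may require build tools
# install.packages(url, repos = NULL, type = "source")
```

In practice renv::restore() does this kind of lookup for you; the sketch is
only meant to show where the older versions come from.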

Team Collaboration

A common challenge on teams is sharing and running code. One strategy that
administrators and R users can adopt to facilitate collaboration is
shared baselines. The basics of the strategy are simple:

  1. Administrators set up a common environment for R users by installing RStudio Server.
  2. On the server, administrators install multiple versions of R.
  3. Each version of R is tied to a frozen repository using an Rprofile.site file.

By using a frozen repository, either administrators or users can install
packages while still being sure that everyone will get the same set of packages.
A frozen repository also ensures that adding new packages won’t upgrade other
shared packages as a side effect. New packages and upgrades are offered to users
over time through the addition of new versions of R.
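
As a rough illustration of tying a version of R to a frozen repository, an
Rprofile.site file (found in the etc directory of each R installation) could
set the repos option as sketched below. The repository URL is a placeholder
invented for this example, not a real service; substitute the address of your
own CRAN clone or snapshot.

```r
# Rprofile.site (sketch): runs at startup for every user of this R version.
local({
  # Placeholder URL: replace with your organization's frozen repository
  frozen_repo <- "https://packages.example.com/cran/2019-04-15"

  # Make install.packages() use the frozen repository by default
  options(repos = c(CRAN = frozen_repo))
})
```

Because each installed version of R has its own Rprofile.site, each version can
point at a different frozen date.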

Frozen repositories can be created by manually cloning CRAN, accessing a service
like MRAN, or utilizing a supported product like RStudio Package Manager.

Adaptable Strategies

The prior sections presented specific strategies for creating reproducible
environments in two common cases. The same strategy may not be appropriate for
every organization, R user, or situation. If you’re a student reporting an
error to your professor, capturing your sessionInfo() may be all you need. In
contrast, a statistician working on a clinical trial will need a robust
framework for recreating their environment. Reproducibility is not binary!
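
For the lightweight end of that spectrum, here is a minimal sketch of saving
sessionInfo() output to a text file that can be attached to a bug report. The
file name session-info.txt and the use of a temporary directory are arbitrary
choices made for this example.

```r
# Capture the printed session details (R version, platform, loaded packages)
info <- utils::capture.output(utils::sessionInfo())

# Write them to a text file; the file name is an arbitrary choice
out_file <- file.path(tempdir(), "session-info.txt")
writeLines(info, out_file)

print(out_file)
```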

To help pick between strategies, we’ve developed a strategy map. By answering
two questions, you can quickly identify where your team falls on this map and
identify the nearest successful strategy. The two questions are represented on
the x- and y-axes of the map:

  1. Do I have any restrictions on what packages can be used?
  2. Who is responsible for managing installed packages?

For more information on picking and using these strategies, please visit
https://environments.rstudio.com. By adopting a strategy for reproducible
environments, R users, administrators, and teams can solve a number of important
challenges. Ultimately, reproducible work adds credibility, creating a solid
foundation for research, business applications, and production systems. We are
excited to be working on tools to make reproducible work in R easy and fun. We
look forward to your feedback, community discussions, and future posts.
