Reproducible Environments

[This article was first published on R Views, and kindly contributed to R-bloggers.]

Great data science work should be reproducible. The ability to repeat
experiments is part of the foundation for all science, and reproducible work is
also critical for business applications. Team collaboration, project validation,
and sustainable products presuppose the ability to reproduce work over time.

In my opinion, mastering just a handful of important tools will make
reproducible work in R much easier for data scientists. R users should be
familiar with version control, RStudio projects, and literate programming
through R Markdown. Once these tools are mastered, the major remaining challenge
is creating a reproducible environment.

An environment consists of all the dependencies required to enable your code to
run correctly. This includes R itself, R packages, and system dependencies. As
with many programming languages, it can be challenging to manage reproducible R
environments. Common issues include:

  • Code that used to run no longer runs, even though the code has not changed.
  • Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
  • Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.

These challenges can be addressed through a careful combination of tools and
strategies. This post describes two use cases for reproducible environments:

  1. Safely upgrading packages
  2. Collaborating on a team

The sections below each cover a strategy to address the use case, and the necessary
tools to implement each strategy. Additional use cases, strategies, and tools are
presented on a companion website. That website is a work in
progress, but we look forward to your feedback.

Safely Upgrading Packages

Upgrading packages can be a risky affair. It is not difficult to find serious R
users who have been in a situation where upgrading a package had unintended
consequences. For example, the upgrade may have broken parts of their current
code, or upgrading a package for one project may have accidentally broken the
code in another project. A
strategy for safely upgrading packages consists of three steps:

  1. Isolate a project
  2. Record the current dependencies
  3. Upgrade packages

The first step in this strategy ensures one project’s packages and upgrades
won’t interfere with any other projects. Isolating projects is accomplished by
creating per-project libraries. A tool that makes this easy is the new renv
package. Inside your R project, simply use:

# inside the project directory
renv::init()

The second step is to record the current dependencies. This step is critical
because it creates a safety net. If the package upgrade goes poorly, you’ll be
able to revert the changes and return to the record of the working state. Again,
the renv package makes this process easy.

# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
# ("abc123" stands in for a real commit hash)
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()

With an isolated project and a safety net in place, you can now proceed to
upgrade or add new packages, while remaining certain the current functional
environment is still reproducible. The pak package can be used to install and
upgrade packages in an interactive environment:

# upgrade packages quickly and safely
pak::pkg_install("ggplot2")  # "ggplot2" is only an example package name

The safety net provided by the renv package relies on access to older versions
of R packages. For public packages, CRAN provides these older versions in the
CRAN archive. Organizations can use tools like RStudio Package Manager to make
multiple versions of private packages available. The “snapshot and restore”
approach can also be used to promote content to production. In fact, this
approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R
applications to production each day!

Team Collaboration

A common challenge on teams is sharing and running code. One strategy that
administrators and R users can adopt to facilitate collaboration is
shared baselines. The basics of the strategy are simple:

  1. Administrators set up a common environment for R users by installing RStudio Server.
  2. On the server, administrators install multiple versions of R.
  3. Each version of R is tied to a frozen repository using an Rprofile.site file.

By using a frozen repository, either administrators or users can install
packages while still being sure that everyone will get the same set of packages.
A frozen repository also ensures that adding new packages won’t upgrade other
shared packages as a side-effect. New packages and upgrades are offered to users
over time through the addition of new versions of R.

Frozen repositories can be created by manually cloning CRAN, accessing a service
like MRAN, or utilizing a supported product like RStudio Package Manager.
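
As a rough illustration of tying one version of R to a frozen repository, a
site-wide profile file (Rprofile.site, found in the etc directory of each R
installation) can set the default package repository. The sketch below is only
an assumption about what such a file might contain; the repository URL is a
placeholder, not a real endpoint, and would be replaced by your own frozen
repository address.

```r
# Hypothetical contents of <R_HOME>/etc/Rprofile.site for a single R version.
# The URL is a placeholder for a frozen (for example, dated) repository snapshot.
local({
  repos <- c(CRAN = "https://packages.example.com/cran/2019-11-01")
  options(repos = repos)
})
```

Because each installed version of R reads its own Rprofile.site at startup,
every user of that R version would then see the same repository when calling
install.packages.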

Adaptable Strategies

The prior sections presented specific strategies for creating reproducible
environments in two common cases. The same strategy may not be appropriate for
every organization, R user, or situation. If you’re a student reporting an
error to your professor, capturing your sessionInfo() may be all you need. In
contrast, a statistician working on a clinical trial will need a robust
framework for recreating their environment. Reproducibility is not binary!
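
For the lightweight end of that spectrum, a minimal sketch of capturing
session details with base R might look like the following. The output file
name is an arbitrary choice made for this example.

```r
# Capture the R version, platform, and loaded package versions as text
info <- capture.output(sessionInfo())

# Save the record to a file (the file name is arbitrary)
writeLines(info, "session-info.txt")

# Print it so it can be pasted into an email or issue report
cat(info, sep = "\n")
```

The saved text can be attached to a bug report so that whoever reads it can
see which versions were in use.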

To help pick between strategies, we’ve developed a strategy map. By answering
two questions,
you can quickly identify where your team falls on this map and identify the
nearest successful strategy. The two questions are represented on the x- and
y-axes of the map:

  1. Do I have any restrictions on what packages can be used?
  2. Who is responsible for managing installed packages?

For more information on picking and using these strategies, please visit the
website mentioned above. By adopting a strategy for reproducible
environments, R users, administrators, and teams can solve a number of important
challenges. Ultimately, reproducible work adds credibility, creating a solid
foundation for research, business applications, and production systems. We are
excited to be working on tools to make reproducible work in R easy and fun. We
look forward to your feedback, community discussions, and future posts.
