Open science has grown tremendously in the past few years. While there’s still a long way to go, the availability of data, software, and other materials is making it possible to re-use these products to expand upon previous work and apply them to new areas. Through responsible conduct of research (RCR) training, journal requirements, changing individual and institutional principles, and open access evangelism, it’s now much more common for researchers to package their work with the intention of sharing it with others. What exactly this entails depends on a lot of things, including the field of research, the type of data, and how the data were processed and analyzed. At a minimum, one would hope for all of the appropriate data, a description of the software used for analysis or the software itself, and some metadata describing the data and how they were processed.
One aspect that has received little or no attention is the software environment that was used. By software environment, I mean all of the relevant software that was used by the project. This includes not just “R 3.1.0” or “Python 2.7.8”, but also the additional modules and packages and their specific versions. Given the rapid rate at which environments like R and Python are changing, in particular their commonly-used (and extremely powerful) add-on packages like dplyr, data.table, Pandas, and SciPy, it can very easily become the case that scripts run in a specific environment no longer work properly after just a short period of time. That puts a clock on the reproducibility of your project, which can limit its impact and usefulness.
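To make the idea of a software environment concrete, here is a minimal sketch of recording one by hand using only base R. It is not part of Packrat, and the helper name describe_environment is hypothetical; it simply illustrates the kind of information that needs to be tracked.

```r
# Hypothetical helper (not part of Packrat): summarize the R version and the
# versions of the packages attached in the current session.
describe_environment <- function() {
  # Packages attached via library() show up in search() as "package:<name>".
  attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))
  versions <- vapply(attached,
                     function(pkg) as.character(utils::packageVersion(pkg)),
                     character(1))
  list(r_version = R.version.string, packages = versions)
}

env <- describe_environment()
print(env$r_version)
print(env$packages)
```

The built-in sessionInfo() prints similar information. Keeping such a record up to date by hand, and acting on it later, is the tedious part.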
This post describes Packrat, a package management tool for R. Packrat allows you to easily keep track of the software environment used for a particular project. It also makes the process of recreating that environment almost transparent. Because of how painless it makes sharing reproducible projects—either openly with the public, with collaborators, or just between different machines—Packrat is an essential tool for R.
Although this post is specific to R and Packrat, I hope that it helps users of other environments to think about how they can create reproducible software environments for their own work. Python users should definitely check out virtualenv (and “A non-magical introduction to Pip and Virtualenv for Python beginners”). These tools are very easy to use, so using them to help make your projects reproducible, either from the beginning or at the end, is a minimal investment that can really pay off.
The Big Picture
First, let’s get an overall idea of how Packrat works. You’ll need to have all the data and scripts related to your project in its own directory. How these files are named and organized is up to you. My personal preference is to create an R package that contains all the data, scripts, and documentation for the project, but that’s not necessary.
Once you initialize Packrat on that directory, it will keep track of which version of R is used as well as which packages are used in your scripts, including their version information and source code. This makes your project completely self-contained. Whenever (and wherever) this project is opened, Packrat will make sure that the packages that it uses are available. This means that the project can now be shared, and anyone using it can do so with the same software environment.
Note that because Packrat stores all of the packages used as well as their source code, projects using Packrat will require more disk space.
Packrat is still under development, but it can be installed using devtools.
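A sketch of that installation follows. The repository name rstudio/packrat is my assumption about where the development version lives, since it is not named above. Packrat has also since been released on CRAN, so install.packages("packrat") is an alternative.

```r
# Assumption: devtools is not yet installed, and the development version of
# Packrat is hosted in the "rstudio/packrat" repository on GitHub.
install.packages("devtools")
devtools::install_github("rstudio/packrat")
```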
Initializing Packrat for a Project
Once your project’s scripts are collected in their own directory, we can initialize Packrat to monitor those scripts.
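As a rough sketch, initialization is a single call to packrat::init. The project path below is hypothetical, so replace it with the directory that holds your own scripts.

```r
# Hypothetical project location; replace with your own directory.
project_dir <- path.expand("~/projects/testproject")

# Initialize only if the directory exists, so that a mistyped path produces
# a message rather than an error.
if (dir.exists(project_dir)) {
  packrat::init(project_dir)
} else {
  message("No such project directory: ", project_dir)
}
```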
init will find any packages used by the scripts in your project and download the source code for the version used. It will also store the version of R and a copy of itself, so Packrat does not need to already be installed on other machines in order for your project to be used. All of this will be stored in the newly-created directory named packrat. It will also create an .Rprofile file in the project directory. This script runs automatically when the project is opened, and takes care of the heavy lifting. It will make sure that each of the packages on which the project depends is available, which includes building them if necessary. The collection of packages managed by Packrat is called the private library. Your R session will only know about packages available in this library.
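One way to see the private library in action is to ask R where it looks for packages. The snippet below uses only base R. The interpretation in the comments is my assumption about a session started from a Packrat project.

```r
# Directories that R searches for installed packages, in priority order.
# Assumption: in a session started from a Packrat project, the first entry
# points into the project's packrat directory rather than your usual library.
lib_paths <- .libPaths()
print(lib_paths)

# Names of the packages installed in the first (highest-priority) library.
pkgs <- rownames(installed.packages(lib.loc = lib_paths[1]))
print(pkgs)
```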
If you decide to use additional packages after the project has been initialized, you can add them by installing them. For example, if we decide to rearrange a data set using reshape2, we can install it and its dependencies into the private library.
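In practice this is just the usual install.packages call. The sketch assumes it is run from an R session started in the project directory.

```r
# Run this from an R session started in the project directory, so that the
# package and its dependencies land in the project's private library.
install.packages("reshape2")
```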
Typically, the private library changes when packages are installed or removed or when a package is used or no longer used in a script. By default, Packrat will keep a close eye on your scripts, making sure that the private library is kept up to date. This section demonstrates how to manage the private library using status and a few related functions.
We’ve already seen how to install packages to the private library. Continuing with our previous example, if your scripts don’t yet use reshape2, Packrat will detect this and tell you that it and its dependencies are in the private library, but that they are not needed.
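The check itself is a call to packrat::status. This sketch assumes the working directory is a project that has already been initialized with Packrat.

```r
# Compare the private library against the packages used by the project's
# scripts and report any differences.
packrat::status()
```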
The following packages are installed but not needed:
    _
    plyr       1.8.1
    Rcpp       0.11.2
    reshape2   1.4

Use packrat::clean() to remove them. Or, if they are actually needed by your project, add `library(packagename)` calls to a .R file somewhere in your project.
Once you start using reshape2 in your scripts, Packrat will report that everything is fine.
Up to date.
When a Package is Not Available
Since starting with Packrat, my most common mistake is to start using a package in my scripts without first installing that package to the private library. When this happens, status will let you know (and the scripts won’t work). Let’s say I add functions from pushoverr to my scripts without first installing it:
Error in getPackageRecords(inferredPkgNames, project = project, lib.loc = lib.loc, :
  Unable to retrieve package records for the following pacakges:
  - 'bitops', 'digest', 'httr', 'pushoverr', 'RCurl'
Although this error message is somewhat cryptic, it shows that the library does not contain pushoverr or its dependencies. The problem can be resolved by installing pushoverr.
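A sketch of the fix, assuming pushoverr is available from your configured package repository and that the session was started in the project directory:

```r
# Install the missing package (and its dependencies) into the private library.
install.packages("pushoverr")

# Check again; the private library and the scripts should now agree.
packrat::status()
```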
Up to date.
When a Package is No Longer Used
Packrat will also detect the other situation: when you stop using a package in your project’s scripts. For example, if we no longer use pushoverr, status will display the packages included in the private library that are no longer used.
The following packages are installed but not needed:
    _
    bitops      1.0-6
    digest      0.6.4
    httr        0.3
    pushoverr   0.1.3
    RCurl       1.95-4.1

Use packrat::clean() to remove them. Or, if they are actually needed by your project, add `library(packagename)` calls to a .R file somewhere in your project.
Packrat makes it pretty clear how to handle this situation. Either the packages can be removed from the private library with clean, or the packages can be used in a script.
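Removing the unused packages takes one call, as the message suggests. This assumes the working directory is the Packrat project.

```r
# Remove packages from the private library that no script in the project uses.
packrat::clean()
```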
If we run status again after clean, we can see that Packrat still keeps track of packages after they’ve been removed from the private library.
The following packages are tracked by packrat, but are no longer available in the local library nor present in your code:
    _
    bitops      1.0-6
    digest      0.6.4
    httr        0.3
    pushoverr   0.1.3
    RCurl       1.95-4.1

You can call packrat::snapshot() to remove these packages from the lockfile, or if you intend to use these packages, use packrat::restore() to restore them to your private library.
snapshot will tell Packrat to stop monitoring any packages that are no longer installed. The private library can then be cleaned. Otherwise, if we’ve decided that we do want to use pushoverr, we can re-install it and its dependencies with restore.
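The two options from the status message look roughly like this. Only one of them is needed, so the second is left commented out. Both assume the working directory is the Packrat project.

```r
# Option 1: reinstall the tracked packages that are missing from the
# private library.
packrat::restore()

# Option 2: stop tracking packages that are no longer installed by updating
# the lockfile to match the current state of the project.
# packrat::snapshot()
```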
Packages can be removed from the private library using remove.packages. For example, we can remove the pushoverr package from our private library:
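A sketch of the removal. It assumes the session was started from the project directory, so that the private library is the first entry in .libPaths() and therefore the default target of remove.packages.

```r
# Remove pushoverr from the first library on the search path, which is the
# private library in a Packrat session. Its dependencies are left in place.
remove.packages("pushoverr")
```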
If pushoverr is still used by a script, Packrat will let you know:
Error in getPackageRecords(inferredPkgNames, project = project, lib.loc = lib.loc, :
  Unable to retrieve package records for the following pacakges:
  - 'pushoverr'
Once an uninstalled package is no longer being used in your scripts, status will show how to remove that package and its dependencies from your private library:
The following packages are tracked by packrat, but are no longer available in the local library nor present in your code:
    _
    pushoverr   0.1.3

You can call packrat::snapshot() to remove these packages from the lockfile, or if you intend to use these packages, use packrat::restore() to restore them to your private library.

The following packages are installed but not needed:
    _
    bitops   1.0-6
    digest   0.6.4
    httr     0.3
    RCurl    1.95-4.1

Use packrat::clean() to remove them. Or, if they are actually needed by your project, add `library(packagename)` calls to a .R file somewhere in your project.
Bundling Up the Project
Once the project has reached a state where it’s ready to be shared (or moved to your other computer), it can be “bundled” up using Packrat’s bundle function.
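A minimal sketch of bundling, assuming the working directory is the Packrat project and that the default output location is acceptable:

```r
# Package the project, including its Packrat metadata, into a single
# gzipped tarball and report where it was written.
packrat::bundle()
```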
The packrat project has been bundled at:
- "/home/bdc/testprojet/packrat/bundles/testproject-2014-07-15.tar.gz"
We can now share the resulting file, testproject-2014-07-15.tar.gz.
Opening A Bundled Project
There are two ways a bundled project can be “unbundled”. If Packrat is installed on the target machine, the unbundle function can be used.
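A sketch of that call follows. The bundle file name comes from the example above, it is assumed to sit in the current working directory, and /home/bob/projects is an example destination that must already exist on the target machine.

```r
# Extract the bundled project into /home/bob/projects. By default, unbundle
# also restores the project's private library after extracting.
packrat::unbundle(bundle = "testproject-2014-07-15.tar.gz",
                  where = "/home/bob/projects")
```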
Otherwise, the bundled project can simply be untarred and un-gzipped using most file archiving programs. Mac OS X users can double click on the bundled file in Finder. Users of Linux or other Unix-like systems (including OS X) can unbundle the project from the command line.
tar -xzf testproject-2014-07-15.tar.gz
This will unbundle the project into the current directory. To mimic the previous unbundle call and specify a location to install the project, we can add that to the command:
tar -xzf testproject-2014-07-15.tar.gz --directory /home/bob/projects
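The same extraction can also be done from within R using untar from the utils package, which ships with R, so it works even when Packrat is not installed. The sketch below is self-contained: because the real bundle is not available here, it first builds a small stand-in archive (the file and directory names are hypothetical) and then extracts it into a chosen directory, mirroring the --directory option.

```r
# Work in a scratch directory so that nothing in your real projects is touched.
scratch <- tempfile("packrat-demo-")
dir.create(scratch)
old_wd <- setwd(scratch)

# Stand-in for a bundled project: a directory containing a single script.
dir.create("testproject")
writeLines("print('hello')", file.path("testproject", "analysis.R"))
utils::tar("testproject.tar.gz", files = "testproject", compression = "gzip")

# Extract into a target directory, like `tar -xzf ... --directory <dir>`.
target <- file.path(scratch, "projects")
dir.create(target)
utils::untar("testproject.tar.gz", exdir = target)

# List what was extracted, relative to the target directory.
extracted <- list.files(target, recursive = TRUE)
print(extracted)

# Return to the original working directory.
setwd(old_wd)
```

With a real bundle, only the untar call is needed, pointed at the bundle file and the directory where the project should live.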
With either option, the testproject directory will now be available on the target machine. When R is started from that directory, Packrat will do its magic and make sure the software environment is the same as on the source machine.
That covers the basics of Packrat! The commands covered above, from init through unbundle, should be all you need to maintain projects whose software environment can be easily reproduced. Packrat does offer some additional functions for more directly managing the state of private libraries (see ?packrat::snapshot) as well as capabilities for easily moving in and out of Packrat projects (see ?packrat::packrat_mode for details).
Because Packrat is being developed by the fantastic people at RStudio, it is also nicely integrated into the latest versions of RStudio. If that’s the environment you use, be sure to check out their guide for Using Packrat with RStudio.