Tracking private R dependencies with packrat & git submodules
Here at Methods we often use RStudio’s packrat package to version our package dependencies and help ensure our work is reproducible. Packrat handles public packages on CRAN or GitHub just fine, but we have a lot of internal packages hosted privately on GitLab that we’d like to have packrat manage like the rest of our dependencies. This comes up very naturally for us, as we often make client-specific R packages we then want to use in other work for that client.
There are several options for working with private packages in packrat, some of which I’ll discuss more at the end of this post, but the solution we use is to bring the private package into our project as a git submodule.
Git submodules are a way of including one git repository inside of another, where the submodule is a pointer to a specific commit of some other git repository. We can make the private package a submodule of our project, then tell packrat that we have local packages inside our project. To ensure this is portable, so other people can reproduce our work, we have to be careful when setting up packrat. I’ve prepared a demonstration of this in a repo on GitLab, and we’ll walk through the steps below.
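Before the real walkthrough, the "pointer" idea can be seen in a throwaway sketch that needs no network access. Everything here is an assumption made for illustration: the repository names pkg and project are hypothetical stand-ins, both live in a temporary directory, and the protocol.file.allow setting is only there because recent git versions refuse local-path submodules by default.

```shell
#!/bin/sh
# Throwaway demo: every repository lives in a temp directory.
set -eu
work=$(mktemp -d)
cd "$work"

# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "pkg" stands in for the private package repository (hypothetical name).
git init -q pkg
printf 'Package: pkg\nVersion: 0.0.1\n' > pkg/DESCRIPTION
git -C pkg add DESCRIPTION
git -C pkg commit -q -m "initial package"

# "project" stands in for the analysis project.
git init -q project
cd project
git commit -q --allow-empty -m "start project"

# protocol.file.allow is only needed because this demo "remote" is a local
# path; an https or ssh URL would not need it.
git -c protocol.file.allow=always submodule --quiet add "$work/pkg" packages/pkg
git commit -q -m "add pkg as a submodule"

# .gitmodules maps the submodule path to its URL ...
cat .gitmodules
# ... and the project's tree records a commit (mode 160000), not the files.
git ls-tree HEAD packages/
```

The last command lists packages/pkg with type commit rather than tree, which is all the parent repository stores about the package.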
A Real(-ish) Example
We generally don’t use packrat from the start of a project; it adds enough overhead to the workflow that it isn’t worth enabling by default, so we wait until reproducibility matters enough to justify that cost before turning packrat on. Thus we’ll begin this demo by starting a small example project without packrat.
If you go to this commit you’ll see the initial state of our demo project. There’s a README, an .Rproj file, and a very brief analysis.R file:
# analysis.R
library(chifishr)

# from the examples
chi_fisher_p(treatment, "outcome", "treatment")
chi_fisher_p(treatment, "gender", "treatment")
Our project depends on chifishr (from Caleb’s series of posts on R Packages + Gitlab), plus all the packages it depends on, all their dependencies, and so forth.
If we now decide this is worth packrat-ing, we can enable packrat by calling packrat::init() in an R console. You’ll need packrat first, which you can get from CRAN via install.packages("packrat"). You can also do this through RStudio by going to the “Project Options” if you’re using an RStudio project (which I highly recommend; we have an .Rproj for every R project and package at Methods).
This is what I see:
> packrat::init()
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
                 _
    BH           1.66.0-1
    R6           2.2.2
    Rcpp         0.12.18
    assertthat   0.2.0
    bindr        0.1.1
    bindrcpp     0.2.2
    chifishr     0.0.0.9000
    cli          1.0.0
    crayon       1.3.4
    dplyr        0.7.6
    fansi        0.3.0
    glue         1.3.0
    magrittr     1.5
    packrat      0.4.9-3
    pillar       1.3.0
    pkgconfig    2.0.2
    plogr        0.2.0
    purrr        0.2.5
    rlang        0.2.2
    tibble       1.4.2
    tidyselect   0.2.4
    utf8         1.1.4

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
Fetching sources for R6 (2.2.2) ... OK (CRAN current)
Fetching sources for Rcpp (0.12.18) ... OK (CRAN current)
Fetching sources for assertthat (0.2.0) ... OK (CRAN current)
Fetching sources for bindr (0.1.1) ... OK (CRAN current)
Fetching sources for bindrcpp (0.2.2) ... OK (CRAN current)
Fetching sources for chifishr (0.0.0.9000) ... FAILED
Fetching sources for cli (1.0.0) ... OK (CRAN current)
Fetching sources for crayon (1.3.4) ... OK (CRAN current)
Fetching sources for dplyr (0.7.6) ... OK (CRAN current)
Fetching sources for fansi (0.3.0) ... OK (CRAN current)
Fetching sources for glue (1.3.0) ... OK (CRAN current)
Fetching sources for magrittr (1.5) ... OK (CRAN current)
Fetching sources for packrat (0.4.9-3) ... OK (CRAN current)
Fetching sources for pillar (1.3.0) ... OK (CRAN current)
Fetching sources for pkgconfig (2.0.2) ... OK (CRAN current)
Fetching sources for plogr (0.2.0) ... OK (CRAN current)
Fetching sources for purrr (0.2.5) ... OK (CRAN current)
Fetching sources for rlang (0.2.2) ... OK (CRAN current)
Fetching sources for tibble (1.4.2) ... OK (CRAN current)
Fetching sources for tidyselect (0.2.4) ... OK (CRAN current)
Fetching sources for utf8 (1.1.4) ... OK (CRAN current)
Error in snapshotSources(project, activeRepos(project), allRecordsFlat) :
  Errors occurred when fetching source files:
Error in getSourceForPkgRecord(pkgRecord, sourceDir, availablePkgs, repos) :
  Failed to retrieve package sources for chifishr 0.0.0.9000 from CRAN (internet connectivity issue?)
In addition: Warning message:
In FUN(X[[i]], ...) :
  Package 'chifishr 0.0.0.9000' was installed from sources; Packrat will assume this package is available from a CRAN-like repository during future restores
While you might think packrat successfully captured all dependencies besides chifishr, it turns out the failure on chifishr means packrat hasn’t been initialized at all:
> packrat::status()
Error: This project has not yet been packified.
Run 'packrat::init()' to init packrat.
(though it did modify my .gitignore)
It turns out packrat doesn’t handle source-not-available very gracefully. This is where submodules come in!
At the command line (or a Terminal tab in RStudio), we use git submodule add to add the chifishr repo as a submodule of our project’s repo. We’ll stick it in a packages/ folder to keep things organized; this is particularly useful if you have more than one submodule’d dependency.
git submodule add https://gitlab.com/scheidec/chifishr.git ./packages/chifishr
You can see the result in this commit. If you click into packages/ you’ll see chifishr doesn’t look like a normal folder; it is associated with the specific commit we cloned from (most recent on master by default), and if you click on it you’ll be transported to the GitLab page for chifishr (because a submodule is just a pointer!).
Now we need to tell packrat we have a local repo of packages in our packages/ folder. This means we need to give one extra argument to packrat::init().
This does NOT work well through the RStudio UI because it will only let you pick full paths, which are specific to your system and thus not portable (I bet you don’t have ~/Code/claytonjy/ on your machine!). To make sure other people can use this project, we need to have a relative path from the root of our project, e.g. ./packages or just packages.
> packrat::init(options = list(local.repos = c("packages")))
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
                 _
    BH           1.66.0-1
    R6           2.2.2
    Rcpp         0.12.18
    assertthat   0.2.0
    backports    1.1.2
    base64enc    0.1-3
    bindr        0.1.1
    bindrcpp     0.2.2
    chifishr     0.0.0.9000
    cli          1.0.0
    crayon       1.3.4
    digest       0.6.16
    dplyr        0.7.6
    evaluate     0.11
    fansi        0.3.0
    glue         1.3.0
    highr        0.7
    htmltools    0.3.6
    jsonlite     1.5
    knitr        1.20
    magrittr     1.5
    markdown     0.8
    mime         0.5
    packrat      0.4.9-3
    pillar       1.3.0
    pkgconfig    2.0.2
    plogr        0.2.0
    praise       1.0.0
    purrr        0.2.5
    rlang        0.2.2
    rmarkdown    1.10
    rprojroot    1.3-2
    stringi      1.2.4
    stringr      1.3.1
    testthat     2.0.0
    tibble       1.4.2
    tidyselect   0.2.4
    tinytex      0.7
    utf8         1.1.4
    withr        2.1.2
    xfun         0.3
    yaml         2.2.0

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
# output truncated
Fetching sources for chifishr (0.0.0.9000) ... OK (local)
# output truncated
Fetching sources for yaml (2.2.0) ... OK (CRAN current)
Snapshot written to '/home/claytonjy/Code/claytonjy/packrat-submodule-demo/packrat/packrat.lock'
Installing BH (1.66.0-1) ... OK (built source)
# output truncated
Installing chifishr (0.0.0.9000) ... OK (built source)
Initialization complete!

Restarting R session...
And we can check this actually worked, too:
> packrat::status()
Up to date.
Awesome! Now our repository looks like this, with the packrat/ subfolder, a special .Rprofile created by packrat, and a new line in our .gitignore so we don’t commit built versions of our packages to git.
It’s also interesting to look at the chifishr entry in packrat/packrat.lock as it’s a bit different from the others. This file is where packrat keeps track of each package our project depends on, along with metadata like version and where it came from.
Package: chifishr
Source: source
Version: 0.0.0.9000
Hash: 4ed916f9c88ee65137ef51b5e0c70cc9
Requires: dplyr, magrittr, purrr
SourcePath: packages/chifishr
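So packrat knows which version we’re using, a hash it computed for the package (a packrat fingerprint rather than a git commit; the commit itself is tracked by the submodule), and where we’ve kept it. That last line is important, and is what would be non-portable if we had used the RStudio interface to initialize packrat.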
Now to share this project, a coworker simply needs to git clone it and open the .Rproj file in RStudio. Packrat will work its magic to install all our dependencies on their machine (without clobbering their system-wide packages), from source. That might take a bit, but then source("analysis.R") will work using the exact same code we used, down to the dependencies. That’s reproducibility!
You can also share by bundling with packrat::bundle(). Internally, we collaborate over git, but bundling is nice for handing off code to clients who don’t have access to our git repos.
While we used a public package here for ease of demonstration, this works with private packages (that’s the whole point) and will even work if the next user doesn’t have permissions to the git repo of the private package! This is because, by default, packrat stores the source code for all dependencies in packrat/src/, which is all packrat needs to install chifishr somewhere else.
Updating
Suppose development continues on chifishr, and we want to bring those updates into our project. This means we need to update both our submodule and our packrat lockfile.
If you aren’t the person that added the submodule, you’ll first need to fetch it. When git clone-ing something with a submodule, the submodule is just a pointer to where the code is, and that code isn’t brought in locally. We can get that code with a single git command to both initialize and update our submodule:
git submodule update --init
That will go fetch the code from the repo(s) that hold our submodule(s). For this to work, you do need access to those repositories.
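To see the effect of that command without touching a real remote, here is a minimal sketch that plays both roles locally. The names pkg, project, and coworker are hypothetical, the repositories live in a temporary directory, and protocol.file.allow is set only because the stand-in "remote" is a local path.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream package (hypothetical) and a project that includes it as a submodule.
git init -q pkg
printf 'Package: pkg\nVersion: 0.0.1\n' > pkg/DESCRIPTION
git -C pkg add DESCRIPTION
git -C pkg commit -q -m "initial package"

git init -q project
git -C project commit -q --allow-empty -m "start project"
git -C project -c protocol.file.allow=always \
  submodule --quiet add "$work/pkg" packages/pkg
git -C project commit -q -m "add pkg as a submodule"

# A coworker's fresh clone: the submodule directory exists but is empty.
git clone -q "$work/project" coworker
before=$(ls -A coworker/packages/pkg)
echo "files before update: '${before}'"

# Initialize and fetch the submodule in one step.
git -C coworker -c protocol.file.allow=always submodule --quiet update --init
echo "files after update:"
ls -A coworker/packages/pkg
```

After the update the package's files are present in coworker/packages/pkg, checked out at exactly the commit the project recorded.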
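We can then navigate into the submodule and work with it like any other git repo. Note that git submodule update leaves the submodule on a detached HEAD, so if you fetched it that way, check out a branch before pulling:
cd packages/chifishr
git checkout master
git pull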
Then we can install the updated package and save that version to packrat with a pair of commands in R:
packrat::install("packages/chifishr")
packrat::snapshot()
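If you look at git status after all that (from the project, not the submodule), you’ll see updates to both the packrat/ folder and the packages/chifishr entry, which is the submodule pointer now referencing the newer commit (.gitmodules itself only changes if the submodule’s URL or path changes); be sure to commit both!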
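The git half of this update can be rehearsed in a throwaway setup. This is only a sketch under assumptions: pkg and project are hypothetical local repositories in a temporary directory, and the packrat steps are left out because they need R. It shows what git status reports for the submodule after a pull, and how the new pointer is committed.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream package (hypothetical) and a project that includes it as a submodule.
git init -q pkg
printf 'Package: pkg\nVersion: 0.0.1\n' > pkg/DESCRIPTION
git -C pkg add DESCRIPTION
git -C pkg commit -q -m "initial package"

git init -q project
git -C project commit -q --allow-empty -m "start project"
git -C project -c protocol.file.allow=always \
  submodule --quiet add "$work/pkg" packages/pkg
git -C project commit -q -m "add pkg as a submodule"

# Development continues upstream.
printf 'Package: pkg\nVersion: 0.0.2\n' > pkg/DESCRIPTION
git -C pkg commit -q -am "bump version"

# Pull the new commit inside the submodule (it is still on a branch here
# because this checkout was created by `git submodule add`).
git -C project/packages/pkg pull -q --ff-only

# From the project, the submodule path itself is what shows as modified.
cd project
status=$(git status --short)
echo "$status"

# Stage and commit the new pointer.
git add packages/pkg
git commit -q -m "update pkg submodule"
```

In a real project the same commit would also include the changes packrat::snapshot() made under packrat/.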
Ignoring Source
If you choose to ignore source files (something we do a lot to save space in our repositories), each new user will need to run git submodule update --init after cloning but before opening the .Rproj and letting packrat initialize. This is because the submodule is then the only place to get the source from, and packrat will crash and burn if it can’t find the source, a bit like our first attempt at packrat::init() above.
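If remembering the extra step is a concern, git can also fetch submodules at clone time with its --recurse-submodules flag, which the workflow above does not use but which achieves the same result. A minimal local sketch, again with hypothetical pkg and project repositories in a temporary directory:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream package (hypothetical) and a project that includes it as a submodule.
git init -q pkg
printf 'Package: pkg\nVersion: 0.0.1\n' > pkg/DESCRIPTION
git -C pkg add DESCRIPTION
git -C pkg commit -q -m "initial package"

git init -q project
git -C project commit -q --allow-empty -m "start project"
git -C project -c protocol.file.allow=always \
  submodule --quiet add "$work/pkg" packages/pkg
git -C project commit -q -m "add pkg as a submodule"

# Clone and populate submodules in one command; protocol.file.allow is only
# needed because the demo submodule URL is a local path.
git -c protocol.file.allow=always clone -q --recurse-submodules \
  "$work/project" fresh

# The package source is already in place before any R session starts.
ls -A fresh/packages/pkg
```

Either route works; what matters is that the submodule is populated before packrat first tries to restore the library.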
If you deploy this project elsewhere (perhaps it’s a Shiny app that builds automatically with a CI service like Travis or GitLab’s built-in CI) without source, you need to make sure a similar submodule-updating step occurs. In GitLab CI, this means setting a variable like GIT_SUBMODULE_STRATEGY; see the GitLab CI documentation for more info.
Compatibility
While we used a GitLab repository for both our project and our dependency, nothing here is specific to GitLab; it should work regardless of where you host your code, e.g. GitHub, Bitbucket, internally, etc. Note that if your dependency is a public repository, the submodule approach isn’t necessary, since any internet-connected computer could clone the code.
Other Approaches
Using submodules isn’t the only way to use packrat with private packages; there are at least two others.
We could tell packrat that our chifishr package is an external one, which tells packrat to expect it to exist on the system and prevents it from storing a copy like it does for other packages. I don’t like this because it breaks reproducibility by relying on each user having the right version of the package installed. This might be less concerning for older, more stable packages, but our internal packages often change rapidly enough that having the exact same version everywhere a project is used is essential.
Another option is to make a CRAN-like repository with our private package in it, host that somewhere accessible from anywhere we’d want to use the package (e.g. a server or shared drive on the company intranet), and then point packrat at that. Packrat provides great instructions on how to do this, but it would require us to maintain some infrastructure. At the very least we’d need to manage the place we put this CRAN-like repository, and we’d also want to make sure that repository gets updated versions of our private package (e.g. using CI/CD). I think this is a fine option for some people, particularly if you don’t mind the upkeep and don’t want to teach your coworkers about submodules, but at Methods we strongly prefer the no-infrastructure approach of submodules.