Tracking private R dependencies with packrat & git submodules

[This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers].

Here at Methods we often use RStudio’s packrat package to version our package dependencies and help ensure our work is reproducible. Packrat handles public packages on CRAN or GitHub just fine, but we have a lot of internal packages hosted privately on Gitlab that we’d like packrat to manage like the rest of our dependencies. This comes up very naturally for us, as we often make client-specific R packages that we then want to use in other work for that client.

There are several options for working with private packages in packrat, some of which I’ll discuss more at the end of this post, but the solution we use is to bring the private package into our project as a git submodule.

Git submodules are a way of including one git repository inside of another, where the submodule is a pointer to some other git repository. We can make the private package a submodule of our project, then tell packrat that we have local packages inside our project. To ensure this is portable, so other people can reproduce our work, we have to be careful when setting up packrat. I’ve prepared a demonstration of this in a repo on Gitlab, and we’ll walk through the steps below.

A Real(-ish) Example

We generally don’t use packrat from the start of a project; it adds enough overhead to the workflow that it isn’t worth enabling by default, so we wait until we care enough about reproducibility to justify that cost before turning it on. Thus we’ll begin this demo by starting a small example project without packrat.

If you go to this commit you’ll see the initial state of our demo project. There’s a README, an .Rproj file, and a very brief analysis.R file:

# analysis.R
library(chifishr)

# from the examples
chi_fisher_p(treatment, "outcome", "treatment")
chi_fisher_p(treatment, "gender", "treatment")

Our project depends on chifishr (from Caleb’s series of posts on R Packages + Gitlab), plus all the packages it depends on, all of their dependencies, and so forth.

If we now decide this is worth packrat-ing, we can enable packrat by calling packrat::init() in an R console. You’ll need packrat first, which you can get from CRAN via install.packages("packrat"). You can also do this through RStudio by going to the “Project Options” if you’re using an RStudio project (which I highly recommend; we have an .Rproj for every R project and package at Methods).

This is what I see:

> packrat::init()
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
               _           
    BH           1.66.0-1  
    R6           2.2.2     
    Rcpp         0.12.18   
    assertthat   0.2.0     
    bindr        0.1.1     
    bindrcpp     0.2.2     
    chifishr     0.0.0.9000
    cli          1.0.0     
    crayon       1.3.4     
    dplyr        0.7.6     
    fansi        0.3.0     
    glue         1.3.0     
    magrittr     1.5       
    packrat      0.4.9-3   
    pillar       1.3.0     
    pkgconfig    2.0.2     
    plogr        0.2.0     
    purrr        0.2.5     
    rlang        0.2.2     
    tibble       1.4.2     
    tidyselect   0.2.4     
    utf8         1.1.4     

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
Fetching sources for R6 (2.2.2) ... OK (CRAN current)
Fetching sources for Rcpp (0.12.18) ... OK (CRAN current)
Fetching sources for assertthat (0.2.0) ... OK (CRAN current)
Fetching sources for bindr (0.1.1) ... OK (CRAN current)
Fetching sources for bindrcpp (0.2.2) ... OK (CRAN current)
Fetching sources for chifishr (0.0.0.9000) ... FAILED
Fetching sources for cli (1.0.0) ... OK (CRAN current)
Fetching sources for crayon (1.3.4) ... OK (CRAN current)
Fetching sources for dplyr (0.7.6) ... OK (CRAN current)
Fetching sources for fansi (0.3.0) ... OK (CRAN current)
Fetching sources for glue (1.3.0) ... OK (CRAN current)
Fetching sources for magrittr (1.5) ... OK (CRAN current)
Fetching sources for packrat (0.4.9-3) ... OK (CRAN current)
Fetching sources for pillar (1.3.0) ... OK (CRAN current)
Fetching sources for pkgconfig (2.0.2) ... OK (CRAN current)
Fetching sources for plogr (0.2.0) ... OK (CRAN current)
Fetching sources for purrr (0.2.5) ... OK (CRAN current)
Fetching sources for rlang (0.2.2) ... OK (CRAN current)
Fetching sources for tibble (1.4.2) ... OK (CRAN current)
Fetching sources for tidyselect (0.2.4) ... OK (CRAN current)
Fetching sources for utf8 (1.1.4) ... OK (CRAN current)
Error in snapshotSources(project, activeRepos(project), allRecordsFlat) : 
  Errors occurred when fetching source files:
Error in getSourceForPkgRecord(pkgRecord, sourceDir, availablePkgs, repos) : 
  Failed to retrieve package sources for chifishr 0.0.0.9000 from CRAN (internet connectivity issue?)
In addition: Warning message:
In FUN(X[[i]], ...) :
  Package 'chifishr 0.0.0.9000' was installed from sources; Packrat will assume this package is available from a CRAN-like repository during future restores

While you might think packrat successfully captured all dependencies besides chifishr, it turns out the failure on chifishr means packrat hasn’t been initialized at all:

> packrat::status()
Error: This project has not yet been packified.
Run 'packrat::init()' to init packrat.

(though it did modify my .gitignore)

It turns out packrat doesn’t handle source-not-available very gracefully. This is where submodules come in!

At the command line (or the Terminal tab in RStudio), we use git submodule add to add the chifishr repo as a submodule of our project’s repo. We’ll stick it in a packages/ folder to keep things organized; this is particularly useful if you have more than one submodule’d dependency.

git submodule add https://gitlab.com/scheidec/chifishr.git ./packages/chifishr

You can see the result in this commit. If you click into packages/ you’ll see chifishr doesn’t look like a normal folder; it is associated with the specific commit we cloned from (most recent on master by default), and if you click on it you’ll be transported to the Gitlab page for chifishr (because a submodule is just a pointer!).
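To make the “just a pointer” idea concrete, here is a minimal sketch that builds two throwaway repositories and looks at what git actually records. The names pkgrepo and project, the temporary directory, and the demo commit identity are all made up for illustration; the protocol.file.allow setting is only there because the sketch clones from a local path rather than from Gitlab.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real project is touched.
demo=$(mktemp -d)
cd "$demo"

# Hypothetical stand-in for the package repo (the post uses chifishr on Gitlab).
git init -q pkgrepo
git -C pkgrepo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# The project that will contain the submodule.
git init -q project
cd project

# protocol.file.allow is only needed because this demo clones from a local path.
git -c protocol.file.allow=always submodule --quiet add "$demo/pkgrepo" packages/chifishr

# .gitmodules records the path and URL of each submodule ...
cat .gitmodules

# ... while the pinned commit lives in the index as a gitlink entry (mode 160000).
git ls-files --stage packages/chifishr
```

The parent project stores only a path, a URL, and a commit id; the package’s files live in the submodule’s own repository.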

Now we need to tell packrat we have a local repo of packages in our packages/ folder. This means we need to give one extra argument to packrat::init().

This does NOT work well through the RStudio UI because it will only let you pick full paths, which are specific to your system and thus not portable (I bet you don’t have ~/Code/claytonjy/ on your machine!). To make sure other people can use this project, we need a path relative to the root of our project, e.g. ./packages or just packages.

> packrat::init(options = list(local.repos = c("packages")))
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
               _           
    BH           1.66.0-1  
    R6           2.2.2     
    Rcpp         0.12.18   
    assertthat   0.2.0     
    backports    1.1.2     
    base64enc    0.1-3     
    bindr        0.1.1     
    bindrcpp     0.2.2     
    chifishr     0.0.0.9000
    cli          1.0.0     
    crayon       1.3.4     
    digest       0.6.16    
    dplyr        0.7.6     
    evaluate     0.11      
    fansi        0.3.0     
    glue         1.3.0     
    highr        0.7       
    htmltools    0.3.6     
    jsonlite     1.5       
    knitr        1.20      
    magrittr     1.5       
    markdown     0.8       
    mime         0.5       
    packrat      0.4.9-3   
    pillar       1.3.0     
    pkgconfig    2.0.2     
    plogr        0.2.0     
    praise       1.0.0     
    purrr        0.2.5     
    rlang        0.2.2     
    rmarkdown    1.10      
    rprojroot    1.3-2     
    stringi      1.2.4     
    stringr      1.3.1     
    testthat     2.0.0     
    tibble       1.4.2     
    tidyselect   0.2.4     
    tinytex      0.7       
    utf8         1.1.4     
    withr        2.1.2     
    xfun         0.3       
    yaml         2.2.0     

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
# output truncated
Fetching sources for chifishr (0.0.0.9000) ... OK (local)
# output truncated
Fetching sources for yaml (2.2.0) ... OK (CRAN current)
Snapshot written to '/home/claytonjy/Code/claytonjy/packrat-submodule-demo/packrat/packrat.lock'
Installing BH (1.66.0-1) ... 
    OK (built source)
# output truncated
Installing chifishr (0.0.0.9000) ... 
    OK (built source)
Initialization complete!

Restarting R session...

And we can check this actually worked, too:

> packrat::status()
Up to date.

Awesome! Now our repository looks like this, with the packrat/ subfolder, a special .Rprofile created by packrat, and a new line in our .gitignore so we don’t commit built versions of our packages to git.

It’s also interesting to look at the chifishr entry in packrat/packrat.lock as it’s a bit different from the others. This file is where packrat keeps track of each package our project depends on, along with metadata like version and where it came from.

Package: chifishr
Source: source
Version: 0.0.0.9000
Hash: 4ed916f9c88ee65137ef51b5e0c70cc9
Requires: dplyr, magrittr, purrr
SourcePath: packages/chifishr

So packrat knows which version we’re using, a hash that identifies it, and where we’ve kept it. That last line is important: it is the part that would have been non-portable if we had used the RStudio interface to initialize packrat.
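A quick way to guard against a non-portable lockfile is to scan it for absolute SourcePath entries. The sketch below is not part of packrat: check_source_paths is a hypothetical helper, and it assumes the lockfile uses one “SourcePath: …” field per line, as in the entry shown above.

```shell
# Print any absolute SourcePath entries found in a packrat lockfile.
# Returns 0 when every SourcePath is relative (portable), 1 otherwise.
check_source_paths() {
    lockfile=$1
    # Treat paths starting with "/", "~", or a Windows drive letter as absolute.
    if grep -nE '^SourcePath: (/|~|[A-Za-z]:)' "$lockfile"; then
        return 1
    fi
    return 0
}

# Typical use from the project root:
#   check_source_paths packrat/packrat.lock
```

Run before committing, this prints the offending lines (with line numbers) and returns a non-zero status if a machine-specific path has slipped in.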

Now to share this project, a coworker simply needs to git clone it and open the .Rproj file in RStudio. Packrat will work its magic to install all our dependencies on their machine (without clobbering their system-wide packages), from source. That might take a bit, but then source("analysis.R") will work using the exact same code we used, down to the dependencies. That’s reproducibility!

You can also share by bundling with packrat::bundle(). Internally, we collaborate over git, but bundling is nice for handing off code to clients who don’t have access to our git repos.

While we used a public package here for ease of demonstration, this works with private packages (that’s the whole point) and will even work if the next user doesn’t have permissions to the git repo of the private package! This is because, by default, packrat stores the source code for all dependencies in packrat/src/, which is all packrat needs to install chifishr somewhere else.

Updating

Suppose development continues on chifishr, and we want to bring those updates into our project. This means we need to update both our submodule and our packrat lockfile.

If you aren’t the person that added the submodule, you’ll first need to fetch it. When git clone-ing something with a submodule, the submodule is just a pointer to where the code is, and that code isn’t brought in locally. We can get that code with a single git command to both initialize and update our submodule:

git submodule update --init

That will go fetch the code from the repo(s) that hold our submodule(s). For this to work, you do need access to those repositories.
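The difference between a plain clone and an initialized submodule is easy to see with throwaway repositories. In this sketch pkg, project, and coworker are hypothetical stand-ins, the g wrapper only supplies a demo identity and permits cloning from a local path, and no R or packrat is involved. If you know up front that you want the submodules, git clone --recurse-submodules does the clone and the update in one step.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Wrapper that supplies a demo identity and permits local-path submodule clones.
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# Hypothetical stand-in for the package repository.
git init -q pkg
echo "Package: chifishr" > pkg/DESCRIPTION
g -C pkg add DESCRIPTION
g -C pkg commit -q -m "add DESCRIPTION"

# The project, with the package added as a submodule.
git init -q project
g -C project submodule --quiet add "$work/pkg" packages/chifishr
g -C project commit -q -m "add chifishr submodule"

# A coworker's plain clone: the submodule directory exists but is empty.
g clone -q project coworker
before=$(ls -A coworker/packages/chifishr)
echo "after plain clone: [$before]"

# Initialize and update: now the package source is present.
g -C coworker submodule --quiet update --init
echo "after update --init:"
ls -A coworker/packages/chifishr
```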

We can then navigate into the submodule and work with it like any other git repo:

cd packages/chifishr
git pull

Then we can install the updated package and save that version to packrat with a pair of commands in R:

packrat::install("packages/chifishr")
packrat::snapshot()

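If you look at git status after all that (from the project, not the submodule), you’ll see changes to both the packrat/ folder and the submodule entry packages/chifishr, which now points at a newer commit (.gitmodules itself only changes if the submodule’s path or URL does); be sure to commit both!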
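To see what a submodule update looks like from the parent project, here is a sketch using throwaway local repositories. pkg and project are hypothetical stand-ins for chifishr and the demo project, the g wrapper only supplies a demo identity and permits local-path clones, and packrat is left out entirely. It bumps the stand-in package, pulls inside the submodule, then captures git status --short from the project root.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Wrapper that supplies a demo identity and permits local-path submodule clones.
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# Hypothetical stand-in for the package repository.
git init -q pkg
echo "Version: 0.0.0.9000" > pkg/DESCRIPTION
g -C pkg add DESCRIPTION
g -C pkg commit -q -m "first version"

# The project pins the package at its current commit.
git init -q project
g -C project submodule --quiet add "$work/pkg" packages/chifishr
g -C project commit -q -m "add chifishr submodule"

# Development continues upstream ...
echo "Version: 0.0.0.9001" > pkg/DESCRIPTION
g -C pkg commit -q -am "bump version"

# ... and we pull it into the submodule.
g -C project/packages/chifishr pull -q --ff-only

# Seen from the project root, what changed?
changed=$(git -C project status --short)
echo "$changed"
```

In this sketch the changed entry is the submodule path packages/chifishr, whose recorded commit has moved; that is the change to stage with git add packages/chifishr alongside the packrat/ updates.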

Ignoring Source

If you choose to ignore source files (something we do a lot to save space in our repositories), each new user will need to run git submodule update --init after cloning but before opening the .Rproj and letting packrat initialize. This is because the submodule is then the only place to get the source from, and packrat will crash and burn if it can’t find the source, a bit like our first attempt at packrat::init() above.

If you deploy this project elsewhere (perhaps it’s a Shiny app that builds automatically with a CI service like Travis or Gitlab’s built-in CI) without source, you need to make sure a similar submodule-updating step occurs. In Gitlab CI, this means setting a variable like GIT_SUBMODULE_STRATEGY; see here for more info.
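Whether it is a new coworker or a CI job, it can help to fail early with a clear message when the submodules have not been fetched. The helper below is a hypothetical sketch, not something packrat or Gitlab provides; it relies on git submodule status prefixing uninitialized submodules with a “-”.

```shell
# Return 0 if every submodule of the repository at "$1" has been checked out,
# 1 (with a hint on stderr) if any is still uninitialized.
submodules_ready() {
    # Uninitialized submodules are listed with a leading "-".
    if git -C "$1" submodule status | grep -q '^-'; then
        echo "Submodules missing; run: git submodule update --init" >&2
        return 1
    fi
    return 0
}

# Typical use from the project root, before starting R:
#   submodules_ready . || exit 1
```

A setup or deploy script could call this before launching R, so the failure points at the missing submodule rather than at packrat.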

Compatibility

While we used a Gitlab repository for both our project and our dependency, nothing here was specific to Gitlab, and it should work regardless of where you host your code, e.g. GitHub, Bitbucket, internally, etc. Note that if your dependency is a public repository, the submodule approach isn’t necessary, since any internet-connected computer could clone the code.

Other Approaches

Using submodules isn’t the only way to use packrat with private packages; there are at least two others.

We could tell packrat that our chifishr package is an external one, which will tell Packrat to expect it to exist on the system and prevent it from storing a copy like it does for other packages. I don’t like this because it breaks reproducibility by relying on each user having the right version of the package installed. This might be less concerning for older, more stable packages, but our internal packages often change rapidly enough that having the exact same version everywhere a project is used is essential.

Another option is to make a CRAN-like repository with our private package in it, host it somewhere accessible from everywhere we’d want to use the package (e.g. a server or shared drive on the company intranet), and then point packrat at that. Packrat provides great instructions on how to do this, but it would require us to maintain some infrastructure. At the very least we’d need to manage the place we put this CRAN-like repository, and we’d also want to make sure that repository gets updated versions of our private package (e.g. using CI/CD). I think this is a fine option for some people, particularly if you don’t mind the upkeep and don’t want to teach your coworkers about submodules, but at Methods we strongly prefer the no-infrastructure approach of submodules.
