Here at Methods we often use RStudio’s
packrat package to version our package dependencies and help ensure our work is reproducible. Packrat handles public packages on CRAN or GitHub just fine, but we have a lot of internal packages hosted privately on Gitlab that we’d like to have
packrat manage like the rest of our dependencies. This comes up very naturally for us, as we often make client-specific R packages we then want to use in other work for that client.
There are several options for working with private packages in
packrat, some of which I’ll discuss more at the end of this post, but the solution we use is to bring the private package into our project as a git submodule.
Git submodules are a way of including one git repository inside of another, where the submodule is a pointer to some other git repository. We can make the private package a submodule of our project, then tell
packrat that we have local packages inside our project. To ensure this is portable, so other people can reproduce our work, we have to be careful when setting up
packrat. I’ve prepared a demonstration of this in a repo on Gitlab, and we’ll walk through the steps below.
A Real(-ish) Example
We generally don’t use
packrat from the start of a project; it adds enough overhead to the workflow that it isn’t worth enabling by default, so we wait until we care enough about reproducibility to suffer the workflow overhead before turning
packrat on. Thus we’ll begin this demo by starting a small example project without packrat.
If you go to this commit you’ll see the initial state of our demo project. There’s a README, an
.Rproj file, and a very brief analysis.R script:
# analysis.R
library(chifishr)

# from the examples
chi_fisher_p(treatment, "outcome", "treatment")
chi_fisher_p(treatment, "gender", "treatment")
If we now decide this is worth
packrat-ing, we can enable
packrat by calling
packrat::init() in an R console. You’ll need
packrat first, which you can get from CRAN via
install.packages("packrat"). You can also do this through RStudio by going to the “Project Options” if you’re using an RStudio project (which I highly recommend; we have an
.Rproj for every R project and package at Methods).
This is what I see:
> packrat::init()
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
    _
    BH 1.66.0-1
    R6 2.2.2
    Rcpp 0.12.18
    assertthat 0.2.0
    bindr 0.1.1
    bindrcpp 0.2.2
    chifishr 0.0.0.9000
    cli 1.0.0
    crayon 1.3.4
    dplyr 0.7.6
    fansi 0.3.0
    glue 1.3.0
    magrittr 1.5
    packrat 0.4.9-3
    pillar 1.3.0
    pkgconfig 2.0.2
    plogr 0.2.0
    purrr 0.2.5
    rlang 0.2.2
    tibble 1.4.2
    tidyselect 0.2.4
    utf8 1.1.4

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
Fetching sources for R6 (2.2.2) ... OK (CRAN current)
Fetching sources for Rcpp (0.12.18) ... OK (CRAN current)
Fetching sources for assertthat (0.2.0) ... OK (CRAN current)
Fetching sources for bindr (0.1.1) ... OK (CRAN current)
Fetching sources for bindrcpp (0.2.2) ... OK (CRAN current)
Fetching sources for chifishr (0.0.0.9000) ... FAILED
Fetching sources for cli (1.0.0) ... OK (CRAN current)
Fetching sources for crayon (1.3.4) ... OK (CRAN current)
Fetching sources for dplyr (0.7.6) ... OK (CRAN current)
Fetching sources for fansi (0.3.0) ... OK (CRAN current)
Fetching sources for glue (1.3.0) ... OK (CRAN current)
Fetching sources for magrittr (1.5) ... OK (CRAN current)
Fetching sources for packrat (0.4.9-3) ... OK (CRAN current)
Fetching sources for pillar (1.3.0) ... OK (CRAN current)
Fetching sources for pkgconfig (2.0.2) ... OK (CRAN current)
Fetching sources for plogr (0.2.0) ... OK (CRAN current)
Fetching sources for purrr (0.2.5) ... OK (CRAN current)
Fetching sources for rlang (0.2.2) ... OK (CRAN current)
Fetching sources for tibble (1.4.2) ... OK (CRAN current)
Fetching sources for tidyselect (0.2.4) ... OK (CRAN current)
Fetching sources for utf8 (1.1.4) ... OK (CRAN current)
Error in snapshotSources(project, activeRepos(project), allRecordsFlat) :
  Errors occurred when fetching source files:
Error in getSourceForPkgRecord(pkgRecord, sourceDir, availablePkgs, repos) :
  Failed to retrieve package sources for chifishr 0.0.0.9000 from CRAN (internet connectivity issue?)
In addition: Warning message:
In FUN(X[[i]], ...) :
  Package 'chifishr 0.0.0.9000' was installed from sources; Packrat will assume this package is available from a CRAN-like repository during future restores
While you might think packrat successfully captured all dependencies besides chifishr, it turns out the failure on chifishr means packrat hasn’t been initialized at all:
> packrat::status()
Error: This project has not yet been packified.
Run 'packrat::init()' to init packrat.
(though it did modify my
It turns out
packrat doesn’t handle source-not-available very gracefully. This is where submodules come in!
At the command line (or the Terminal tab in RStudio), we use
git submodule add to add the
chifishr repo as a submodule of our project’s repo. We’ll stick it in a
packages/ folder to keep things organized; this is particularly useful if you have more than one submodule’d dependency.
git submodule add https://gitlab.com/scheidec/chifishr.git ./packages/chifishr
You can see the result in this commit. If you click into
packages/ you’ll see
chifishr doesn’t look like a normal folder; it is associated with the specific commit we cloned from (most recent on
master by default), and if you click on it you’ll be transported to the Gitlab page for
chifishr (because a submodule is just a pointer!).
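To see the pointer for yourself, the sketch below builds throwaway stand-ins for both repositories in a temporary directory and inspects what git stores. The local paths, the one-line DESCRIPTION file, and the demo commit identity are assumptions made so the example runs offline; `protocol.file.allow=always` is only there because recent git versions restrict submodules cloned from local paths.

```shell
set -eu
demo=$(mktemp -d)
# Helper: commit with a throwaway identity so no global git config is needed.
commit() { git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"; }

# Stand-in for the package repository (hypothetical content).
git init -q "$demo/chifishr"
echo "Package: chifishr" > "$demo/chifishr/DESCRIPTION"
git -C "$demo/chifishr" add DESCRIPTION
commit "$demo/chifishr" "Initial package"

# The analysis project, with the package added under packages/.
git init -q "$demo/project"
git -C "$demo/project" -c protocol.file.allow=always \
  submodule add "$demo/chifishr" packages/chifishr
commit "$demo/project" "Add chifishr submodule"

# .gitmodules maps the path to the URL it was cloned from...
cat "$demo/project/.gitmodules"
# ...while the pinned commit is reported by submodule status.
git -C "$demo/project" submodule status
```

With a real remote, only the `git submodule add` line changes: the URL replaces the local path and the `-c protocol.file.allow=always` override is dropped.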
Now we need to tell
packrat we have a local repo of packages in our
packages/ folder. This means we need to give one extra argument to packrat::init().
This does NOT work well through the RStudio UI because it will only let you pick full paths, which are specific to your system and thus not portable (I bet you don’t have
~/Code/claytonjy/ on your machine!). To make sure other people can use this project, we need a relative path from the root of our project, e.g.
./packages or just packages:
> packrat::init(options = list(local.repos = c("packages")))
Initializing packrat project in directory:
- "~/Code/claytonjy/packrat-submodule-demo"

Adding these packages to packrat:
    _
    BH 1.66.0-1
    R6 2.2.2
    Rcpp 0.12.18
    assertthat 0.2.0
    backports 1.1.2
    base64enc 0.1-3
    bindr 0.1.1
    bindrcpp 0.2.2
    chifishr 0.0.0.9000
    cli 1.0.0
    crayon 1.3.4
    digest 0.6.16
    dplyr 0.7.6
    evaluate 0.11
    fansi 0.3.0
    glue 1.3.0
    highr 0.7
    htmltools 0.3.6
    jsonlite 1.5
    knitr 1.20
    magrittr 1.5
    markdown 0.8
    mime 0.5
    packrat 0.4.9-3
    pillar 1.3.0
    pkgconfig 2.0.2
    plogr 0.2.0
    praise 1.0.0
    purrr 0.2.5
    rlang 0.2.2
    rmarkdown 1.10
    rprojroot 1.3-2
    stringi 1.2.4
    stringr 1.3.1
    testthat 2.0.0
    tibble 1.4.2
    tidyselect 0.2.4
    tinytex 0.7
    utf8 1.1.4
    withr 2.1.2
    xfun 0.3
    yaml 2.2.0

Fetching sources for BH (1.66.0-1) ... OK (CRAN current)
# output truncated
Fetching sources for chifishr (0.0.0.9000) ... OK (local)
# output truncated
Fetching sources for yaml (2.2.0) ... OK (CRAN current)
Snapshot written to '/home/claytonjy/Code/claytonjy/packrat-submodule-demo/packrat/packrat.lock'
Installing BH (1.66.0-1) ... OK (built source)
# output truncated
Installing chifishr (0.0.0.9000) ... OK (built source)
Initialization complete!

Restarting R session...
And we can check this actually worked, too:
> packrat::status()
Up to date.
Awesome! Now our repository looks like this, with the
packrat/ subfolder, a special
.Rprofile created by
packrat, and a new line in our
.gitignore so we don’t commit built versions of our packages to git.
It’s also interesting to look at the
chifishr entry in
packrat/packrat.lock as it’s a bit different from the others. This file is where
packrat keeps track of each package our project depends on, along with metadata like version and where it came from.
Package: chifishr
Source: source
Version: 0.0.0.9000
Hash: 4ed916f9c88ee65137ef51b5e0c70cc9
Requires: dplyr, magrittr, purrr
SourcePath: packages/chifishr
So packrat knows which version we’re using, a hash of the package, and where we’ve kept it (the exact commit is pinned by the submodule itself, not by packrat.lock). That last line is important, and is what would be non-portable if we had used the RStudio interface to initialize packrat.
Now to share this project, a coworker simply needs to
git clone and open the
.Rproj file in RStudio. Packrat will work its magic to install all our dependencies on their machine (without clobbering their system-wide packages), from source. That might take a bit, but then
source("analysis.R") will work using the exact same code we used, down to the dependencies. That’s reproducibility!
You can also share by bundling with
packrat::bundle(). Internally, we collaborate over git, but bundling is nice for handing off code to clients who don’t have access to our git repos.
While we used a public package here for ease of demonstration, this works with private packages (that’s the whole point) and will even work if the next user doesn’t have permissions to the git repo of the private package! This is because, by default,
packrat stores the source code for all dependencies in
packrat/src/, which is all
packrat needs to install
chifishr somewhere else.
Suppose development continues on
chifishr, and we want to bring those updates into our project. This means we need to update both our submodule and our packrat snapshot.
If you aren’t the person that added the submodule, you’ll first need to fetch it. When
git clone-ing something with a submodule, the submodule is just a pointer to where the code is, and that code isn’t brought in locally. We can get that code with a single
git command to both initialize and update our submodule:
git submodule update --init
That will go fetch the code from the repo(s) that hold our submodule(s). For this to work, you do need access to those repositories.
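A minimal offline sketch of what a fresh clone looks like before and after that command. The temporary directory, the stand-in repositories, and the `coworker` clone name are all assumptions for illustration; `protocol.file.allow=always` is only needed because the stand-in submodule lives at a local path.

```shell
set -eu
demo=$(mktemp -d)
# Helper: commit with a throwaway identity so no global git config is needed.
commit() { git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"; }

# Stand-in package repo and a project that includes it as a submodule.
git init -q "$demo/chifishr"
echo "Package: chifishr" > "$demo/chifishr/DESCRIPTION"
git -C "$demo/chifishr" add DESCRIPTION
commit "$demo/chifishr" "Initial package"
git init -q "$demo/project"
git -C "$demo/project" -c protocol.file.allow=always \
  submodule add "$demo/chifishr" packages/chifishr
commit "$demo/project" "Add chifishr submodule"

# A plain clone records the submodule but leaves its folder empty.
git clone -q "$demo/project" "$demo/coworker"
ls -A "$demo/coworker/packages/chifishr"

# Initialize and fetch the submodule in one step.
git -C "$demo/coworker" -c protocol.file.allow=always submodule update --init
ls -A "$demo/coworker/packages/chifishr"
```

If you know up front that a repository has submodules, `git clone --recurse-submodules` does the clone and the submodule fetch in a single command.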
We can then navigate into the submodule and work with it like any other git repository:
cd packages/chifishr
git pull
Then we can install the updated package and save that version to packrat with a pair of commands in R:
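A minimal sketch of what that pair could look like, wrapped in `Rscript` so it runs from the same terminal session. Using `install.packages()` with `repos = NULL` for the install step is an assumption (`devtools::install()` or another source installer would serve the same purpose), and `update_chifishr` is a hypothetical helper name; `packrat::snapshot()` is packrat's own function for recording installed versions.

```shell
# Run from the project root, so packrat's .Rprofile activates the private library.
update_chifishr() {
  # Assumption: a plain source install from the submodule folder.
  Rscript -e 'install.packages("packages/chifishr", repos = NULL, type = "source")'
  # Record the newly installed version in packrat/packrat.lock.
  Rscript -e 'packrat::snapshot()'
}

# Usage, after pulling inside the submodule:
#   update_chifishr
```

In an interactive R session, the two quoted expressions can be typed directly at the console instead.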
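If you look at
git status after all that (from the project, not the submodule), you’ll see updates to both the
packrat/ folder and the
packages/chifishr submodule entry (the pinned commit has moved; .gitmodules itself only stores the path and URL); be sure to commit both!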
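The git half of that update can be sketched offline as follows. The stand-in repositories, the demo identity, and the bumped version number are assumptions for illustration; in a real project the `packrat/` changes would be staged in the same commit.

```shell
set -eu
demo=$(mktemp -d)
# Helper: commit with a throwaway identity so no global git config is needed.
commit() { git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"; }

# Stand-in package repo and a project pinned to its first commit.
git init -q "$demo/chifishr"
echo "Version: 0.0.0.9000" > "$demo/chifishr/DESCRIPTION"
git -C "$demo/chifishr" add DESCRIPTION
commit "$demo/chifishr" "Initial package"
git init -q "$demo/project"
git -C "$demo/project" -c protocol.file.allow=always \
  submodule add "$demo/chifishr" packages/chifishr
commit "$demo/project" "Add chifishr submodule"

# Development continues upstream (hypothetical version bump)...
echo "Version: 0.0.0.9001" > "$demo/chifishr/DESCRIPTION"
git -C "$demo/chifishr" add DESCRIPTION
commit "$demo/chifishr" "Bump version"

# ...and we pull it into the submodule.
git -C "$demo/project/packages/chifishr" pull -q --ff-only

# From the project root, the submodule path is what shows up as modified.
git -C "$demo/project" status --short

# Stage and commit the new pointer.
git -C "$demo/project" add packages/chifishr
commit "$demo/project" "Update chifishr submodule"
```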
If you choose to ignore source files (something we do a lot to save space in our repositories), each new user will need to
git submodule update --init after cloning but before opening the
.Rproj and letting
packrat initialize. This is because that’s the only place to get the source from, and
packrat will crash and burn if it can’t find the source, a bit like our first attempt at packrat::init().
If you deploy this project elsewhere (perhaps it’s a Shiny app that builds automatically with a CI service like Travis or Gitlab’s built-in CI) without source, you need to make sure a similar submodule-updating step occurs. In Gitlab CI, this means setting a variable like
GIT_SUBMODULE_STRATEGY; see here for more info.
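As a rough sketch, the snippet below writes a minimal `.gitlab-ci.yml` into a temporary directory standing in for the project root. The job name `restore` and its script line are hypothetical; `GIT_SUBMODULE_STRATEGY` accepts `none`, `normal` (top-level submodules only), or `recursive`.

```shell
set -eu
ci_dir=$(mktemp -d)   # stand-in for the project root

cat > "$ci_dir/.gitlab-ci.yml" <<'EOF'
variables:
  # Have the runner fetch submodules before the job's script runs.
  GIT_SUBMODULE_STRATEGY: recursive

# Hypothetical job: restore the packrat library from the lockfile.
restore:
  script:
    - Rscript -e 'packrat::restore()'
EOF

cat "$ci_dir/.gitlab-ci.yml"
```

For a project with a single, non-nested submodule like this one, `normal` would be enough; `recursive` also covers submodules inside submodules.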
While we used a Gitlab repository for both our project and our dependency, nothing here was specific to Gitlab, and should work regardless of where you host your code, e.g. GitHub, Bitbucket, internally, etc. Note that if your dependency is a public repository, the submodule approach isn’t necessary, since any internet-connected computer could clone the code.
Using submodules isn’t the only way to use
packrat with private packages; there are at least two others.
We could tell
packrat that our
chifishr package is an external one, which will tell Packrat to expect it to exist on the system and prevent it from storing a copy like it does for other packages. I don’t like this because it breaks reproducibility by relying on each user having the right version of the package installed. This might be less concerning for older, more stable packages, but our internal packages often change rapidly enough that having the exact same version everywhere a project is used is essential.
Another option is to make a CRAN-like repository with our private package in it, host that somewhere accessible to anywhere we’d want to use the package from (e.g. a server or shared drive on the company intranet), and then point
packrat at that. Packrat provides great instructions on how to do this, but it would require us to maintain some infrastructure. At the very least we’d need to manage the place we put this CRAN-like repository, and we’d also want to make sure that repository gets updated versions of our private package (e.g. using CI/CD). I think this is a fine option for some people, particularly if you don’t mind the upkeep and don’t want to teach your coworkers about submodules, but at Methods we strongly prefer the no-infrastructure approach of submodules.