GitLab CI for R-package development

[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]

— A basic R-phile introduction to continuous integration on GitLab

I have been using GitLab for some time, mainly for two reasons: I can have private projects at no monetary cost (I later came to realise that, as an academic, I can have the same on GitHub), and, most importantly, GitLab has so far gone under the radar of our IT department, meaning I can access it from my work computer. GitHub, on the other hand, is flagged as file sharing.

A simple CI config

Most of my time with R is spent trying to make heads or tails of various kinds of data, and I have so far authored just one R package. While I can see the benefits of a continuous integration (CI) workflow, I never bothered to actually set it up. Now that I am putting together code in smaller packages for internal use, it seemed like the right time to learn a little.

The Internet gives a few pointers on how to set up CI on GitLab. One of the resources is the blog post Docker, GitLab CI and Developing R Packages by Mustafa Hasanbulli, who gives a simple `.gitlab-ci.yml` for testing packages. Mustafa’s solution makes use of the `rocker/tidyverse` Docker image and installs the dependency packages before running `check()` from `devtools`. It’s a good solution, and by combining it with the `.gitlab-ci.yml` shared as a gist on GitHub by Artem Klevtsov, I managed to get the coverage badge I thought would be nice to have. The `.gitlab-ci.yml` for a smaller package can be along the lines of:

image: rocker/tidyverse

stages:
  - check
  - coverage

check_pkg:
  stage: check
  script:
    # List the package's dependencies inside c()
    - R -e 'install.packages(c())'
    - R -e 'devtools::check()'

coverage:
  stage: coverage
  script:
    - R -e 'covr::package_coverage(type = c("tests", "examples"))'

To extract the coverage for the coverage badge, add `Coverage: \d+.\d+%$` to the section ‘Test coverage parsing’ in Settings -> CI/CD -> General pipelines.
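
Before pasting the pattern into the settings, it can be tried out in R against a line shaped like the summary that covr prints. This is only a sketch: the sample line, including the package name `mypkg` and the percentage, is made up for illustration, and GitLab's own regex engine may differ in details from R's.

```r
# Pattern from the text; backslashes are doubled inside an R string.
coverage_pattern <- "Coverage: \\d+.\\d+%$"

# Hypothetical job-log line, shaped like the summary covr prints.
sample_line <- "mypkg Coverage: 85.71%"

# Extract the part of the line that the pattern matches.
matched <- regmatches(
  sample_line,
  regexpr(coverage_pattern, sample_line, perl = TRUE)
)
print(matched)
```

If the pattern fits, `matched` holds the text "Coverage: 85.71%", which is the part the badge value would be read from.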

Introducing cache

For my package, each of the two stages took about 45 minutes to complete, and I realized that the vast majority of the time was spent on downloading and especially installing packages. This was mainly due to the Bioconductor packages I rely on.

If only there were a way to pass the installed packages between the stages, or even between runs of the CI pipeline. There is: GitLab 9.0 introduced the option to specify a cache. The next problem is that the cache must be a directory inside the cloned project directory. Since R in the Docker images prefers to install packages in `/usr/lib/R/library`, the library path returned by `.libPaths()` must be changed. In addition, you would have to remember to add any new package to the `.gitlab-ci.yml`, which I for one would always forget, and I would then have to painstakingly figure out which packages to add.
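
The manual route described above could be sketched in R roughly as follows. The directory name `ci-lib` is a hypothetical choice, not something from the original setup, and the same directory would also have to be listed under the cache paths in the `.gitlab-ci.yml`.

```r
# Hypothetical library directory inside the cloned project (the name "ci-lib"
# is an assumption); it would also need to be listed as a cache path.
lib_dir <- file.path(getwd(), "ci-lib")
dir.create(lib_dir, showWarnings = FALSE, recursive = TRUE)

# Put the project-local directory first in the library search path.
.libPaths(c(lib_dir, .libPaths()))

# install.packages() installs into the first entry of .libPaths() by default,
# so packages installed from here on would end up in the cached directory.
print(.libPaths()[1])
```

The directory has to exist before the call, since `.libPaths()` silently drops paths that do not exist.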

A much simpler solution is to use `packrat`, something you should consider using anyway. It also allows you to use the `rocker/r-base` image and install just the packages actually required for your CI. How much of a win in terms of traffic `rocker/r-base` is over `rocker/tidyverse` probably depends on the packages you have to add. A `.gitlab-ci.yml` that caches packages could look like this:

image: rocker/r-base

stages:
  - setup
  - test

cache:
  # Omit key to use the same cache across all pipelines and branches
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - packrat/lib/

setup:
  stage: setup
  script:
    - R -e 'source("ci.R"); ci_setup()'

check:
  stage: test
  dependencies:
    - setup
  when: on_success
  script:
    - R -e 'source("ci.R"); ci_check()'

coverage:
  stage: test
  dependencies:
    - setup
  when: on_success
  only:
    - master
  script:
    - R -e 'source("ci.R"); ci_coverage()'

with the `ci.R` looking like this:

install_if_needed <- function(package_to_install){
  package_path <- find.package(package_to_install, quiet = TRUE)

  if(length(package_path) == 0){
    # Only install if not present
    install.packages(package_to_install)
  }
}

ci_setup <- function(){
  install_if_needed("packrat")
  packrat::restore()
}

ci_check <- function(){
  install_if_needed("devtools")
  devtools::check()
}

ci_coverage <- function(){
  install_if_needed("covr")
  covr::package_coverage(type = c("tests", "examples"))
}
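
The `install_if_needed()` helper above relies on `find.package()` returning a zero-length character vector, rather than raising an error, when `quiet = TRUE` and the package is missing. A small sketch of that behaviour, where `notARealPackage12345` is a made-up name assumed not to be installed anywhere:

```r
# A package that ships with R: find.package() returns its path.
present <- find.package("stats", quiet = TRUE)

# Hypothetical name that should not exist in any library: with quiet = TRUE
# the result is a zero-length character vector instead of an error.
missing_pkg <- find.package("notARealPackage12345", quiet = TRUE)

print(length(present) == 0)      # FALSE, so nothing would be installed
print(length(missing_pkg) == 0)  # TRUE, so install.packages() would be called
```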

The cache key `$CI_COMMIT_REF_SLUG` gives you a separate cache for each branch. Using `$CI_COMMIT_SHA` will give you a separate cache for each commit.

Adding the `packrat` subdirectories `src` and `lib*` to the `.gitignore` will keep your repository small, and I find it quite useful to commit just the `packrat.lock` whenever I add or remove a package. But then again, I am the only one working with my repositories, and there might be advantages I don’t know of.

I have noticed that the stages after the setup stage sometimes fail in the first run. If this happens because of the cache, rerunning the failed stage fixes it.

Using the above for my package, the first run of the pipeline took about 45 minutes, but the second run only about 8 minutes. A considerable reduction in time.

I hope the `.gitlab-ci.yml` and `ci.R` outlined here will help you get started on caching your R packages in your CI. The two files are quite simple, and if you are looking for something more sophisticated, I can recommend looking at Matt Dowle’s work on `data.table` and, of course, the GitLab Runner help pages.
