
## Faster package installation

Every few weeks or so, a tweet pops up asking how to speed up package installation in R.

Depending on the luck of twitter, the author may get a few suggestions.

The bigger picture is that package installation time is starting to become more of an issue for a number of reasons. For example, packages are getting larger and more complex (tidyverse and friends), so installation just takes longer. Or we are using more continuous integration strategies such as Travis or GitLab-CI, and want quick feedback. Or we are simply updating a large number of packages via update.packages(). This is a problem we often solve for our clients – optimising their CI/CD pipelines.

The purpose of this blog post is to pull together a few different methods for tackling this problem. If I’ve missed any, let me know (https://twitter.com/csgillespie)!

## Faster installation with Ncpus

The first tactic you should use is the Ncpus argument in install.packages() and update.packages(). This installs packages in parallel. It doesn’t speed up an individual package install, but it does allow a package's dependencies to install in parallel, e.g. those of the tidyverse. Using it is easy; it’s just an additional argument in install.packages(). So to use six cores, we would simply use

```r
install.packages("tidyverse", Ncpus = 6)
```

When installing a fresh version of the tidyverse and all dependencies, this can give a two-fold speed-up.

| Ncpus | Elapsed (Secs) | Ratio |
|-------|----------------|-------|
| 1     | 409            | 2.26  |
| 2     | 224            | 1.24  |
| 4     | 196            | 1.08  |
| 6     | 181            | 1.00  |
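The Ratio column appears to be each elapsed time divided by the fastest (six-core) time. A quick check of that arithmetic, using only the timings from the table:

```r
# Elapsed times (seconds) copied from the table above
ncpus = c(1, 2, 4, 6)
elapsed = c(409, 224, 196, 181)

# Ratio relative to the fastest run (six cores)
ratio = round(elapsed / min(elapsed), 2)
data.frame(ncpus, elapsed, ratio)
```

Most of the gain comes from moving from one core to two; beyond four cores the improvement is small for this particular set of packages.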

Not bad for a simple tweak with no downsides. For further information, see our blog post from a few years ago.

In short, this is something you should definitely use and add to your .Rprofile. It would in theory speed up continuous integration pipelines, but only if you have multiple cores available. The free version of Travis only comes with a single core, but if you hook up a multi-core Kubernetes cluster to your CI (we sometimes do this at Jumping Rivers), then you can achieve a large speed-up.
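One way to make this the default is to set the Ncpus option, which install.packages() uses when the argument isn't supplied. A minimal sketch for a .Rprofile, assuming you are happy to use all but one of the detected cores (that choice is mine, not a recommendation from any documentation):

```r
# Sketch for ~/.Rprofile: set a default for the Ncpus argument.
# Using "all cores but one" is an assumption; pick whatever suits your machine.
n_cores = parallel::detectCores()  # can return NA on some platforms
if (is.na(n_cores)) n_cores = 1L
options(Ncpus = max(1L, n_cores - 1L))
```

With this in place, a plain install.packages("tidyverse") should pick up the option without any extra arguments.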

## Faster installation with ccache

If you are installing packages from source, i.e. tar.gz files, then most of the installation time is spent on compiling source code, such as C, C++ & Fortran. A few years ago, Dirk Eddelbuettel wrote a great blog post on leveraging the ccache utility for reducing the compile time step. Essentially, ccache stores the resulting object file created when compiling. If that file is ever compiled again, instead of rebuilding, ccache returns the object code, resulting in a significant speed up. It’s the classic trade-off between memory (caching) and CPU.

Dirk’s post gives clear details on how to implement ccache (so I won’t repeat them). He also compares re-installation times of packages, with RQuantLib going from 500 seconds to a few seconds. However, for ccache to be effective, the source files have to be static. Obviously, when you update an R package things change!
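Dirk's post remains the reference for the set-up, but the rough idea is to prefix the compilers with ccache in ~/.R/Makevars. The sketch below builds such lines and writes them to a temporary file for inspection. The helper name `ccache_makevars()` is hypothetical, and the compiler names (gcc, g++, gfortran) are assumptions that will differ between systems.

```r
# Build Makevars lines that route compilation through ccache.
# Compiler names are assumptions; adjust them to match your system.
ccache_makevars = function(cc = "gcc", cxx = "g++", fc = "gfortran") {
  c(
    paste("CC = ccache", cc),
    paste("CXX = ccache", cxx),
    paste("CXX11 = ccache", cxx),
    paste("CXX14 = ccache", cxx),
    paste("FC = ccache", fc),
    paste("F77 = ccache", fc)
  )
}

# This is only worthwhile if ccache is actually on the PATH
if (nzchar(Sys.which("ccache"))) {
  message("ccache found")
} else {
  message("ccache not found; install it first")
}

# Write to a temporary file for inspection.
# The real location would be ~/.R/Makevars.
makevars = file.path(tempdir(), "Makevars")
writeLines(ccache_makevars(), makevars)
readLines(makevars)
```

Writing to a temporary path first avoids clobbering an existing Makevars, which may already contain compiler flags you want to keep.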

As an experiment, I downloaded the last seventeen versions of dplyr from CRAN. This takes us back to version 0.5.0 from 2016. Next I installed each version in turn, via

```r
# Avoid tidyverse packages, as we are messing about with dplyr
f = list.files("data", full.names = TRUE)
elapsed = numeric(length(f))
for (i in seq_along(f)) {
  # Time the installation of each source tarball in turn
  elapsed[i] = system.time(install.packages(f[i], repos = NULL))["elapsed"]
}
```
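The loop above assumes the source tarballs already sit in a data directory. One way they could be fetched is from the CRAN archive, where superseded releases are kept under src/contrib/Archive. A small sketch; the helper name `archive_url()` is hypothetical and the download itself is left commented out:

```r
# Build the CRAN archive URL for a given (superseded) package version.
archive_url = function(pkg, version, cran = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz", cran, pkg, pkg, version)
}

u = archive_url("dplyr", "0.5.0")
u

# Not run: fetch the tarball into the data directory used above
# dir.create("data", showWarnings = FALSE)
# download.file(u, file.path("data", basename(u)))
```

Note that the current release of a package lives in src/contrib rather than in the archive, so the latest version would need a slightly different URL.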


As all package dependencies have been installed and the source code has already been downloaded, the above code times the installation of just dplyr. If we then implement ccache, we can easily rerun the above code. After a little manipulation we can plot the absolute installation times.
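The "little manipulation" might look something like the sketch below, which assumes the two runs were saved as numeric vectors of elapsed times. The function name `speed_up()` and the vector names in the commented usage are hypothetical, not taken from the original analysis.

```r
# Combine timings from the standard and ccache runs.
# standard, ccache: elapsed times in seconds, one entry per version.
speed_up = function(versions, standard, ccache) {
  stopifnot(
    length(versions) == length(standard),
    length(standard) == length(ccache)
  )
  data.frame(
    version = versions,
    standard = standard,
    ccache = ccache,
    fold = standard / ccache  # values above 1 mean ccache was faster
  )
}

# Not run: elapsed_standard and elapsed_ccache would come from
# running the timing loop before and after enabling ccache
# timings = speed_up(basename(f), elapsed_standard, elapsed_ccache)
# barplot(rbind(timings$standard, timings$ccache), beside = TRUE,
#         names.arg = timings$version, ylab = "Elapsed (secs)")
```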

The first (slightly obvious) takeaway is that there is no speed-up with dplyr v0.5.0. This is simply because ccache relies on previous installations. As v0.5.0 is the first version in our study, there is no difference between standard and ccache installations.

Over the seventeen versions of dplyr, we achieved a 24-fold speed-up for three versions, and a more modest two- to four-fold speed-up for a further three versions. Averaged over all seventeen versions, a typical speed-up is around 50%.

Overall, using ccache is a very effective and easy strategy. It requires a single, simple set-up, and doesn’t require root access. Of course it doesn’t always work, but it never really slows anything down.

At the start of this section, I mentioned the trade-off between memory and CPU. I’ve been using ccache since 2017, and the current cache size is around 6GB, which on a modern hard drive isn’t much (and I install a lot of packages)!

## Using Ubuntu Binaries

On Linux, the standard way of installing packages is from source via install.packages(). However, it is also possible to install packages using binary packages. This has two main benefits:

• It’s faster – typically a few seconds
• It (usually) solves any horrible dependency problems by installing the necessary dev-libraries.

If you are using continuous integration, such as GitLab runners, then this is a straightforward step to reduce the package installation time. The key idea is to add an additional binary source to your sources.list file; see, for example, the line in rocker. After that, you can install most CRAN packages via

```bash
sudo apt install r-cran-dplyr
```

The one big downside here is that the user requires root access to install an R package, so this solution isn’t suitable in all situations.
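In a CI script it can be handy to generate the apt command from a vector of R package names. The sketch below assumes the usual naming convention of r-cran- followed by the lower-cased package name; the helper `apt_install_command()` is hypothetical, and not every CRAN package will necessarily be available as a binary.

```r
# Turn R package names into an apt install command.
# Assumes binaries are named r-cran-<lower-cased package name>.
apt_install_command = function(pkgs) {
  paste("sudo apt install", paste0("r-cran-", tolower(pkgs), collapse = " "))
}

apt_install_command(c("dplyr", "data.table"))
```

The function only builds the command as a string; running it is left to the shell, since it needs root access.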

There’s lots of documentation available, on CRAN and in blog posts, so I won’t bother repeating it here.

## Using RStudio Package Manager

The RStudio Package Manager is one of RStudio’s Pro products that is used to ultimately pay for their open source work, e.g. the RStudio desktop IDE and all of their tidyverse R packages.

CRAN mirrors have for a long time distributed binary packages for Windows and Mac. The RSPM provides precompiled binaries of CRAN packages for:

• Ubuntu 16.04 (Xenial), Ubuntu 18.04 (Bionic)
• CentOS/RHEL 7, CentOS/RHEL 8
• openSUSE 42/SLES 12, openSUSE 15/SLES 15
• Windows (soon, currently in beta)

The big advantage of RSPM over the Ubuntu binaries solution above, is that root access is no longer necessary. Users can just install via the usual install.packages().
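In practice this means pointing the repos option at an RSPM URL for your distribution. The sketch below assumes the public RSPM instance and Ubuntu 18.04 (Bionic); both the URL and the user agent setting are my understanding of the RSPM set-up instructions, so check them against your own RSPM server before relying on them.

```r
# Assumed URL of the public RSPM instance for Ubuntu 18.04 (Bionic)
rspm = "https://packagemanager.rstudio.com/all/__linux__/bionic/latest"

# Try RSPM first, falling back to a regular CRAN mirror
options(repos = c(RSPM = rspm, CRAN = "https://cloud.r-project.org"))

# As I understand it, RSPM uses the user agent to decide which binary
# to serve, so it should describe the running R version and platform
options(HTTPUserAgent = sprintf(
  "R/%s R (%s)",
  getRversion(),
  paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
))

# Not run: no root access needed
# install.packages("dplyr")
```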

Jumping Rivers are full service, RStudio certified partners. Part of our role is to offer support in RStudio Pro products. If you use any RStudio Pro products, feel free to contact us. We may be able to offer free support.

The post Faster R package installation appeared first on Jumping Rivers.