Speeding up Stan model builds for R package developers

[This article was first published on R | Dr Tom Palmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In my previous job my work computer was a Windows desktop – yes, those were the days before laptops and hotdesking!

My PhD student was interested in Bayesian methods and we put together an R package which included some Stan models. I was always frustrated by how slowly these compiled on our Windows machines. A few years later, when I got a MacBook Air I was shocked how much faster they compiled.

On my Windows machine our mrbayes package takes 3 minutes 55 seconds to compile and install. On my M4 MacBook Air it takes 1 minute 16 seconds.

The following tips show how to improve those timings.

To generate the timings I used

time R CMD INSTALL --preclean .

Big win 1: Enable parallel compilations with the MAKEFLAGS environment variable

Set the MAKEFLAGS environment variable in your ~/.Renviron file. This controls how many make jobs run concurrently. Choose a number no larger than the number of processing cores your machine has. To find this run

# Windows - in a Git Bash shell
echo $NUMBER_OF_PROCESSORS

# macOS
sysctl -n hw.logicalcpu

# Ubuntu Linux
nproc

A reasonable starting point is your core count, or a few fewer to leave headroom for whatever else you’re doing during a compilation. For example,

# In ~/.Renviron
MAKEFLAGS=-j6

Close and restart R/RStudio after making this change.

On my Windows machine this reduced the build from 3:55 to 1:15. To find your own sweet spot empirically, see the example at the end of Big win 2.

Big win 2: Enable C/C++ compiler cache using ccache

Install ccache, I find it easiest to use a package manager, e.g.,

# macOS
brew install ccache

# Ubuntu/Debian Linux
apt install ccache

# Windows
winget install ccache

Whichever installation method you use make sure ccache is on your PATH after installation. You can test with, say,

ccache --version

To enable ccache, on macOS and Linux this goes in ~/.R/Makevars; on Windows it’s ~/.R/Makevars.win (create the directory and file if they don’t exist), set

# macOS
CC = ccache clang
CXX = ccache clang++
CXX17 = ccache clang++

# Windows and Linux
# Most Linux users will be on gcc by default
# Change to clang if you're using that
CC = ccache gcc
CXX = ccache g++
CXX17 = ccache g++

After a first compilation run for the cache to be generated, subsequent compilations are much faster.

  • Windows, second compilation: 18 seconds
  • M4 MacBook Air, second compilation: 5 seconds

Perhaps more importantly, if, say, your package has 5 models and you only amend the code for one of them, ccache knows to use the cache for the 4 unchanged models.

  • Windows, second compilation, only 1 model edited: 1 minute 10 seconds
  • M4 MacBook Air, second compilation, only 1 model edited: 19 seconds

You can verify ccache is working, by observing the timing decrease and by checking the output of

ccache -s

It is also useful to zero the ccache statistics before a timing run with

ccache -z

Testing which of your models takes the longest to compile

Here’s a quick script to test which model takes the longest to compile. Save it as say test.sh at the top level of your repo and add ^test\.sh$ to your .Rbuildignore file (to avoid an R CMD check NOTE about unknown files at the top level).

for model in inst/stan/*.stan; do
  cp "$model" "$model.bak"
  # Insert at the top of the file
  sed -i "1i // benchmark $(date +%s%N)" "$model"
  ccache -z
  SECONDS=0
  R CMD INSTALL --preclean . >/dev/null 2>&1
  echo "$(basename $model): ${SECONDS}s"
  ccache -s | grep -E "Hits|Misses" | head -2
  mv "$model.bak" "$model"
done

Finding your MAKEFLAGS sweet spot

With ccache installed you can now benchmark different -jN values cleanly (the ccache -C calls ensure each run is a cold compile, so you measure raw compilation cost rather than cache hits). You can increase the number sequence up to the number of processing cores your machine has.

for j in 1 2 3 4 6 8 10; do
  ccache -C >/dev/null
  echo "=== -j$j ==="
  SECONDS=0
  MAKEFLAGS=-j$j R CMD INSTALL --preclean . >/dev/null 2>&1
  echo "elapsed: ${SECONDS}s"
done

The timings on my MacBook Air were

=== -j1 ===
elapsed: 76s
=== -j2 ===
elapsed: 48s
=== -j3 ===
elapsed: 35s
=== -j4 ===
elapsed: 36s
=== -j6 ===
elapsed: 27s
=== -j8 ===
elapsed: 27s
=== -j10 ===
elapsed: 28s

My MacBook Air has 10 cores, but only 4 of those are performance cores, so I settled on -j6 as that is where my timings plateaued — and it leaves headroom for me inevitably checking my email during a compilation.

Big win 3: Combining these in GitHub Actions workflows

In my .github/workflows/R-CMD-check.yaml I have steps for these speedups. Firstly, to set MAKEFLAGS.

      - name: Set parallel compilation flags (Linux and macOS)
        if: runner.os != 'Windows'
        shell: bash
        run: |
          NCPUS=$(nproc 2>/dev/null || sysctl -n hw.logicalcpu)
          echo "Detected ${NCPUS} processors"
          echo "MAKEFLAGS=-j${NCPUS}" >> ~/.Renviron

      - name: Set parallel compilation flags (Windows)
        if: runner.os == 'Windows'
        shell: pwsh
        run: |
          Write-Output "Detected $env:NUMBER_OF_PROCESSORS processors"
          Add-Content -Path "$HOME\.Renviron" -Value "MAKEFLAGS=-j$env:NUMBER_OF_PROCESSORS"

You can also use ccache in GitHub Actions, as follows:

      # ccache speeds up Stan model compilation dramatically on warm cache.
      # Note: Windows support via ccache-action is documented as "probably works"
      # rather than fully stable; if it causes issues, scope this step to non-Windows.
      - name: Setup ccache
        uses: hendrikmuhs/[email protected]
        with:
          # Key invalidates when Stan models or DESCRIPTION change.
          # Older caches partially seed new ones via restore-keys.
          key: ccache-${{ matrix.config.os }}-R-${{ matrix.config.r }}-${{ hashFiles('inst/stan/**/*.stan', 'DESCRIPTION') }}
          restore-keys: |
            ccache-${{ matrix.config.os }}-R-${{ matrix.config.r }}-
            ccache-${{ matrix.config.os }}-R-
          max-size: "2G"

      - name: Configure R to use ccache (Linux and macOS)
        if: runner.os != 'Windows'
        shell: bash
        run: |
          mkdir -p ~/.R
          if [ "$RUNNER_OS" = "macOS" ]; then
            cat >> ~/.R/Makevars <<'EOF'
          CC = ccache clang
          CXX = ccache clang++
          CXX14 = ccache clang++
          CXX17 = ccache clang++
          CXX20 = ccache clang++
          EOF
          else
            cat >> ~/.R/Makevars <<'EOF'
          CC = ccache gcc
          CXX = ccache g++
          CXX14 = ccache g++
          CXX17 = ccache g++
          CXX20 = ccache g++
          EOF
          fi
          echo "--- ~/.R/Makevars ---"
          cat ~/.R/Makevars

      - name: Configure R to use ccache (Windows)
        if: runner.os == 'Windows'
        shell: pwsh
        run: |
          New-Item -ItemType Directory -Force -Path "$HOME\.R" | Out-Null
          $makevars = @"
          CC = ccache gcc
          CXX = ccache g++
          CXX14 = ccache g++
          CXX17 = ccache g++
          CXX20 = ccache g++
          "@
          Add-Content -Path "$HOME\.R\Makevars.win" -Value $makevars
          Write-Output "--- ~/.R/Makevars.win ---"
          Get-Content "$HOME\.R\Makevars.win"

You can see the full file in my repo.

This reduced my ubuntu-latest run for r-release from 7 minutes 30 seconds to 4 minutes 49 seconds.

Big win 4: Switch to clang

I found that switching from gcc to clang gives a noticeable speedup; the single core compile time dropped from 3 minutes 55 seconds to 3 minutes flat on my Windows machine.

To do this you need to install clang. On Windows you install clang within RTools45 — more involved than on Linux, but doable.

# Windows within RTools45 Bash shell
# Launch C:\rtools45\ucrt64.exe
# You may need to close and reopen the shell after the first command
pacman -Syu
pacman -S mingw-w64-ucrt-x86_64-clang

# Ubuntu/Debian Linux
sudo apt install clang

At this point on Windows running

which clang

should return /ucrt/bin/clang.

Switch to clang in ~/.R/Makevars (if you’re not using ccache delete that prefix)

# On Linux
CC = ccache clang
CXX = ccache clang++
CXX14 = ccache clang++
CXX17 = ccache clang++
CXX20 = ccache clang++

and in ~/.R/Makevars.win on Windows

# On Windows
CC = ccache C:/rtools45/ucrt64/bin/clang.exe
CXX = ccache C:/rtools45/ucrt64/bin/clang++.exe
CXX17 = ccache C:/rtools45/ucrt64/bin/clang++.exe

Windows users will need to add the following to PATH

C:\rtools45\ucrt64\bin
C:\rtools45\usr\bin

You can verify things are working by running

R CMD config CXX17

I believe you need clang version 18 or later to see the speedups.

Small win 1: WSL users should use the native file system

Within WSL it is possible to access files from within its native Linux filesystem, i.e., within /home/user/..., and also on the Windows filesystem, e.g., in /mnt/c/.... I believe file operations are noticeably faster within /home/user/....

Naive guesses that made no difference

I had wondered whether running a non-debug compilation with say

pkgbuild::compile_dll(debug = FALSE)

would speed things up. It turns out it does not. For Stan models, most of the time is spent in C++ template instantiation by the compiler, not in optimisation passes — so disabling debug flags or lowering the optimisation level barely helps.

I also wondered whether using R on Windows Subsystem for Linux would speed things up just by virtue of being on Linux. It did not, timings using gcc on Windows and WSL Ubuntu were essentially identical. The advantage of using WSL is that it is easier to switch to using clang on Linux.

(Money no object) Big win 5: Switch to an Apple Silicon Mac

Apple silicon Macs have excellent single threaded performance, their unified memory architecture has very high bandwidth, they have large L1 and L2 caches, and fast NVMe SSDs. Together these produce very fast Stan model compilation times, even on the lowest end Apple Silicon Macs.

Summary

In summary, five big wins and one small win for speeding up Stan model compilation in R packages.

To leave a comment for the author, please follow the link and comment on their blog: R | Dr Tom Palmer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)