# Tip: Optimize your Rcpp loops

**Florian Privé**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I will show you how to optimize your `Rcpp`

loops so that they are 2 to 3 times faster than a standard implementation.

## Context

### Real data example

For this post, I will use a `big.matrix`

which represents genotypes for 15,283 individuals, corresponding to the number of mutations (0, 1 or 2) at 287,155 different loci. Here, I will use only the first 10,000 loci (columns).

What you need to know about the `big.matrix`

format:

- you can easily and quickly access matrice-like objects stored on disk,
- you can use different types of storage (I use type
`char`

to store each element on only 1 byte), - it is column-major ordered as standard
`R`

matrices, - you can access elements of a
`big.matrix`

using`X[i, j]`

in`R`

, - you can access elements of a
`big.matrix`

using`X[j][i]`

in`Rcpp`

, - you can get a
`RcppEigen`

or`RcppArmadillo`

view of a`big.matrix`

(see Appendix). - for more details, go to the GitHub repo.

Peek at the data:

print(dim(X)) ## [1] 15283 10000 print(X[1:10, 1:12]) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] ## [1,] 2 0 2 0 2 2 2 1 2 2 2 2 ## [2,] 2 0 1 2 1 1 1 1 2 1 2 2 ## [3,] 2 0 2 2 2 2 1 1 2 1 2 2 ## [4,] 2 2 0 2 0 0 0 2 2 2 0 2 ## [5,] 2 1 2 2 2 2 2 1 2 2 2 2 ## [6,] 2 1 2 1 2 2 1 1 2 2 2 2 ## [7,] 2 0 2 0 2 2 2 0 2 1 2 2 ## [8,] 2 1 1 2 1 1 1 1 2 1 2 2 ## [9,] 2 1 2 2 2 2 2 2 2 2 2 2 ## [10,] 2 0 2 1 2 2 2 0 2 1 2 1

### What I needed

I needed a fast matrix-vector multiplication between a `big.matrix`

and a vector. Moreover, I could not use any `RcppEigen`

or `RcppArmadillo`

multiplication because I needed some options of efficiently subsetting columns or rows in my matrix (see Appendix).

Writing this multiplication in `Rcpp`

is no more than two loops:

// [[Rcpp::depends(RcppEigen, bigmemory, BH)]] #include <RcppEigen.h> #include <bigmemory/MatrixAccessor.hpp> using namespace Rcpp; // [[Rcpp::export]] NumericVector prod1(XPtr<BigMatrix> bMPtr, const NumericVector& x) { MatrixAccessor<char> macc(*bMPtr); int n = bMPtr->nrow(); int m = bMPtr->ncol(); NumericVector res(n); int i, j; for (j = 0; j < m; j++) { for (i = 0; i < n; i++) { res[i] += macc[j][i] * x[j]; } } return res; }

One test:

y <- rnorm(ncol(X)) print(system.time( test <- prod1([email protected], y) )) ## user system elapsed ## 0.664 0.004 0.668

**What comes next should be transposable to other applications and other types of data.**

## Unrolling optimization

While searching for optimizing my multiplication, I came across this Stack Overflow answer.

Unrolling in action:

// [[Rcpp::depends(RcppEigen, bigmemory, BH)]] #include <RcppEigen.h> #include <bigmemory/MatrixAccessor.hpp> using namespace Rcpp; // [[Rcpp::export]] NumericVector prod4(XPtr<BigMatrix> bMPtr, const NumericVector& x) { MatrixAccessor<char> macc(*bMPtr); int n = bMPtr->nrow(); int m = bMPtr->ncol(); NumericVector res(n); int i, j; for (j = 0; j <= m - 4; j += 4) { for (i = 0; i < n; i++) { // unrolling optimization res[i] += (x[j] * macc[j][i] + x[j+1] * macc[j+1][i]) + (x[j+2] * macc[j+2][i] + x[j+3] * macc[j+3][i]); } // The parentheses are somehow important. Try without. } for (; j < m; j++) { for (i = 0; i < n; i++) { res[i] += x[j] * macc[j][i]; } } return res; } require(microbenchmark) print(microbenchmark( PROD1 = test1 <- prod1([email protected], y), PROD4 = test2 <- prod4([email protected], y), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## PROD1 609.0916 612.6428 613.7418 613.3740 616.4907 617.1096 5 ## PROD4 262.2658 267.7352 267.0268 268.0026 268.0785 269.0521 5 print(all.equal(test1, test2)) ## [1] TRUE

Nice! Let’s try more. Why not using 8 or 16 rather than 4?

Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods.cpp') print(bench <- microbenchmark( PROD1 = prod1([email protected], y), PROD2 = prod2([email protected], y), PROD4 = prod4([email protected], y), PROD8 = prod8([email protected], y), PROD16 = prod16([email protected], y), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## PROD1 620.9375 627.9209 640.6087 631.1818 659.4236 663.5798 5 ## PROD2 407.6275 418.1752 417.1746 418.4589 419.0665 422.5451 5 ## PROD4 267.1687 271.4726 283.1928 271.9553 279.6698 325.6979 5 ## PROD8 241.5542 242.9120 255.4974 246.5218 267.7683 278.7307 5 ## PROD16 212.4335 213.5228 217.4781 217.1801 221.5119 222.7423 5 time <- summary(bench)[, "median"] step <- 2^(0:4) plot(step, time, type = "b", xaxt = "n", yaxt = "n", xlab = "size of each step") axis(side = 1, at = step) axis(side = 2, at = round(time))

## Conclusion

We have seen that unrolling can dramatically improve performances on loops. Steps of size 8 or 16 are of relatively little extra gain compared to 2 or 4.

As pointed out in the SO answer, it can behave rather differently between systems. So, if it is for your personal use, use the maximum gain (try 32!), but as I want my function to be used by others in a package, I think it’s safer to choose a step of 4.

## Appendix

You can do a `big.matrix`

-vector multiplication easily with `RcppEigen`

or `RcppArmadillo`

(see this code) but it lacks of efficient subsetting option.

Indeed, you still can’t use subsetting in `Eigen`

, but this will come as said in this feature request. For `Armadillo`

, you can but it is rather slow:

Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods2.cpp') n <- nrow(X) ind <- sort(sample(n, size = n/2)) print(microbenchmark( EIGEN = test3 <- prodEigen([email protected], y), ARMA = test4 <- prodArma([email protected], y), ARMA_SUB = test5 <- prodArmaSub([email protected], y, ind - 1), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## EIGEN 567.5607 570.1843 717.2433 572.9402 576.2028 1299.3285 5 ## ARMA 1242.3581 1263.8803 1329.1212 1264.7070 1284.5612 1590.0993 5 ## ARMA_SUB 455.1174 457.5862 466.3982 461.5883 465.9056 491.7935 5 print(all( all.equal(test3, test), all.equal(as.numeric(test4), test), all.equal(as.numeric(test5), test[ind]) )) ## [1] TRUE

**leave a comment**for the author, please follow the link and comment on their blog:

**Florian Privé**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.