**Florian Privé**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I will show you how to optimize your `Rcpp`

loops so that they are 2 to 3 times faster than a standard implementation.

## Context

### Real data example

For this post, I will use a `big.matrix`

which represents genotypes for 15,283 individuals, corresponding to the number of mutations (0, 1 or 2) at 287,155 different loci. Here, I will use only the first 10,000 loci (columns).

What you need to know about the `big.matrix`

format:

- you can easily and quickly access matrice-like objects stored on disk,
- you can use different types of storage (I use type
`char`

to store each element on only 1 byte), - it is column-major ordered as standard
`R`

matrices, - you can access elements of a
`big.matrix`

using`X[i, j]`

in`R`

, - you can access elements of a
`big.matrix`

using`X[j][i]`

in`Rcpp`

, - you can get a
`RcppEigen`

or`RcppArmadillo`

view of a`big.matrix`

(see Appendix). - for more details, go to the GitHub repo.

Peek at the data:

`print(dim(X))`

`## [1] 15283 10000`

`print(X[1:10, 1:12])`

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 2 0 2 0 2 2 2 1 2 2 2 2
## [2,] 2 0 1 2 1 1 1 1 2 1 2 2
## [3,] 2 0 2 2 2 2 1 1 2 1 2 2
## [4,] 2 2 0 2 0 0 0 2 2 2 0 2
## [5,] 2 1 2 2 2 2 2 1 2 2 2 2
## [6,] 2 1 2 1 2 2 1 1 2 2 2 2
## [7,] 2 0 2 0 2 2 2 0 2 1 2 2
## [8,] 2 1 1 2 1 1 1 1 2 1 2 2
## [9,] 2 1 2 2 2 2 2 2 2 2 2 2
## [10,] 2 0 2 1 2 2 2 0 2 1 2 1
```

### What I needed

I needed a fast matrix-vector multiplication between a `big.matrix`

and a vector. Moreover, I could not use any `RcppEigen`

or `RcppArmadillo`

multiplication because I needed some options of efficiently subsetting columns or rows in my matrix (see Appendix).

Writing this multiplication in `Rcpp`

is no more than two loops:

```
// [[Rcpp::depends(RcppEigen, bigmemory, BH)]]
#include
```
#include
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector prod1(XPtr bMPtr, const NumericVector& x) {
MatrixAccessor<char> macc(*bMPtr);
int n = bMPtr->nrow();
int m = bMPtr->ncol();
NumericVector res(n);
int i, j;
for (j = 0; j < m; j++) {
for (i = 0; i < n; i++) {
res[i] += macc[j][i] * x[j];
}
}
return res;
}

One test:

```
y <- rnorm(ncol(X))
print(system.time(
test <- prod1([email protected], y)
))
```

```
## user system elapsed
## 0.664 0.004 0.668
```

**What comes next should be transposable to other applications and other types of data.**

## Unrolling optimization

While searching for optimizing my multiplication, I came across this Stack Overflow answer.

Unrolling in action:

```
// [[Rcpp::depends(RcppEigen, bigmemory, BH)]]
#include
```
#include
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector prod4(XPtr bMPtr, const NumericVector& x) {
MatrixAccessor<char> macc(*bMPtr);
int n = bMPtr->nrow();
int m = bMPtr->ncol();
NumericVector res(n);
int i, j;
for (j = 0; j <= m - 4; j += 4) {
for (i = 0; i < n; i++) { // unrolling optimization
res[i] += (x[j] * macc[j][i] + x[j+1] * macc[j+1][i]) +
(x[j+2] * macc[j+2][i] + x[j+3] * macc[j+3][i]);
} // The parentheses are somehow important. Try without.
}
for (; j < m; j++) {
for (i = 0; i < n; i++) {
res[i] += x[j] * macc[j][i];
}
}
return res;
}

```
require(microbenchmark)
print(microbenchmark(
PROD1 = test1 <- prod1([email protected], y),
PROD4 = test2 <- prod4([email protected], y),
times = 5
))
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## PROD1 609.0916 612.6428 613.7418 613.3740 616.4907 617.1096 5
## PROD4 262.2658 267.7352 267.0268 268.0026 268.0785 269.0521 5
```

`print(all.equal(test1, test2))`

`## [1] TRUE`

Nice! Let’s try more. Why not using 8 or 16 rather than 4?

```
Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods.cpp')
print(bench <- microbenchmark(
PROD1 = prod1([email protected], y),
PROD2 = prod2([email protected], y),
PROD4 = prod4([email protected], y),
PROD8 = prod8([email protected], y),
PROD16 = prod16([email protected], y),
times = 5
))
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## PROD1 620.9375 627.9209 640.6087 631.1818 659.4236 663.5798 5
## PROD2 407.6275 418.1752 417.1746 418.4589 419.0665 422.5451 5
## PROD4 267.1687 271.4726 283.1928 271.9553 279.6698 325.6979 5
## PROD8 241.5542 242.9120 255.4974 246.5218 267.7683 278.7307 5
## PROD16 212.4335 213.5228 217.4781 217.1801 221.5119 222.7423 5
```

```
time <- summary(bench)[, "median"]
step <- 2^(0:4)
plot(step, time, type = "b", xaxt = "n", yaxt = "n",
xlab = "size of each step")
axis(side = 1, at = step)
axis(side = 2, at = round(time))
```

## Conclusion

We have seen that unrolling can dramatically improve performances on loops. Steps of size 8 or 16 are of relatively little extra gain compared to 2 or 4.

As pointed out in the SO answer, it can behave rather differently between systems. So, if it is for your personal use, use the maximum gain (try 32!), but as I want my function to be used by others in a package, I think it’s safer to choose a step of 4.

## Appendix

You can do a `big.matrix`

-vector multiplication easily with `RcppEigen`

or `RcppArmadillo`

(see this code) but it lacks of efficient subsetting option.

Indeed, you still can’t use subsetting in `Eigen`

, but this will come as said in this feature request. For `Armadillo`

, you can but it is rather slow:

```
Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods2.cpp')
n <- nrow(X)
ind <- sort(sample(n, size = n/2))
print(microbenchmark(
EIGEN = test3 <- prodEigen([email protected], y),
ARMA = test4 <- prodArma([email protected], y),
ARMA_SUB = test5 <- prodArmaSub([email protected], y, ind - 1),
times = 5
))
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## EIGEN 567.5607 570.1843 717.2433 572.9402 576.2028 1299.3285 5
## ARMA 1242.3581 1263.8803 1329.1212 1264.7070 1284.5612 1590.0993 5
## ARMA_SUB 455.1174 457.5862 466.3982 461.5883 465.9056 491.7935 5
```

```
print(all(
all.equal(test3, test),
all.equal(as.numeric(test4), test),
all.equal(as.numeric(test5), test[ind])
))
```

`## [1] TRUE`

**leave a comment**for the author, please follow the link and comment on their blog:

**Florian Privé**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.