**R Enthusiast and R/C++ hero**, and kindly contributed to R-bloggers)

This gist from @hadley has been on my mind for some time. This was already a follow up of this post from @johnmyleswhite. The problem was that sugar vectorised code suffered a performance penalty compared to devectorised code. This is particularly troubling me because the whole point of sugar is to have nice syntax without paying the price of the abstraction.

The code using sugar looks like this:

```
void rcpp_vectorised() {
NumericVector a = NumericVector::create(1, 1);
NumericVector b = NumericVector::create(2, 2);
NumericVector x(2);
for (int j = 0; j < 1e6; ++j) {
x = a + b;
}
}
```

and the devectorised version, using a manual for loop:

```
void rcpp_devectorised() {
NumericVector a = NumericVector::create(1, 1);
NumericVector b = NumericVector::create(2, 2);
NumericVector x(2);
for (int j = 0; j < 1e6; ++j) {
for (int i = 0; i < 2; ++i) {
x[i] = a[i] + b[i];
}
}
}
```

So we have two `NumericVector`

`a`

and `b`

of length `2`

that we assign into another `NumericVector`

of length `2`

. Here are benchmark results from Hadley’s gist:

```
microbenchmark(
rcpp_vectorised(),
rcpp_devectorised(),
unit = "s")
# Unit: seconds
# expr min lq median uq max neval
# rcpp_vectorised() 0.012043 0.012308 0.012453 0.012580 0.01319 100
# rcpp_devectorised() 0.000837 0.000848 0.000865 0.000887 0.00101 100
```

So quite a difference indeed. There were a few reasons for this difference. The first reason was that sugar needed some modernisation. In the next release of `Rcpp11`

, we’ve made progress in that direction and some sugar expression directly control how to apply themselves into their target vector. This differs from historic implementations (as found e.g. in `Rcpp`

) which was entirely based on `operator[]`

.

The relevant working code for the `operator+`

between two vectors is now a direct call to `std::transform`

:

`template `
inline void apply( Target& target ) const {
std::transform( sugar_begin(lhs), sugar_end(lhs), sugar_begin(rhs), target.begin(), op ) ;
}

With this at end, here are some new timings of the same code, but built against `Rcpp11`

:

```
$ Rcpp11Script /tmp/sug.cpp
> require(microbenchmark)
Loading required package: microbenchmark
> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+ unit = "s")
Unit: seconds
expr min lq median uq max neval
vec 0.004440250 0.004457024 0.004464254 0.004479742 0.006363814 100
devec 0.001306806 0.001307507 0.001308996 0.001324055 0.001420102 100
```

The vectorised version is much better than the version in `Rcpp`

:

```
$ RcppScript /tmp/sug.cpp
> require(microbenchmark)
Loading required package: microbenchmark
> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+ unit = "s")
Unit: seconds
expr min lq median uq max neval
vec 0.014118233 0.014155100 0.014201735 0.014477436 0.015247361 100
devec 0.001256794 0.001259593 0.001269401 0.001278607 0.001394016 100
```

But still not optimal. However, we can’t go much further and I’ll try to give some insights here. When we do:

```
for (int i = 0; i < 2; ++i) {
x[i] = a[i] + b[i];
}
```

The number of elements to process is known and assumed equal between `x`

, `a`

and `b`

. But when we do:

```
x = a + b ;
```

The assignment operator first has to check that the sizes match, so we need to get the length of at least `x`

and `a`

(we’ll assume `a`

and `b`

have matching sizes). That’s your time difference right there. Getting the length of a vector is not a free operation. It is even quite an expensive operation in `Rcpp`

which delegates this to the `Rf_length`

function from the R api:

```
INLINE_FUN R_len_t length(SEXP s)
{
int i;
switch (TYPEOF(s)) {
case NILSXP:
return 0;
case LGLSXP:
case INTSXP:
case REALSXP:
case CPLXSXP:
case STRSXP:
case CHARSXP:
case VECSXP:
case EXPRSXP:
case RAWSXP:
return LENGTH(s);
case LISTSXP:
case LANGSXP:
case DOTSXP:
i = 0;
while (s != NULL && s != R_NilValue) {
i++;
s = CDR(s);
}
return i;
case ENVSXP:
return Rf_envlength(s);
default:
return 1;
}
}
```

This contains redundant code, e.g. we don’t need the switch as we already know the R type of the object, that’s the whole point of scoping these R objects in C++ classes. And furthermore `LENGTH`

itself has work to do, depending on whether there is support for long vectors it is implemented either like this:

```
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
```

or directly like this:

```
# define LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
```

So getting the length of a vector is not a free operation. That’s why we can’t match with the devectorised version in this case because the timings are completely dominated by it.

Here are variations of @hadley‘s code but doing the measuring internally to reduce various overheads (`.Call`

, …) :

`#include `
using namespace Rcpp ;
// [[Rcpp::plugins(cpp11)]]
typedef std::chrono::high_resolution_clock Clock ;
typedef std::chrono::duration microseconds ;
typedef Clock::time_point time_point ;
// [[Rcpp::export]]
double rcpp_devectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){
time_point start = Clock::now() ;
for (int j = 0; j < niter; ++j) {
for (int i = 0; i < n; ++i) {
x[i] = a[i] + b[i];
}
}
return std::chrono::duration_cast( Clock::now() - start ).count() ;
}
// [[Rcpp::export]]
double rcpp_vectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){
time_point start = Clock::now() ;
for (int j = 0; j < niter; ++j) {
x = a + b ;
}
return std::chrono::duration_cast( Clock::now() - start ).count() ;
}
// [[Rcpp::export]]
List rcpp_length_vec(NumericVector x, NumericVector a, int niter){
time_point start = Clock::now() ;
int n = 0;
for (int j = 0; j < niter; ++j) {
n += x.size() + a.size() ;
}
time_point end = Clock::now() ;
// here n is returned so that the compiler will not outsmart us and just not do anything.
// which would be likely to happen otherwise
return List::create( std::chrono::duration_cast( end - start ).count(), n ) ;
}

So we have the devectorised, vectorised and a third function calling as many times `.size()`

as needed by the sugar implementation. We can compile this file with both `Rcpp`

and `Rcpp11`

and make our own benchmarking. An interesting thing to note is that none of these functions allocate between the two ticks, so there’s no garbage collector pollution of the timings.

```
require(microbenchmark)
env <- new.env()
Rcpp::sourceCpp("/tmp/sugar.cpp", env = env)
Rcpp_vec <- env[["rcpp_vectorised"]]
Rcpp_devec <- env[["rcpp_devectorised"]]
Rcpp_len <- env[["rcpp_length_vec"]]
attributes::sourceCpp( "/tmp/sugar.cpp")
Rcpp11_vec <- rcpp_vectorised
Rcpp11_devec <- rcpp_devectorised
Rcpp11_len <- rcpp_length_vec
bench <- function(n, niter){
a <- rnorm(n)
b <- rnorm(n)
x <- numeric(n)
do.call( rbind, lapply( list(
Rcpp_vec = replicate( 100, Rcpp_vec(a,b,x,n,niter) ),
Rcpp11_vec = replicate( 100, Rcpp11_vec(a,b,x,n,niter) ),
Rcpp_devec = replicate( 100, Rcpp_devec(a,b,x,n,niter)),
Rcpp11_devec = replicate( 100, Rcpp11_devec(a,b,x,n,niter)),
Rcpp_len = replicate( 100, Rcpp_len(x,a,niter)[[1]] ),
Rcpp11_len = replicate( 100, Rcpp11_len(x,a,niter)[[1]] )
), summary ) )
}
```

n = 2 | n = 10 | n = 100 | n = 1000 | |
---|---|---|---|---|

vectorised (Rcpp) | 14476.33 | 27838.04 | 198466.63 | 1931035.86 |

vectorised (Rcpp11) | 4715.86 | 10003.69 | 44361.24 | 413752.18 |

devectorised (Rcpp) | 3776.75 | 7960.56 | 39628.00 | 410612.11 |

devectorised (Rcpp11) | 3775.82 | 7899.00 | 39644.37 | 410655.31 |

lenghts (Rcpp) | 8517.55 | 8513.23 | 8492.28 | 8501.91 |

lengths (Rcpp11) | 939.88 | 939.88 | 939.88 | 939.88 |

A few things form reading this table:

- For the small vectors, there is a noticable difference of performance between the vectorised and devectorised code using
`Rcpp11`

, but the difference is fixed so its weight becomes smaller as the size of the vector grows - Just getting the lengths of vectors is cheaper with
`Rcpp11`

as we use closer to the metal macro rather than the over cautious`Rf_length`

.

This is I think good enough. A potential source of improvement for the `Rcpp11`

version could be to take advantage of multithreading, either manually or by leveraging a threaded STL implementation. All we would need is a threaded version of `std::transform`

. I’ll save this for later, especially given that we can’t yet use

on windows with the current version of Rtools.

## A design question

Good if you made it all the way down, you might be able to help me. At the moment, `Vector::size()`

delegates to `SHORT_VEC_LENGTH`

(I’ll get to long vector support later) which is cheap but not free. An alternative would be to store the length as part of the C++ object. With this, retrieving the length would be instant, but we would have to calculate it every time we create a `Vector`

or update its underlying data.

I’m happy with both options. Ideas ? Comments ?

**leave a comment**for the author, please follow the link and comment on their blog:

**R Enthusiast and R/C++ hero**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...