Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This gist from @hadley has been on my mind for some time. This was already a follow up of this post from @johnmyleswhite. The problem was that sugar vectorised code suffered a performance penalty compared to devectorised code. This is particularly troubling me because the whole point of sugar is to have nice syntax without paying the price of the abstraction.

The code using sugar looks like this:

void rcpp_vectorised() {
NumericVector a = NumericVector::create(1, 1);
NumericVector b = NumericVector::create(2, 2);
NumericVector x(2);

for (int j = 0; j < 1e6; ++j) {
x = a + b;
}
}


and the devectorised version, using a manual for loop:

void rcpp_devectorised() {
NumericVector a = NumericVector::create(1, 1);
NumericVector b = NumericVector::create(2, 2);
NumericVector x(2);

for (int j = 0; j < 1e6; ++j) {
for (int i = 0; i < 2; ++i) {
x[i] = a[i] + b[i];
}
}
}


So we have two NumericVector a and b of length 2 that we assign into another NumericVector of length 2. Here are benchmark results from Hadley's gist:

microbenchmark(
rcpp_vectorised(),
rcpp_devectorised(),
unit = "s")
# Unit: seconds
#               expr       min       lq   median       uq     max neval
#   rcpp_vectorised() 0.012043 0.012308 0.012453 0.012580 0.01319   100
# rcpp_devectorised() 0.000837 0.000848 0.000865 0.000887 0.00101   100


So quite a difference indeed. There were a few reasons for this difference. The first reason was that sugar needed some modernisation. In the next release of Rcpp11, we've made progress in that direction and some sugar expression directly control how to apply themselves into their target vector. This differs from historic implementations (as found e.g. in Rcpp) which was entirely based on operator[].

The relevant working code for the operator+ between two vectors is now a direct call to std::transform :

template <typename Target>
inline void apply( Target& target ) const {
std::transform( sugar_begin(lhs), sugar_end(lhs), sugar_begin(rhs), target.begin(), op ) ;
}


With this at end, here are some new timings of the same code, but built against Rcpp11:

$Rcpp11Script /tmp/sug.cpp > require(microbenchmark) Loading required package: microbenchmark > microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(), + unit = "s") Unit: seconds expr min lq median uq max neval vec 0.004440250 0.004457024 0.004464254 0.004479742 0.006363814 100 devec 0.001306806 0.001307507 0.001308996 0.001324055 0.001420102 100  The vectorised version is much better than the version in Rcpp: $ RcppScript /tmp/sug.cpp

> require(microbenchmark)

> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+     unit = "s")
Unit: seconds
expr         min          lq      median          uq         max neval
vec 0.014118233 0.014155100 0.014201735 0.014477436 0.015247361   100
devec 0.001256794 0.001259593 0.001269401 0.001278607 0.001394016   100


But still not optimal. However, we can't go much further and I'll try to give some insights here. When we do:

for (int i = 0; i < 2; ++i) {
x[i] = a[i] + b[i];
}


The number of elements to process is known and assumed equal between x, a and b. But when we do:

x = a + b ;


The assignment operator first has to check that the sizes match, so we need to get the length of at least x and a (we'll assume a and b have matching sizes). That's your time difference right there. Getting the length of a vector is not a free operation. It is even quite an expensive operation in Rcpp which delegates this to the Rf_length function from the R api:

INLINE_FUN R_len_t length(SEXP s)
{
int i;
switch (TYPEOF(s)) {
case NILSXP:
return 0;
case LGLSXP:
case INTSXP:
case REALSXP:
case CPLXSXP:
case STRSXP:
case CHARSXP:
case VECSXP:
case EXPRSXP:
case RAWSXP:
return LENGTH(s);
case LISTSXP:
case LANGSXP:
case DOTSXP:
i = 0;
while (s != NULL && s != R_NilValue) {
i++;
s = CDR(s);
}
return i;
case ENVSXP:
return Rf_envlength(s);
default:
return 1;
}
}


This contains redundant code, e.g. we don't need the switch as we already know the R type of the object, that's the whole point of scoping these R objects in C++ classes. And furthermore LENGTH itself has work to do, depending on whether there is support for long vectors it is implemented either like this:

# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))


or directly like this:

# define LENGTH(x)    (((VECSEXP) (x))->vecsxp.length)


So getting the length of a vector is not a free operation. That's why we can't match with the devectorised version in this case because the timings are completely dominated by it.

Here are variations of @hadley's code but doing the measuring internally to reduce various overheads (.Call, ...) :

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]

typedef std::chrono::high_resolution_clock Clock ;
typedef std::chrono::duration<double, std::micro> microseconds ;
typedef Clock::time_point time_point ;

// [[Rcpp::export]]
double rcpp_devectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){
time_point start = Clock::now() ;

for (int j = 0; j < niter; ++j) {
for (int i = 0; i < n; ++i) {
x[i] = a[i] + b[i];
}
}

return std::chrono::duration_cast<microseconds>( Clock::now() - start ).count() ;
}

// [[Rcpp::export]]
double rcpp_vectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){
time_point start = Clock::now() ;

for (int j = 0; j < niter; ++j) {
x = a + b ;
}

return std::chrono::duration_cast<microseconds>( Clock::now() - start ).count() ;
}

// [[Rcpp::export]]
List rcpp_length_vec(NumericVector x, NumericVector a, int niter){
time_point start = Clock::now() ;

int n = 0;
for (int j = 0; j < niter; ++j) {
n += x.size() + a.size() ;
}
time_point end = Clock::now() ;

// here n is returned so that the compiler will not outsmart us and just not do anything.
// which would be likely to happen otherwise
return List::create( std::chrono::duration_cast<microseconds>( end - start ).count(), n ) ;
}


So we have the devectorised, vectorised and a third function calling as many times .size() as needed by the sugar implementation. We can compile this file with both Rcpp and Rcpp11 and make our own benchmarking. An interesting thing to note is that none of these functions allocate between the two ticks, so there's no garbage collector pollution of the timings.

require(microbenchmark)

env <- new.env()
Rcpp::sourceCpp("/tmp/sugar.cpp", env = env)

Rcpp_vec <- env[["rcpp_vectorised"]]
Rcpp_devec <- env[["rcpp_devectorised"]]
Rcpp_len <- env[["rcpp_length_vec"]]

attributes::sourceCpp( "/tmp/sugar.cpp")
Rcpp11_vec <- rcpp_vectorised
Rcpp11_devec <- rcpp_devectorised
Rcpp11_len <- rcpp_length_vec

bench <- function(n, niter){
a <- rnorm(n)
b <- rnorm(n)
x <- numeric(n)

do.call( rbind, lapply( list(
Rcpp_vec     = replicate( 100, Rcpp_vec(a,b,x,n,niter) ),
Rcpp11_vec   = replicate( 100, Rcpp11_vec(a,b,x,n,niter) ),

Rcpp_devec   = replicate( 100, Rcpp_devec(a,b,x,n,niter)),
Rcpp11_devec = replicate( 100, Rcpp11_devec(a,b,x,n,niter)),

Rcpp_len   = replicate( 100, Rcpp_len(x,a,niter)[] ),
Rcpp11_len = replicate( 100, Rcpp11_len(x,a,niter)[] )

), summary ) )
}


n = 2 n = 10 n = 100 n = 1000
vectorised (Rcpp) 14476.33 27838.04 198466.63 1931035.86
vectorised (Rcpp11)
4715.86 10003.69 44361.24 413752.18
devectorised (Rcpp) 3776.75 7960.56 39628.00 410612.11
devectorised (Rcpp11)
3775.82 7899.00 39644.37 410655.31
lenghts (Rcpp) 8517.55 8513.23 8492.28 8501.91
lengths (Rcpp11) 939.88 939.88 939.88 939.88

A few things form reading this table:

• For the small vectors, there is a noticable difference of performance between the vectorised and devectorised code using Rcpp11, but the difference is fixed so its weight becomes smaller as the size of the vector grows
• Just getting the lengths of vectors is cheaper with Rcpp11 as we use closer to the metal macro rather than the over cautious Rf_length.

This is I think good enough. A potential source of improvement for the Rcpp11 version could be to take advantage of multithreading, either manually or by leveraging a threaded STL implementation. All we would need is a threaded version of std::transform. I'll save this for later, especially given that we can't yet use <thread> on windows with the current version of Rtools.

## A design question

Good if you made it all the way down, you might be able to help me. At the moment, Vector::size() delegates to SHORT_VEC_LENGTH (I'll get to long vector support later) which is cheap but not free. An alternative would be to store the length as part of the C++ object. With this, retrieving the length would be instant, but we would have to calculate it every time we create a Vector or update its underlying data.

I'm happy with both options. Ideas ? Comments ?