Vectorized vs Devectorized

[This article was first published on R Enthusiast and R/C++ hero, and kindly contributed to R-bloggers.]

This gist from @hadley has been on my mind for some time. It was itself a follow-up to this post from @johnmyleswhite. The problem was that sugar vectorised code suffered a performance penalty compared to devectorised code. This is particularly troubling because the whole point of sugar is to have nice syntax without paying the price of the abstraction.

The code using sugar looks like this:

void rcpp_vectorised() {  
    NumericVector a = NumericVector::create(1, 1);
    NumericVector b = NumericVector::create(2, 2);
    NumericVector x(2);

    for (int j = 0; j < 1e6; ++j) {
          x = a + b;
    }
}

and the devectorised version, using a manual for loop:

void rcpp_devectorised() {  
  NumericVector a = NumericVector::create(1, 1);
  NumericVector b = NumericVector::create(2, 2);
  NumericVector x(2);

  for (int j = 0; j < 1e6; ++j) {
      for (int i = 0; i < 2; ++i) {
          x[i] = a[i] + b[i];
      }
  }
}

So we have two NumericVectors a and b of length 2 whose sum we assign into another NumericVector x of length 2. Here are benchmark results from Hadley's gist:

microbenchmark(  
  rcpp_vectorised(),
  rcpp_devectorised(), 
  unit = "s")
# Unit: seconds
#               expr       min       lq   median       uq     max neval
#   rcpp_vectorised() 0.012043 0.012308 0.012453 0.012580 0.01319   100
# rcpp_devectorised() 0.000837 0.000848 0.000865 0.000887 0.00101   100

So quite a difference indeed. There were a few reasons for it. The first was that sugar needed some modernisation. In the next release of Rcpp11, we've made progress in that direction, and some sugar expressions now directly control how to apply themselves into their target vector. This differs from historic implementations (as found e.g. in Rcpp), which were entirely based on operator[].

The relevant working code for operator+ between two vectors is now a direct call to std::transform:

template <typename Target>  
inline void apply( Target& target ) const {  
    std::transform( sugar_begin(lhs), sugar_end(lhs), sugar_begin(rhs), target.begin(), op ) ;       
}

With this at hand, here are some new timings of the same code, but built against Rcpp11:

$ Rcpp11Script /tmp/sug.cpp

> require(microbenchmark)
Loading required package: microbenchmark

> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+     unit = "s")
Unit: seconds  
  expr         min          lq      median          uq         max neval
   vec 0.004440250 0.004457024 0.004464254 0.004479742 0.006363814   100
 devec 0.001306806 0.001307507 0.001308996 0.001324055 0.001420102   100

The vectorised version is much better than the version in Rcpp:

$ RcppScript /tmp/sug.cpp

> require(microbenchmark)
Loading required package: microbenchmark

> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+     unit = "s")
Unit: seconds  
  expr         min          lq      median          uq         max neval
   vec 0.014118233 0.014155100 0.014201735 0.014477436 0.015247361   100
 devec 0.001256794 0.001259593 0.001269401 0.001278607 0.001394016   100

But still not optimal. However, we can't go much further and I'll try to give some insights here. When we do:

for (int i = 0; i < 2; ++i) {  
    x[i] = a[i] + b[i];
}

The number of elements to process is known and assumed equal between x, a and b. But when we do:

x = a + b ;  

The assignment operator first has to check that the sizes match, so we need to get the length of at least x and a (we'll assume a and b have matching sizes). That's your time difference right there. Getting the length of a vector is not a free operation. It is even quite an expensive operation in Rcpp, which delegates it to the Rf_length function from the R API:

INLINE_FUN R_len_t length(SEXP s)
{
    int i;
    switch (TYPEOF(s)) {
    case NILSXP:
        return 0;
    case LGLSXP:
    case INTSXP:
    case REALSXP:
    case CPLXSXP:
    case STRSXP:
    case CHARSXP:
    case VECSXP:
    case EXPRSXP:
    case RAWSXP:
        return LENGTH(s);
    case LISTSXP:
    case LANGSXP:
    case DOTSXP:
        i = 0;
        while (s != NULL && s != R_NilValue) {
            i++;
            s = CDR(s);
        }
        return i;
    case ENVSXP:
        return Rf_envlength(s);
    default:
        return 1;
    }
}

This contains redundant work: for example, we don't need the switch since we already know the R type of the object; that's the whole point of wrapping these R objects in C++ classes. Furthermore, LENGTH itself has work to do. Depending on whether there is support for long vectors, it is implemented either like this:

# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))

or directly like this:

# define LENGTH(x)    (((VECSEXP) (x))->vecsxp.length)

So getting the length of a vector is not a free operation. That's why we can't match the devectorised version in this case: the timings are completely dominated by length retrieval.
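What the vectorised assignment has to do before any arithmetic happens can be sketched in plain C++. This is a hypothetical illustration with std::vector stand-ins, not Rcpp's actual classes: both sides get their length queried, and in Rcpp each of those queries goes through Rf_length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical expression object standing in for a + b.
struct sum_expr {
    const std::vector<double>& lhs;
    const std::vector<double>& rhs;
    std::size_t size() const { return lhs.size(); }               // length query #1
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Sketch of the vectorised assignment x = a + b.
void assign(std::vector<double>& x, const sum_expr& e) {
    if (x.size() != e.size())                                     // length query #2
        throw std::invalid_argument("size mismatch");
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = e[i];
}
```

The manual devectorised loop skips both queries entirely, because the programmer has baked the length into the loop bound.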

Here are variations of @hadley's code, but doing the measuring internally to reduce various overheads (.Call, ...):

#include <Rcpp.h>
#include <chrono>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]

typedef std::chrono::high_resolution_clock Clock ;
typedef std::chrono::duration<double, std::micro> microseconds ;
typedef Clock::time_point time_point ;

// [[Rcpp::export]]
double rcpp_devectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){  
  time_point start = Clock::now() ;

  for (int j = 0; j < niter; ++j) {
    for (int i = 0; i < n; ++i) {
      x[i] = a[i] + b[i];
    }
  }

  return std::chrono::duration_cast<microseconds>( Clock::now() - start ).count() ;
}

// [[Rcpp::export]]
double rcpp_vectorised(NumericVector a, NumericVector b, NumericVector x, int n, int niter ){  
  time_point start = Clock::now() ;

  for (int j = 0; j < niter; ++j) {
    x = a + b ;
  }

  return std::chrono::duration_cast<microseconds>( Clock::now() - start ).count() ;
}

// [[Rcpp::export]]
List rcpp_length_vec(NumericVector x, NumericVector a, int niter){  
  time_point start = Clock::now() ;

  int n = 0;
  for (int j = 0; j < niter; ++j) {
    n += x.size() + a.size() ;
  }
  time_point end = Clock::now() ;

  // n is returned so that the compiler will not outsmart us and optimise the
  // loop away, which would be likely to happen otherwise.
  return List::create( std::chrono::duration_cast<microseconds>( end - start ).count(), n ) ;
}

So we have the devectorised, vectorised and a third function calling as many times .size() as needed by the sugar implementation. We can compile this file with both Rcpp and Rcpp11 and make our own benchmarking. An interesting thing to note is that none of these functions allocate between the two ticks, so there's no garbage collector pollution of the timings.

require(microbenchmark)

env <- new.env()  
Rcpp::sourceCpp("/tmp/sugar.cpp", env = env)

Rcpp_vec <- env[["rcpp_vectorised"]]  
Rcpp_devec <- env[["rcpp_devectorised"]]  
Rcpp_len <- env[["rcpp_length_vec"]]

attributes::sourceCpp( "/tmp/sugar.cpp")  
Rcpp11_vec <- rcpp_vectorised  
Rcpp11_devec <- rcpp_devectorised  
Rcpp11_len <- rcpp_length_vec

bench <- function(n, niter){  
    a <- rnorm(n)
    b <- rnorm(n)
    x <- numeric(n)

    do.call( rbind, lapply( list( 
        Rcpp_vec     = replicate( 100, Rcpp_vec(a,b,x,n,niter) ),
        Rcpp11_vec   = replicate( 100, Rcpp11_vec(a,b,x,n,niter) ),

        Rcpp_devec   = replicate( 100, Rcpp_devec(a,b,x,n,niter)),
        Rcpp11_devec = replicate( 100, Rcpp11_devec(a,b,x,n,niter)), 

        Rcpp_len   = replicate( 100, Rcpp_len(x,a,niter)[[1]] ),
        Rcpp11_len = replicate( 100, Rcpp11_len(x,a,niter)[[1]] )

    ), summary ) )
}



Timings (microseconds):

                          n = 2      n = 10     n = 100     n = 1000
vectorised (Rcpp)      14476.33    27838.04   198466.63   1931035.86
vectorised (Rcpp11)     4715.86    10003.69    44361.24    413752.18
devectorised (Rcpp)     3776.75     7960.56    39628.00    410612.11
devectorised (Rcpp11)   3775.82     7899.00    39644.37    410655.31
lengths (Rcpp)          8517.55     8513.23     8492.28      8501.91
lengths (Rcpp11)         939.88      939.88      939.88       939.88

A few things from reading this table:

  • For small vectors, there is a noticeable performance difference between the vectorised and devectorised code using Rcpp11, but the difference is fixed, so its weight becomes smaller as the size of the vector grows.
  • Just getting the lengths of vectors is cheaper with Rcpp11, as we use a closer-to-the-metal macro rather than the overcautious Rf_length.

This is, I think, good enough. A potential source of improvement for the Rcpp11 version could be to take advantage of multithreading, either manually or by leveraging a threaded STL implementation. All we would need is a threaded version of std::transform. I'll save this for later, especially given that we can't yet use <thread> on Windows with the current version of Rtools.
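For the curious, a threaded std::transform could be sketched roughly like this. This is purely illustrative (transform2 is an invented name); a real implementation would pick the number of threads and chunk sizes carefully, and fall back to the serial version for small inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Rough sketch: split the range in two and let a second thread handle the
// lower half while the calling thread handles the upper half.
template <typename It, typename Out, typename Op>
void transform2(It first1, It last1, It first2, Out out, Op op) {
    std::ptrdiff_t half = (last1 - first1) / 2;
    std::thread t([=] {
        std::transform(first1, first1 + half, first2, out, op);
    });
    std::transform(first1 + half, last1, first2 + half, out + half, op);
    t.join();
}
```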

A design question

Good, you made it all the way down here; you might be able to help me. At the moment, Vector::size() delegates to SHORT_VEC_LENGTH (I'll get to long vector support later), which is cheap but not free. An alternative would be to store the length as part of the C++ object. With this, retrieving the length would be instant, but we would have to compute it every time we create a Vector or update its underlying data.
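The second option could look roughly like this. This is a hypothetical sketch (CachedVector is an invented name, and a std::vector stands in for the underlying SEXP): the length is cached at construction, so size() is a plain member read, and the cost moves to every point where the underlying data changes.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a vector wrapper that caches its length in the C++ object.
class CachedVector {
    std::vector<double>* data_;   // stand-in for the underlying SEXP
    std::size_t n_;               // cached length

public:
    explicit CachedVector(std::vector<double>& d) : data_(&d), n_(d.size()) {}

    // Every rebinding of the underlying data must refresh the cache.
    void rebind(std::vector<double>& d) {
        data_ = &d;
        n_ = d.size();
    }

    std::size_t size() const { return n_; }   // instant: no call into R
    double& operator[](std::size_t i) { return (*data_)[i]; }
};
```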

I'm happy with both options. Ideas? Comments?
