# sugar in parallel

June 18, 2014

(This article was first published on R Enthusiast and R/C++ hero, and kindly contributed to R-bloggers)

I've been playing with parallelising the Rcpp11 implementation of sugar. For example, suppose we have a NumericVector x and we want to compute sqrt(exp(x)) + 2.0. With sugar, we can write:

NumericVector y = sqrt(exp(x)) + 2.0 ;


and, unlike the equivalent R code, this does not generate useless intermediate temporaries. That ship has sailed already. Here are some benchmarks with a serial sugar-based version:

// [[export]]
NumericVector cpp_serial( NumericVector x ){
    NumericVector y = sqrt(exp(x)) + 2.0 ;
    return y ;
}


and a pure R version:

R <- function(x) sqrt(exp(x)) + 2.0


We get these results:

> x <- rnorm(1e7)
> microbenchmark(R(x), cpp_serial(x), times = 10)
Unit: milliseconds
expr      min       lq   median       uq      max neval
R(x) 233.4660 233.6885 234.0395 234.5152 265.7339    10
cpp_serial(x) 155.5668 155.8760 156.4211 164.3740 164.8633    10


Nice, we get a speedup. But we are still doing a serial computation, and therefore not taking advantage of the available hardware concurrency.
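The way sugar avoids those temporaries is through expression templates: the expression records the computation and each element is produced in a single pass at assignment time. Here is a minimal, hypothetical sketch of the idea for this one expression (a toy, not the actual Rcpp11 machinery):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy lazy expression: computes sqrt(exp(x[i])) + 2.0 on demand, so
// materialising it is a single pass with no intermediate vectors.
struct SqrtExpPlus2 {
    const std::vector<double>& x ;
    double operator[]( std::size_t i ) const {
        return std::sqrt( std::exp( x[i] ) ) + 2.0 ;
    }
    std::size_t size() const { return x.size() ; }
} ;

// materialise the expression into a plain vector
std::vector<double> eval( const SqrtExpPlus2& expr ){
    std::vector<double> out( expr.size() ) ;
    for( std::size_t i = 0 ; i < out.size() ; i++ ) out[i] = expr[i] ;
    return out ;
}
```

A full sugar implementation generalises this with templated expression nodes for every operator and function, but the evaluation strategy is the same.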

Concurrency is part of the C++11 standard, and using threads and related facilities has never been easier. I've started to take advantage of that for sugar, and after a few hours of work I'm happy with the result. I've pushed the code to the master development branch on GitHub. It is not enabled by default because, unfortunately, the C++11 support available on some platforms is incomplete and lacks the <thread> header. I'll keep my ranting about Windows for another ... thread. Anyway, to enable the feature, you need to define the RCPP11_EXPERIMENTAL_PARALLEL macro.
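To illustrate how little boilerplate C++11 threads need, here is a small standalone sketch using only the standard library (nothing Rcpp11-specific; the function name is made up for the example). Each thread writes to its own slot, so no synchronisation is required:

```cpp
#include <thread>
#include <vector>

// Launch one thread per element; each thread writes only its own slot,
// so the threads never race with each other.
std::vector<int> squares_in_parallel( int n ){
    std::vector<int> results( n, 0 ) ;
    std::vector<std::thread> workers ;
    for( int i = 0 ; i < n ; i++ )
        workers.emplace_back( [&results, i]{ results[i] = i * i ; } ) ;
    for( auto& w : workers ) w.join() ;
    return results ;
}
```

That's the whole machinery: construct a std::thread with a callable, then join it. Sugar's parallel mode builds on exactly these primitives.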

A parallel version of the previous code looks like this:

#define RCPP11_EXPERIMENTAL_PARALLEL

#include <Rcpp11>
using namespace Rcpp11 ;

// [[export]]
NumericVector cpp_4threads( NumericVector x ){
    NumericVector y = threads(4) >> sqrt(exp(x)) + 2.0 ;
    return y ;
}


The syntax is not set in stone, but I like how the specification of the number of threads and the actual sugar expression are separated. With auto, we can even split the code into two statements without losing performance:

// [[export]]
NumericVector cpp_4threads( NumericVector x ){
    auto fudge = sqrt(exp(x)) + 2.0 ;
    NumericVector y = threads(4) >> fudge ;
    return y ;
}


With the same data as before, I get these benchmark results:

$ Rcpp11Script -v /tmp/test.cpp
/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB 'file163f711c09c0.cpp'
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include  -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp11/include"   -fPIC  -O3 -c file163f711c09c0.cpp -o file163f711c09c0.o
clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o file163f711c09c0.so file163f711c09c0.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

> x <- rnorm(1e+07)

> require(microbenchmark)

> R <- function(x) sqrt(exp(x)) + 2

> microbenchmark(R(x), cpp_serial(x), cpp_4threads(x),
+     times = 10)
Unit: milliseconds
expr      min        lq    median        uq       max neval
R(x) 233.5048 234.21418 242.78265 243.67149 246.35764    10
cpp_serial(x) 155.4508 155.57387 155.80123 164.56398 188.42263    10
cpp_4threads(x)  55.1221  55.38649  56.85528  64.20854  65.46276    10


Of course, as usual with threading, we don't get a 4x performance improvement just because we use 4 threads, but a roughly 3x speedup is already pretty good.

Don't just go littering your code with threads(4) >> everywhere, though: the performance gain depends on many factors, such as the hardware, the size of the data, and the cost of the individual scalar operation. If the data is too small, threading can even hurt performance, as the timings become dominated by the overhead of creating the threads.

For these reasons, this feature is not used automatically yet, i.e. you don't get parallel sugar for free; you need to request it explicitly with the threads(4) >> syntax. This might change in the future.

### Gory details

For fans of the gory details: internally, all of this is implemented by a parallel version of std::transform:

template <typename InputIterator, typename OutputIterator, typename Function>
void transform( int nthreads, InputIterator begin, InputIterator end, OutputIterator target, Function fun ){
    std::vector<std::thread> workers(nthreads-1) ;
    R_xlen_t chunk_size = std::distance(begin, end) / nthreads ;
    R_xlen_t start = 0 ;
    // hand the first nthreads-1 chunks to worker threads
    for( int i=0; i<nthreads-1; i++, start+=chunk_size){
        workers[i] = std::thread( std::transform<InputIterator, OutputIterator, Function>,
            begin + start, begin + start + chunk_size,
            target + start,
            fun ) ;
    }
    // the calling thread handles the last chunk, including any remainder
    std::transform( begin + start, end, target + start, fun ) ;
    for( int i=0; i<nthreads-1; i++) workers[i].join() ;
}
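The same chunking scheme can be sketched outside of R with nothing but the standard library. This is a hypothetical standalone rewrite, not the Rcpp11 source: it wraps each chunk in a lambda rather than instantiating std::transform as a function pointer, and it assumes random-access iterators so that iterator arithmetic is valid:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split [begin, end) into nthreads chunks: worker threads process the
// first nthreads-1 chunks, the calling thread processes the rest
// (including the remainder when the length is not divisible by nthreads).
template <typename It, typename Out, typename Fun>
void parallel_transform( int nthreads, It begin, It end, Out target, Fun fun ){
    std::vector<std::thread> workers ;
    std::ptrdiff_t chunk_size = std::distance( begin, end ) / nthreads ;
    std::ptrdiff_t start = 0 ;
    for( int i = 0 ; i < nthreads - 1 ; i++, start += chunk_size ){
        workers.emplace_back( [=]{
            std::transform( begin + start, begin + start + chunk_size,
                            target + start, fun ) ;
        } ) ;
    }
    std::transform( begin + start, end, target + start, fun ) ;
    for( auto& w : workers ) w.join() ;
}
```

Because every chunk writes to a disjoint slice of the output, no locking is needed; the only synchronisation points are the join calls at the end.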