**R Enthusiast and R/C++ hero**, and kindly contributed to R-bloggers)

I’ve been playing with parallelising Rcpp11 implementation of sugar. For example, we have a `NumericVector`

variable `x`

and we want to compute e.g. `sqrt(exp(x)) + 2.0`

. With sugar, we can do:

```
NumericVector y = sqrt(exp(x)) + 2.0 ;
```

and this does not generate useless temporaries as what R would do. This pitch has sailed already. Here are some benchmarks with a serial sugar based version:

```
// [[export]]
NumericVector cpp_serial( NumericVector x ){
NumericVector y = sqrt(exp(x)) + 2.0 ;
return y ;
}
```

and a pure R version:

```
R <- function(x) sqrt(exp(x)) + 2.0
```

we get these results:

```
> x <- rnorm(1e7)
> microbenchmark(R(x), cpp_serial(x), times = 10)
Unit: milliseconds
expr min lq median uq max neval
R(x) 233.4660 233.6885 234.0395 234.5152 265.7339 10
cpp_serial(x) 155.5668 155.8760 156.4211 164.3740 164.8633 10
```

Nice, we got speedup. But we are still just doing a serial computation, therefore not taking advantage of available hardware concurrency.

Concurrency is part of the C++11 standard, and using threads and related facilities has never been easier. I’ve started to work on taking advantage of that for sugar and after a few hours I’m happy with it. I’ve pushed the code to the master development branch on github. It is not enabled by default because unfortunately C++11 support available on some platforms is not complete and does not support the

header. I’ll keep my ranting about windows for another … thread. Anyway to enable the feature, you need to define the `RCPP11_EXPERIMENTAL_PARALLEL`

macro.

A parallel version of the previous code looks like that:

```
#define RCPP11_EXPERIMENTAL_PARALLEL
#include
```
using namespace Rcpp11 ;
// [[export]]
NumericVector cpp_4threads( NumericVector x ){
NumericVector y = threads(4) >> sqrt(exp(x)) + 2.0 ;
return y ;
}

The syntax is not set in stone, but I like how the specification of the number of threads to use and the actual sugar expression are separated. With `auto`

we can split the code into two statements without losing performance.

```
// [[export]]
NumericVector cpp_4threads( NumericVector x ){
auto fudge = sqrt(exp(x)) + 2.0 ;
NumericVector y = threads(4) >> fudge ;
return y ;
}
```

With the same data as before, I get these benchmark results:

```
$ Rcpp11Script -v /tmp/test.cpp
/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB 'file163f711c09c0.cpp'
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp11/include" -fPIC -O3 -c file163f711c09c0.cpp -o file163f711c09c0.o
clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o file163f711c09c0.so file163f711c09c0.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
> x <- rnorm(1e+07)
> require(microbenchmark)
Loading required package: microbenchmark
> R <- function(x) sqrt(exp(x)) + 2
> microbenchmark(R(x), cpp_serial(x), cpp_4threads(x),
+ times = 10)
Unit: milliseconds
expr min lq median uq max neval
R(x) 233.5048 234.21418 242.78265 243.67149 246.35764 10
cpp_serial(x) 155.4508 155.57387 155.80123 164.56398 188.42263 10
cpp_4threads(x) 55.1221 55.38649 56.85528 64.20854 65.46276 10
```

Of course, as usual with threading, we don’t get a `4x`

performance improvement just because we use 4 threads, but a `3x`

is already pretty good.

Don’t just go litter your code with `threads(4) >>`

everywhere though, the performance gain depends on many factors such as the hardware, the data size and the individual cost of the scalar operation. If data size is too small, using threading can have a negative impact on performance, as the timings are dominated by the overhead induced by creating threads, …

For these reasons, this feature is not automatically used yet, i.e. you don’t get parallel sugar for free, you need to explicitely express it with the `threads(4) >>`

syntax. This might change in the future.

### Gory details

For the fans of the gory details, internally, all of this is implemented by a parallel version of `std::transform`

:

`template `
void transform( int nthreads, InputIterator begin, InputIterator end, OutputIterator target, Function fun ){
std::vector workers(nthreads-1) ;
R_xlen_t chunk_size = std::distance(begin, end) / nthreads ;
R_xlen_t start=0;
for( int i=0; i,
begin + start, begin + start + chunk_size,
target + start,
fun) ;
}
std::transform( begin + start, end, target + start, fun ) ;
for( int i=0; i

```
```

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Enthusiast and R/C++ hero**.

R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...