Summing a Vector in Parallel with RcppParallel

June 28, 2014

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

The RcppParallel package includes
high level functions for doing parallel programming with Rcpp. For example,
the parallelReduce function can be used aggreggate values from a set of
inputs in parallel. This article describes using RcppParallel to sum an R

Serial Version

First a serial version of computing the sum of a vector. For this we use
a simple call to the STL std::accumulate function:

using namespace Rcpp;


// [[Rcpp::export]]
double vectorSum(NumericVector x) {
   return std::accumulate(x.begin(), x.end(), 0.0);

Parallel Version

Now we adapt our code to run in parallel. We’ll use the parallelReduce
function to do this. As with the previous
describing parallelFor,
we implement a “Worker” function object with our logic and RcppParallel takes
care of scheduling work on threads and calling our function when required. For
parallelReduce the function object has three jobs:

  1. Implement a standard and “splitting” constructor. The standard constructor
    takes a pointer to the array that will be traversed and sets it’s sum
    variable to 0. The splitting constructor is called when work needs to be
    split onto other threads—it takes a reference to the instance it is being
    split from and simply copies the pointer to the input array and sets it’s
    internal sum to 0.

  2. Implement operator() to perform the summing. Here we just call
    std::accumulate as we did in the serial version, but limit the accumulation
    to the items specified by the begin and end arguments (note that other
    threads will have been given the task of processing other items in the input
    array). We save the accumulated value in our value member variable.

  3. Finally, we implement a join method which composes the operations of two
    Sum instances that were previously split. Here we simply add the accumulated
    sum of the instance we are being joined with to our own.

Here’s the definition of the Sum function object:

// [[Rcpp::depends(RcppParallel)]]
using namespace RcppParallel;

struct Sum : public Worker
   // source vector
   const RVector<double> input;
   // accumulated value
   double value;
   // constructors
   Sum(const NumericVector input) : input(input), value(0) {}
   Sum(const Sum& sum, Split) : input(sum.input), value(0) {}
   // accumulate just the element of the range I've been asked to
   void operator()(std::size_t begin, std::size_t end) {
      value += std::accumulate(input.begin() + begin, input.begin() + end, 0.0);
   // join my value with that of another Sum
   void join(const Sum& rhs) { 
      value += rhs.value; 

Note that Sum derives from the RcppParallel::Worker class. This is
required for function objects passed to parallelReduce.

Note also that we use the RVector type for accessing the vector.
This is because this code will execute on a background thread where it’s not
safe to call R or Rcpp APIs. The RVector class is included in the
RcppParallel package and provides a lightweight, thread-safe wrapper around R

Now that we’ve defined the functor, implementing the parallel sum
function is straightforward. Just initialize an instance of Sum
with the input vector and call parallelReduce:

// [[Rcpp::export]]
double parallelVectorSum(NumericVector x) {
   // declare the SumBody instance 
   Sum sum(x);
   // call parallel_reduce to start the work
   parallelReduce(0, x.length(), sum);
   // return the computed sum
   return sum.value;


A comparison of the performance of the two functions shows the parallel
version performing about 4 times as fast on a machine with 4 cores:

# allocate a vector
v <- as.numeric(c(1:10000000))

# ensure that serial and parallel versions give the same result
stopifnot(identical(vectorSum(v), parallelVectorSum(v)))

# compare performance of serial and parallel
res <- benchmark(vectorSum(v),
                  test replications elapsed relative
2 parallelVectorSum(v)          100   0.182    1.000
1         vectorSum(v)          100   0.857    4.709

Note that performance gains will typically be 30-50% less on Windows systems
as a result of less sophisticated thread scheduling (RcppParallel does not
currently use TBB on Windows
whereas it does on the Mac and Linux).

You can learn more about using RcppParallel at

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)