Converting C code to C++ code: An example from plyr

December 2, 2013
By

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

The plyr package uses a couple of small C functions to optimise a number of particularly bad bottlenecks. Recently, two functions were converted to C++. This was mostly stimulated by a segmentation fault caused by some large inputs to the split_indices() function: rather than figuring out exactly what was going wrong with the complicated C code, it was easier to rewrite with simple, correct C++ code.

The job of split_indices() is simple: given a vector of integers, x, it returns a list where the i-th element of the list is an integer vector containing the positions of x equal to i. This is a useful building block for many of the functions in plyr.

It is fairly easy to see what is going on the in the C++ code:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::exports]]
std::vector<std::vector<int> > split_indices(IntegerVector x, int n = 0) {
    if (n < 0) stop("n must be a positive integer");
  
    std::vector<std::vector<int> > ids(n);
  
    int nx = x.size();
    for (int i = 0; i < nx; ++i) {
        if (x[i] > n) {
           ids.resize(x[i]);
        }
    
        ids[x[i] - 1].push_back(i + 1);
    }
  
    return ids;
}
  • We create a std::vector of integers called out. This will grow efficiently as we add new values, and Rcpp will automatically convert to a list of integer vectors when returned to R.

  • The loop iterates through each element of x, adding its index to the end of out. It also makes sure that out is long enough. (The plus and minus ones are needed because C++ uses 0 based indices and R uses 1 based indices.)

The code is simple, easy to understand (if one is a little familiar with the STL), and performant. Compare it to the original C code:

#include <R.h>
#include <Rdefines.h>

SEXP split_indices(SEXP group, SEXP n) {
    SEXP vec;
    int i, j, k;

    int nlevs = INTEGER(n)[0];
    int nobs = LENGTH(group);  
    int *pgroup = INTEGER(group);
  
    // Count number of cases in each group
    int counts[nlevs];
    for (i = 0; i < nlevs; i++)
        counts[i] = 0;
    for (i = 0; i < nobs; i++) {
        j = pgroup[i];
        if (j > nlevs) error("n smaller than largest index");
        counts[j - 1]++;
    }

    // Allocate storage for results
    PROTECT(vec = allocVector(VECSXP, nlevs));
    for (i = 0; i < nlevs; i++) {
        SET_VECTOR_ELT(vec, i, allocVector(INTSXP, counts[i]));
    }

    // Put indices in groups
    for (i = 0; i < nlevs; i++) {
        counts[i] = 0;
    }
    for (i = 0; i < nobs; i++) {
        j = pgroup[i] - 1;
        k = counts[j];
        INTEGER(VECTOR_ELT(vec, j))[k] = i + 1;
        counts[j]++;
    }
    UNPROTECT(1);
    return vec;
}

This function is almost three times as long, and has a bug in it. It is substantially more complicated because it:

  • has to take care of memory management with PROTECT and UNPROTECT; Rcpp takes care of this for us

  • needs an additional loop through the data to determine how long each vector should be; the std::vector grows efficiently and eliminates this problem

Conversion to C++ can make code shorter and easier to understand and maintain, while remaining just as performant.

To leave a comment for the author, please follow the link and comment on his blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.