Site icon R-bloggers

Converting C code to C++ code: An example from plyr

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The plyr package uses a couple of small C functions to optimise a number of particularly bad bottlenecks. Recently, two functions were converted to C++. This was mostly stimulated by a segmentation fault caused by some large inputs to the split_indices() function: rather than figuring out exactly what was going wrong with the complicated C code, it was easier to rewrite with simple, correct C++ code.

The job of split_indices() is simple: given a vector of integers, x, it returns a list where the i-th element of the list is an integer vector containing the positions of x equal to i. This is a useful building block for many of the functions in plyr.

It is fairly easy to see what is going on the in the C++ code:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::exports]]
std::vector<std::vector<int> > split_indices(IntegerVector x, int n = 0) {
    if (n < 0) stop("n must be a positive integer");
  
    std::vector<std::vector<int> > ids(n);
  
    int nx = x.size();
    for (int i = 0; i < nx; ++i) {
        if (x[i] > n) {
           ids.resize(x[i]);
        }
    
        ids[x[i] - 1].push_back(i + 1);
    }
  
    return ids;
}

The code is simple, easy to understand (if one is a little familiar with the STL), and performant. Compare it to the original C code:

#include <R.h>
#include <Rdefines.h>

SEXP split_indices(SEXP group, SEXP n) {
    SEXP vec;
    int i, j, k;

    int nlevs = INTEGER(n)[0];
    int nobs = LENGTH(group);  
    int *pgroup = INTEGER(group);
  
    // Count number of cases in each group
    int counts[nlevs];
    for (i = 0; i < nlevs; i++)
        counts[i] = 0;
    for (i = 0; i < nobs; i++) {
        j = pgroup[i];
        if (j > nlevs) error("n smaller than largest index");
        counts[j - 1]++;
    }

    // Allocate storage for results
    PROTECT(vec = allocVector(VECSXP, nlevs));
    for (i = 0; i < nlevs; i++) {
        SET_VECTOR_ELT(vec, i, allocVector(INTSXP, counts[i]));
    }

    // Put indices in groups
    for (i = 0; i < nlevs; i++) {
        counts[i] = 0;
    }
    for (i = 0; i < nobs; i++) {
        j = pgroup[i] - 1;
        k = counts[j];
        INTEGER(VECTOR_ELT(vec, j))[k] = i + 1;
        counts[j]++;
    }
    UNPROTECT(1);
    return vec;
}

This function is almost three times as long, and has a bug in it. It is substantially more complicated because it:

Conversion to C++ can make code shorter and easier to understand and maintain, while remaining just as performant.

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.