Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

[This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Background

R dogma is that for loops are bad because they are slow but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.

Passing Bablok Regression

As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?

Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.

PB.reg <- function(x,y){
  lx <- length(x)
  l <- lx*(lx - 1)/2
  k <- 0
  S <- rep(NA, lx)
  for (i in 1:(lx - 1)) {
    for (j in (i + 1):lx) {
            k <- k + 1
            S[k] <- (y[i] - y[j])/(x[i] - x[j])
    }
  }
  S.sort <- sort(S)
  N <- length(S.sort)
  neg <- length(subset(S.sort,S.sort < 0))
  K <- floor(neg/2)
  if (N %% 2 == 1) {
      b <- S.sort[(N+1)/2+K]
  } else {
      b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
  }
  a <- median(y - b * x)
  return(as.vector(c(a,b)))
}

So let’s make some fake data and see what we get:

set.seed(314)
x <- seq(1, 100, length.out = 100)
y <- 1.1* x + 20 + x*rnorm(100,0,0.10)
reg <- PB.reg(x,y)
round(reg,2)

## [1] 18.90  1.14

Just to sanity check, we can get the coefficients from the mcr() package.

library(mcr)
reg.mcr <- mcreg(x,y)
round([email protected],2)

## [1] 18.90  1.14

Ok, looks good.

Rcpp

The Rcpp package permits compiling and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at recapitulation of the R script above.

#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <numeric>
#include "Rcpp.h"
using namespace std;
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::export]]

vector <double> PB(vector<double> x,vector<double> y){
  int lx=x.size();
  int l=lx*(lx-1)/2;
  int k=-1;
  vector <double> S(l,numeric_limits<double>::quiet_NaN());
  vector <double> sortS;
  vector <double> neg;
  double b;
  double a;
  vector <double> aVec(x.size());
  vector <double> coef(2);

  // Calculate the components of the S Matrix
  for(int i=0; i<lx-1;i++){
    for(int j=i+1; j<lx;j++){
        k+=1;
        S[k]=(y[i]-y[j])/(x[i]-x[j]);
    }
  }

  // Sort the S Matrix
  sort(S.begin(),S.end());

  // Delete any undefined values from S
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if (!isnan(*it)) sortS.push_back((*it));
  }

  // Put all the negative values into a vector called neg
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if ((*it)<0) neg.push_back((*it));
  }

  // Calculate N and K
  int K=int(floor(float(neg.size()/2)));
  int N=sortS.size();

  // Calculate the slope
  if (N%2==1) {
    b=S[static_cast<int> ((N+1)/2+K-1)];
  } else {
    b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);
  }

  // Make a vector aVec of the estimates of the intercept
  for(int i=0; i<x.size(); i++){
      aVec[i]=y[i]-b*x[i];
  }
  sort(aVec.begin(),aVec.end());

  // Calculate the median from the sorted values of aVec
  if(x.size()%2==1){
    a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];
  }else{
    a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;
  }

  // Report results
  coef[0]=a;
  coef[1]=b;
  return coef;
}

int main(){
  return 0;
}

Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:

library(Rcpp)
sourceCpp("PB_Reg.cpp")

And run it:

round(PB(x,y),2)

## [1] 18.90  1.14

Looks good! But is it faster? Like really faster? Let’s find out. And let’s also compare it to what the compiler() package produces:

library(compiler)
PB.reg.cmpfun <- cmpfun(PB.reg)

library(rbenchmark)
bmark <- benchmark(
  Rscript = PB.reg(x,y),
  `Compiler Package` = PB.reg.cmpfun(x,y),
  Rcpp = PB(x,y),
  replications = 100,
  order = "relative"
)
knitr::kable(bmark)

test replications elapsed relative user.self sys.self user.child sys.child
3 Rcpp 100 0.024 1.000 0.024 0.000 0 0
2 Compiler Package 100 0.214 8.917 0.214 0.000 0 0
1 Rscript 100 0.217 9.042 0.213 0.004 0 0

So this makes C++ code about 9x faster than R source code. Not bad but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler() package did not help much in this case as it is only marginally better than raw R source. I have seen better results from compiler() in other cases.

Final thought

Speaking of speed:

“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”

Proverbs 6:5


To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)