Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Background
R dogma is that for loops are bad because they are slow but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.
Passing Bablok Regression
As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?
Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.
PB.reg <- function(x,y){
lx <- length(x)
l <- lx*(lx - 1)/2
k <- 0
S <- rep(NA, lx)
for (i in 1:(lx - 1)) {
for (j in (i + 1):lx) {
k <- k + 1
S[k] <- (y[i] - y[j])/(x[i] - x[j])
}
}
S.sort <- sort(S)
N <- length(S.sort)
neg <- length(subset(S.sort,S.sort < 0))
K <- floor(neg/2)
if (N %% 2 == 1) {
b <- S.sort[(N+1)/2+K]
} else {
b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
}
a <- median(y - b * x)
return(as.vector(c(a,b)))
}So let’s make some fake data and see what we get:
set.seed(314) x <- seq(1, 100, length.out = 100) y <- 1.1* x + 20 + x*rnorm(100,0,0.10) reg <- PB.reg(x,y) round(reg,2)
## [1] 18.90 1.14

Just to sanity check, we can get the coefficients from the mcr() package.
library(mcr) reg.mcr <- mcreg(x,y) round([email protected],2)
## [1] 18.90 1.14
Ok, looks good.
Rcpp
The Rcpp package permits compiling and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at recapitulation of the R script above.
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <numeric>
#include "Rcpp.h"
using namespace std;
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::export]]
vector <double> PB(vector<double> x,vector<double> y){
int lx=x.size();
int l=lx*(lx-1)/2;
int k=-1;
vector <double> S(l,numeric_limits<double>::quiet_NaN());
vector <double> sortS;
vector <double> neg;
double b;
double a;
vector <double> aVec(x.size());
vector <double> coef(2);
// Calculate the components of the S Matrix
for(int i=0; i<lx-1;i++){
for(int j=i+1; j<lx;j++){
k+=1;
S[k]=(y[i]-y[j])/(x[i]-x[j]);
}
}
// Sort the S Matrix
sort(S.begin(),S.end());
// Delete any undefined values from S
for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
if (!isnan(*it)) sortS.push_back((*it));
}
// Put all the negative values into a vector called neg
for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
if ((*it)<0) neg.push_back((*it));
}
// Calculate N and K
int K=int(floor(float(neg.size()/2)));
int N=sortS.size();
// Calculate the slope
if (N%2==1) {
b=S[static_cast<int> ((N+1)/2+K-1)];
} else {
b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);
}
// Make a vector aVec of the estimates of the intercept
for(int i=0; i<x.size(); i++){
aVec[i]=y[i]-b*x[i];
}
sort(aVec.begin(),aVec.end());
// Calculate the median from the sorted values of aVec
if(x.size()%2==1){
a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];
}else{
a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;
}
// Report results
coef[0]=a;
coef[1]=b;
return coef;
}
int main(){
return 0;
}Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:
library(Rcpp)
sourceCpp("PB_Reg.cpp")And run it:
round(PB(x,y),2)
## [1] 18.90 1.14
Looks good! But is it faster? Like really faster? Let’s find out. And let’s also compare it to what the compiler() package produces:
library(compiler) PB.reg.cmpfun <- cmpfun(PB.reg) library(rbenchmark) bmark <- benchmark( Rscript = PB.reg(x,y), `Compiler Package` = PB.reg.cmpfun(x,y), Rcpp = PB(x,y), replications = 100, order = "relative" ) knitr::kable(bmark)
| test | replications | elapsed | relative | user.self | sys.self | user.child | sys.child | |
|---|---|---|---|---|---|---|---|---|
| 3 | Rcpp | 100 | 0.024 | 1.000 | 0.024 | 0.000 | 0 | 0 |
| 2 | Compiler Package | 100 | 0.214 | 8.917 | 0.214 | 0.000 | 0 | 0 |
| 1 | Rscript | 100 | 0.217 | 9.042 | 0.213 | 0.004 | 0 | 0 |
So this makes C++ code about 9x faster than R source code. Not bad but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler() package did not help much in this case as it is only marginally better than raw R source. I have seen better results from compiler() in other cases.
Final thought
Speaking of speed:
“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”
Proverbs 6:5
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.