Why RcppDynProg is Written in C++
The (matter of opinion) claim:
“When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]”
(source discussed here)
got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough to use something as complicated as C++ correctly?
RcppDynProg implements a nifty, concise dynamic programming solution to a segmentation problem. It can automatically partition graphs such as the following:
[figure: original graph]
into the following:
[figure: partitioned graph]
(details found here).
But is the package really using C++ in any significant way? The implementation is just the usual sort of index chasing needed to fill in a dynamic programming table. Looking at it superficially, the package is not doing anything deep or really using any C++ libraries in a fundamentally interesting manner.
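To make "index chasing" concrete, here is a minimal standalone sketch of a dynamic program for a segmentation problem. It is not the RcppDynProg source. The function name `segment_starts`, the piecewise-constant cost (sum of squared deviations from each segment's mean), and the per-segment `penalty` are all assumptions chosen to keep the example short, and the package's actual cost model may differ. Every table access goes through `std::vector::at()`.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from RcppDynProg): returns the start index of each
// segment in a minimum-cost partition of y into contiguous segments.
// Assumed cost of a segment: sum of squared deviations from its mean, plus `penalty`.
std::vector<std::size_t> segment_starts(const std::vector<double>& y, double penalty) {
    const std::size_t n = y.size();
    // Prefix sums of y and y^2 give any segment's cost in O(1).
    std::vector<double> s1(n + 1, 0.0), s2(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        s1.at(i + 1) = s1.at(i) + y.at(i);
        s2.at(i + 1) = s2.at(i) + y.at(i) * y.at(i);
    }
    // best.at(j): minimum cost of partitioning the first j points.
    // prev.at(j): start index of the last segment in that best partition.
    std::vector<double> best(n + 1, std::numeric_limits<double>::infinity());
    std::vector<std::size_t> prev(n + 1, 0);
    best.at(0) = 0.0;
    for (std::size_t j = 1; j <= n; ++j) {
        for (std::size_t i = 0; i < j; ++i) {
            const double sum = s1.at(j) - s1.at(i);
            const double len = static_cast<double>(j - i);
            const double sse = (s2.at(j) - s2.at(i)) - sum * sum / len;
            const double cost = best.at(i) + sse + penalty;
            if (cost < best.at(j)) {
                best.at(j) = cost;
                prev.at(j) = i;
            }
        }
    }
    // Walk the back-pointers from the end, then reverse into left-to-right order.
    std::vector<std::size_t> starts;
    for (std::size_t j = n; j > 0; j = prev.at(j)) {
        starts.push_back(prev.at(j));
    }
    return std::vector<std::size_t>(starts.rbegin(), starts.rend());
}
```

For the input `{1, 1, 1, 5, 5, 5}` with a penalty of `0.5`, this sketch splits the data into two segments starting at indices 0 and 3. Nearly every line is an index into `y`, `s1`, `s2`, `best`, or `prev`, which is exactly where an off-by-one error would otherwise write to the wrong place.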
But then it hit me: the package is indexing into arrays. With native C pointer types we would not have any bounds checking on the indexing. With the C++ classes we get bounds checking. This may seem like a small thing, but it is huge. With C pointer types, an out-of-bounds index when writing a value may corrupt memory, and that can have fairly unbounded consequences. With C++ an out-of-bounds indexing error causes an exception; code that executes without an exception is then a proof that the execution didn’t attempt out-of-bounds indexing.
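The same contrast can be seen in plain standard C++, without Rcpp. The sketch below is only a stand-in: it uses `std::vector::at()`, which throws `std::out_of_range` on a bad index, in place of the Rcpp classes the package actually uses, and the helper name `checked_write` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: write `value` at position `i` with bounds checking.
// Returns true if the write happened, false if the index was out of bounds.
bool checked_write(std::vector<double>& x, std::size_t i, double value) {
    try {
        x.at(i) = value;  // at() throws std::out_of_range when i >= x.size()
        return true;
    } catch (const std::out_of_range&) {
        return false;     // x is left untouched; nothing was overwritten
    }
}
// By contrast, `x[i] = value` with i >= x.size() is undefined behavior:
// there is no check, and the write may silently corrupt memory.
```

Calling `checked_write(x, 12, 5.0)` on a two-element vector returns `false` and leaves the vector unchanged, whereas the unchecked form gives no signal at all that something went wrong.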
So RcppDynProg is using C++ in a significant way: it is using it for safety guarantees on array indexing. R users expect safety guarantees on array indexing, as it is a service R supplies. So an extension package that incorporates index bounds checking can be “more R like.” This simple point makes me think many “doesn’t seem to be using C++ in any deep way” packages are also acquiring deep benefits in using C++.
Are there risks in using something as involved as a combination of R, C++, and Rcpp all at the same time for a small new project?
Yes.
But I have tried to mitigate them. I have not used new/delete (using only stack-allocated C++ objects), have used reference arguments (to try to minimize object construction/destruction), have not defined classes with non-trivial destructors, have not knowingly called back into R functions (though I am using some Rcpp adapted data structures), and have generally tried to stay in a generic tame sub-dialect of C++.
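A minimal sketch of what such a tame sub-dialect can look like in standard C++, again without Rcpp. The function `running_mean` is a hypothetical example, not package code. The argument arrives by const reference so it is not copied, the working storage is a local `std::vector` object that releases its buffer automatically when it goes out of scope, and there is no `new` or `delete` anywhere.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: cumulative means of x, i.e. out[i] = mean of x[0..i].
std::vector<double> running_mean(const std::vector<double>& x) {  // const reference: x is not copied
    std::vector<double> out(x.size(), 0.0);  // local object; cleaned up automatically
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        total += x.at(i);                                // checked read
        out.at(i) = total / static_cast<double>(i + 1);  // checked write
    }
    return out;  // returned by value; no new/delete, no user-defined destructor
}
```

Because nothing is allocated by hand, there is nothing to leak on any exit path, including the path where a checked access throws.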
I would be happy to incorporate any polite critiques/improvements of the C++ code (found here). If there is something that is obviously wrong to an expert, I would be happy to move to what is obviously right to the experts. (Frankly the thing that most concerns me is: correctly modeling class lifetime and interaction-with/protection-from R’s garbage collector. I think I coded in a style that allows Rcpp to control these issues correctly, but I may stand to be corrected.)
Note: C++ structures such as NumericVector do in fact index bounds check if you use () notation instead of [] notation. RcppDynProg tries to use () throughout to get the index bounds checking. Below is a quick example of the difference.
    library("Rcpp")

    f_good <- cppFunction('NumericVector oob(NumericVector x) {
      int n = x.size();
      if(n>0) {
        x(0) = 5.0; // in bounds
      }
      return x;
    }')
    f_good(c(1, 2))
    # [1] 5 2

    f_bad1 <- cppFunction('NumericVector oob(NumericVector x) {
      int n = x.size();
      x(n+10) = 5.0; // out of bounds, checked
      return x;
    }')
    f_bad1(c(1, 2))
    # Error in f_bad1(c(1, 2)) : Index out of bounds: [index=12; extent=2].

    f_bad2 <- cppFunction('NumericVector oob(NumericVector x) {
      int n = x.size();
      x[n+10] = 5.0; // out of bounds, not checked- memory corruption
      return x;
    }')
    f_bad2(c(1, 2))
    # [1] 1 2
    # and R crashes out