The Workflow of Infinite Shame, and other stories from the R Summit
At day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternate R interpreters.
Luke Tierney of the University of Iowa, the author of the compiler package, and R-Core member who has been working on R’s performance since, well, pretty much since R was created, talked about future improvements to R’s internals.
Plans to improve R's performance include implementing proper reference counting (that is, tracking how many variables point at a particular bit of memory; the current version only distinguishes zero, one, or two-or-more references, and a more accurate count means you can do less copying). Improving scalar performance and reducing function overhead are also high priorities. Currently, when you do something like
for(i in 1:100000000) {}
R will allocate a vector of length 100000000, which takes a ridiculous amount of memory. By being smart and realising that you only ever need one number at a time, you can store the vector much more efficiently. The same principle applies to seq_len and seq_along.
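As a rough illustration of the cost (my own example, on an R version that materialises the sequence eagerly, i.e. before compact ALTREP sequences existed):

# The index vector alone costs about 4 bytes per element when it is
# fully allocated, even though the loop only ever needs one value of i.
n <- 1e8
print(object.size(1:n))    # roughly 400 MB on older versions of R

# A smarter interpreter can recognise this pattern and skip the allocation.
for(i in seq_len(n)) {}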
Other possible performance improvements that Luke discussed include having a more efficient data structure for environments, and storing the results of complex objects like model results more efficiently. (How often do you use that qr element in an lm model anyway?)
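To see what he means, it's worth poking at how much of a fitted model object is made up of components most analyses never touch:

# How much of a fitted lm object is bookkeeping?
fit <- lm(mpg ~ wt + hp, data = mtcars)
names(fit)              # includes "qr", "model", "residuals", ...
object.size(fit)        # the whole fitted object
object.size(fit$qr)     # the stored QR decomposition
object.size(coef(fit))  # the bit most people actually use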
Tomas Kalibera of Northeastern University has been working on a tool for finding PROTECT bugs in the R internals code. I last spoke to Tomas in 2013 when he was working with Jan Vitek on the alternate R engine, FastR. See Fearsome Engines part 1, part 2, part 3. Since then FastR has become a purely Oracle project (more on that in a moment), and the Purdue University fork of FastR has been retired.
The exact details of the PROTECT macro went a little over my head, but the essence is that it is used to stop memory being overwritten, and it's a huge source of obscure bugs in R.
Lukas Stadler of Oracle Labs is the heir to the FastR throne. The Oracle Labs team have rebuilt it on top of Truffle, an Oracle product for generating dynamically optimized Java bytecode that can then be run on the JVM. Truffle's big trick is that it can auto-generate this bytecode for a variety of languages: R, Ruby and JavaScript are the officially supported languages, with C, Python and Smalltalk as side projects. Lukas claimed that peak performance (that is, "on a good day") for Truffle-generated code is comparable to language-specific optimized code.
Non-vectorised code is the main beneficiary of the speedup. He had a cool demo where a loopy version of the sum function ran slowly, then Truffle learned how to optimise it, and the result became almost as fast as the built-in sum function.
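The demo was along these lines (my reconstruction of the style of example, not Lukas's actual code):

# A loopy sum: exactly the kind of non-vectorised code a JIT helps with.
loopy_sum <- function(x) {
  total <- 0
  for(i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}

x <- runif(1e7)
system.time(loopy_sum(x))  # slow in a plain interpreter
system.time(sum(x))        # the vectorised built-in, for comparison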
He has a complaint that the R.h API from R to C is really an API from GNU R to C; that is, it makes too many assumptions about how GNU R works, and these don't hold true when you are running a Java version of R.
Maarten-Jan Kallen from BeDataDriven works on Renjin, the other R interpreter built on top of the JVM. Based on his talk, and some other discussion with Maarten, it seems that there is a very clear mission statement for Renjin: BeDataDriven just want a version of R that runs really fast inside Google App Engine. They also cited an interesting use case for Renjin: it is currently powering software for the United Nations' humanitarian effort in Syria.
Back to the technical details, Maarten showed an example where R 3.0.0 introduced the anyNA function as a fast version of any(is.na(x)). In the case of Renjin, this isn't necessary since it works quickly anyway. (Though if Luke Tierney's plans come true, it won't be needed in GNU R soon either.)
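The two give the same answer; anyNA just avoids materialising the intermediate logical vector and can stop at the first missing value:

x <- c(rnorm(1e7), NA)

# any(is.na(x)) builds a logical vector as long as x, then scans it;
# anyNA(x) can bail out as soon as it finds a missing value.
system.time(any(is.na(x)))
system.time(anyNA(x))
identical(any(is.na(x)), anyNA(x))  # TRUE: same result, different cost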
Calling external code still remains a problem for Renjin; in particular, Rcpp and its reverse dependencies won't build for it. The spread of Rcpp, he lamented, even includes roxygen2.
Hannes Mühleisen has also been working with BeDataDriven, and gave the second half of Maarten's talk. He previously worked on integrating MonetDB with R, and has been applying his database expertise to Renjin. In the same way that a database generates a query plan to find the most efficient way of retrieving your results, Renjin now generates a query plan to find the most efficient way to evaluate your code. That means using a deferred execution system where you avoid calculating things until the last minute, and in some cases don't calculate them at all, because another calculation makes them obsolete.
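As a loose illustration of the idea in plain R (my sketch, nothing like Renjin's actual machinery), deferred execution means capturing the work to be done and only running it when a result is demanded:

# Toy sketch of deferred execution: capture an expression now,
# evaluate it only when someone actually asks for the value.
defer <- function(expr) {
  list(expr = substitute(expr), env = parent.frame())
}
force_value <- function(deferred) {
  eval(deferred$expr, deferred$env)
}

x <- rnorm(1e6)
pending <- defer(mean(x^2))  # nothing computed yet
force_value(pending)         # the computation happens here, on demand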
Karl Millar from Google has been working on CXXR. This was a bit of a shock to me. When I interviewed Andrew Runnalls in 2013, he didn't really sell CXXR to me particularly well. The project goals at the time were to clean up the GNU R code base, rewriting it in modern C++ and documenting it properly, to use as a reference implementation of R. It all seemed a bit of academic fun rather than anything useful. Since Google started working on the project, the focus has changed: it is now all about having a high-performance version of R.
I asked Andrew why he chose CXXR for this purpose. After all, of the half a dozen alternate R engines, CXXR was the only one that didn't explicitly have performance as a goal. His response was that it has nearly 100% code compatibility with GNU R, and that the code is so clear that it makes it easy to make changes.
That talk focussed on some of the difficulties of optimizing R code. For example, in the assignment
a <- b + c
you don't know how long b and c are, or what their classes are, so you have to spend a long time looking things up. At runtime, however, you can guess a bit better: b and c are probably the same size and class as they were last time, so guess that first.
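A crude way to picture that guess-first strategy in plain R (purely illustrative; CXXR does this at a much lower level):

# Remember the classes seen on the previous call, and only fall back to
# the general, slower path when the guess turns out to be wrong.
last_classes <- NULL
speculative_add <- function(b, c) {
  current <- c(class(b)[1], class(c)[1])
  if (!is.null(last_classes) && identical(current, last_classes)) {
    return(b + c)            # fast path: same types as last time
  }
  last_classes <<- current   # slow path: record what we saw for next time
  b + c
}

speculative_add(1:5, 6:10)   # slow path, records c("integer", "integer")
speculative_add(2:6, 7:11)   # fast path on this call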
He also had a little dig at Tomas Kalibera's work, saying that CXXR has managed to eliminate almost all the PROTECT macros in its codebase.
Radford Neal talked about some optimizations in his pqR project, which uses both interpreted and byte-compiled code.
In interpreted code, pqR uses a “variant result” mechanism, which sounded similar to Renjin’s “deferred execution”.
A performance boost comes from having a fast interface to unary primitives; this also makes eval faster. Another boost comes from a smarter way to avoid looking for variables in certain frames. For example, a list of which frames contain overrides for special symbols (+, [, if, etc.) is maintained, so calling them is faster.
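The lookup is needed in the first place because R lets any frame rebind those symbols, as a quick demonstration shows:

# Any function can locally override `+`, `[`, `if` and friends,
# so a naive interpreter has to check every enclosing frame.
sneaky <- function() {
  `+` <- function(a, b) a - b  # local override of addition
  1 + 2
}
sneaky()  # -1, because the local `+` wins inside the function
1 + 2     # 3, the global binding is untouched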
Matt Dowle (“the data.table guy”) of H2O gave a nice demo of H2O Flow, a slick web-based GUI for machine learning on big datasets. It does lots of things in parallel, and is scriptable.
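The scriptable side is exposed through the h2o R package; something along these lines (my own minimal sketch, not the demo code, with function names from the h2o package as I remember them):

# Start or connect to a local H2O cluster and fit a model from R.
library(h2o)
h2o.init()
iris_h2o <- as.h2o(iris)        # push a data frame into the cluster
model <- h2o.gbm(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",
  training_frame = iris_h2o
)
h2o.predict(model, iris_h2o)    # predictions computed in the cluster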
Indrajit Roy of HP Labs and Michael Lawrence of Genentech gave a talk on distributed data structures. These seem very good for cases where you need to access your data from multiple machines.
The SparkR package gives access to distributed data structures with a Spark backend; however, Indrajit wasn't keen, saying that it is too low-level to be easy to work with.
Instead he and Michael have developed the dds package, which gives a standard interface for using distributed data structures. The package sits on top of Spark.dds and distributedR.dds. The analogy is with DBI providing a standard database interface that uses RSQLite or RPostgreSQL underneath.
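If the DBI analogy isn't familiar, it's the pattern where one front-end package defines the verbs and interchangeable backends do the work:

# The DBI pattern that dds imitates: a generic front end with the
# backend chosen at connection time (swap in RPostgreSQL the same way).
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)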
Ryan Hafen of Tessera talked about their product (I think also called Tessera) for analysing large datasets. It's a fancy wrapper to MapReduce that also has distributed data objects. I didn't get a chance to ask if they support the dds interface. The R packages of interest are datadr and trelliscope.
My own talk was less technical than the others today. It consisted of a series of rants about things I don’t like about R, and how to fix them. The topics included how to remove quirks from the R language (please deprecate indexing with factors), helping new users (let the R community create some vignettes to go in base-R), and how to improve CRAN (CRAN is not a code hosting service, CRAN is a shop for packages). I don’t know if any of my suggestions will be taken up, but my last slide seemed to generate some empathy.
I’ve named my CRAN submission process “The Workflow of Infinite Shame”. What tends to happen is that I check that things work on my machine, submit to CRAN, and about an hour later get a response saying “we see these errors, please fix”. Quite often, especially for things involving locales or writing files, I cannot reproduce the issue, so I fiddle about a bit and guess, then resubmit. After five or six iterations, I’ve lost all sense of dignity, and while R-core are very patient, I’m sure they assume that I’m an idiot.
CRAN currently includes the Win-builder service, which lets you submit a package, builds it under Windows, and then tells you the results. What I want is an everything-builder service that builds and checks my package on all the necessary platforms (Windows, OS X, Linux, BSD, Solaris on R-release, R-patched, and R-devel), and only if it passes does a member of R-core get to see the submission. That way, R-core's time isn't wasted, and, more importantly, I look like less of an idiot.
Tagged: r, r-summit