by Joseph Rickert
The following is a brief report of all things R encountered in my not quite random, but nevertheless far from determined, walk through the O'Reilly Strata / Hadoop World Conference held this week in NYC. To start off, I had the pleasure of doing a 9:00 AM Monday morning joint tutorial with Antonio Piccolboni, the principal developer of RHadoop, on “Using R and Hadoop for Statistical Computation at Scale”. Antonio began with a no-nonsense presentation of detailed examples showing how to use the functions of the rmr2 and plyrmr packages to write map-reduce jobs. Antonio worked hard for a full two hours going line-by-line through code that performs data reduction and manipulation, cross tabulations and other practical tasks. The audience of 120-plus people stayed right there with Antonio the whole time. The slides, including all of the code from Antonio’s tutorial, are available on the conference website.
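To give a flavor of the rmr2 style Antonio covered, here is a minimal sketch (my own, not code from the tutorial) of a simple cross tabulation written as a map-reduce job. It assumes the rmr2 package is installed; setting the backend to "local" lets the same code run without a Hadoop cluster for testing.

```r
library(rmr2)

# Use the local backend so no Hadoop cluster is needed while testing;
# dropping this line runs the same job on Hadoop
rmr.options(backend = "local")

# Write a small vector to the (local stand-in for the) distributed file system
input <- to.dfs(1:1000)

# A map-reduce job: the map emits (value mod 10, 1) pairs,
# and the reduce sums the counts per key -- a simple tabulation
result <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(v %% 10, 1),
  reduce = function(k, vv) keyval(k, sum(vv))
)

# Pull the key/value results back into the R session
from.dfs(result)
```

The appeal of the approach is that the map and reduce steps are ordinary R functions, so the same mental model carries over from interactive R work to jobs that run across a cluster.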
For my part, I explained how the parallel external memory algorithms in the newly announced V7 version of Revolution R Enterprise run directly on Hadoop. (My slides are available here.) I have previously posted code giving a preliminary look at how you can take an algorithm such as rxLogit, a high-performance implementation of logistic regression, from running on a PC to running on a remote Hadoop cluster just by executing a few lines of R code that point to the cluster as a new “compute context”. Nevertheless, seeing things run live makes an impact.
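The compute-context idea can be sketched in a few lines. This is an illustrative outline only: the file names, credentials, and host names below are placeholders, and the exact arguments to RxHadoopMR vary by installation.

```r
library(RevoScaleR)

# Fit a logistic regression locally first
# ("loans.xdf" and the model variables are hypothetical)
local_fit <- rxLogit(default ~ balance + income, data = "loans.xdf")

# Point RevoScaleR at a Hadoop cluster as the new compute context;
# user name and host names here are placeholders
hadoop_cc <- RxHadoopMR(
  sshUsername = "analyst",
  sshHostname = "cluster-head",
  nameNode    = "cluster-head"
)
rxSetComputeContext(hadoop_cc)

# The same modeling call now runs on the cluster,
# reading the data from HDFS
hdfs_fit <- rxLogit(
  default ~ balance + income,
  data = RxTextData("/data/loans", fileSystem = RxHdfsFileSystem())
)
```

The point is that the modeling call itself is unchanged; only the compute context and the data source move from the PC to the cluster.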
So, why two approaches to running R in Hadoop? Well, why not? A great strength of R is that developers often simultaneously explore alternate approaches to adding some new capability. In this case, I can imagine that data scientists, comfortable with writing map-reduce jobs, might appreciate the option to use plyrmr’s functions to write custom data transformations. On the other hand, statisticians now have the ability to build GLMs, decision trees and other models directly on large samples stored in the HDFS file system of a Hadoop cluster without having to know much about Hadoop at all.
With the dominant focus of the conference being on Hadoop, I didn’t really expect to hear much talk about R. This turned out not to be the case. R has some serious mindshare among the hardcore Hadoop crowd, both in hallway conversations and in the presentations. For example, in his talk “Hadoop & Data Science for the Enterprise, 30 Tips and Tricks”, Mark Slusar distills a career’s worth of nitty-gritty Hadoop experience into 30 pointed recommendations. In tip number 12 he says straight out: “Use and Learn R packages (they are) huge time-savers”. There is no equivocating here. R is just something you need to know.
Then, in a plenary session talk intended to be provocative, “Beyond R and Ph.D.s: The Mythology of Data Science Debunked”, Douglas Merrill used R as a stand-in for the quantitative aspects of data science itself. If nothing else, this serves to establish R’s ubiquity and name recognition in the data science and Hadoop community.
Finally, I had the opportunity to see an R-driven demo of the open source H2O software from 0xdata (pronounced hexadata) that is making a bit of a splash. 0xdata is a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms (including GBM) in Java to run on Hadoop. The Java algorithms may be accessed by running functions from the h2o R package, a wrapper that makes JSON calls to an instance of the H2O software that must be running concurrently with R. Both the R package and the H2O software may be downloaded from the 0xdata site.
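A session with the h2o package looks roughly like the sketch below. This is an illustrative outline, not the demo I saw: the file and column names are placeholders, and argument names have varied across h2o package versions.

```r
library(h2o)

# Start (or connect to) an H2O instance running alongside R;
# the package communicates with it via JSON calls
h2o.init()

# Load a local CSV into H2O as a distributed frame
# ("prostate.csv" is a placeholder file name)
df <- h2o.importFile("prostate.csv")

# Fit a gradient boosting machine on the H2O side; the predictor
# and response column names here are hypothetical
fit <- h2o.gbm(x = c("AGE", "PSA"), y = "CAPSULE", training_frame = df)

# Shut down the H2O instance when done
h2o.shutdown()
```

The division of labor is worth noting: the data and the model fitting live in the Java H2O process, while R acts as the front end, which is what lets the algorithms scale to Hadoop-sized data.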