I’ve hinted this was coming a few times before, but with today’s press release the announcement is official: the next release of Revolution R Enterprise will include "Big Data" capabilities thanks to the new RevoScaleR package. We’re pretty excited about how it’s turned out: it’s kinda amazing to be able to use R’s formula syntax like this:
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime))
and be able to do a regression on 120+ million rows (more than 13 GB) of data in just a few seconds using an ordinary laptop. With the more powerful multicore machines now on the market, the parallel processing algorithms in RevoScaleR really scream. You can see some of the details of how the RevoScaleR package works in the white paper "Big Data Analysis with Revolution R Enterprise", and I’ll be giving a presentation about the new Big Data capabilities of Revolution R Enterprise in a live webcast on August 25.
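If you're wondering how a regression on data bigger than memory is even possible, here is a rough sketch of the general idea in plain base R. To be clear, this is not RevoScaleR code and makes no claim about how RevoScaleR works internally; it just illustrates one well-known approach, where you read the data a chunk at a time, accumulate the cross-products X'X and X'y, and solve for the coefficients at the end. The function name `chunked_lm`, the `DepHour` column, and the simulated data are all made up for illustration.

```r
# Sketch only: chunked least squares in base R (hypothetical helper,
# NOT part of RevoScaleR). Each chunk is a data frame small enough to
# fit in memory; only the small p x p and p x 1 sums are kept around.
chunked_lm <- function(formula, chunks) {
  xtx <- NULL  # running sum of X'X
  xty <- NULL  # running sum of X'y
  for (chunk in chunks) {
    mf <- model.frame(formula, chunk)
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, y)
    }
  }
  # Solve the normal equations (X'X) b = X'y for the coefficients
  drop(solve(xtx, xty))
}

# Simulated stand-in data; the real airline data set is far larger
set.seed(1)
n <- 1000
flights <- data.frame(
  DayOfWeek = factor(sample(c("Mon", "Tue", "Wed"), n, replace = TRUE)),
  DepHour = runif(n, 0, 24)
)
flights$ArrDelay <- 2 + 0.5 * flights$DepHour + rnorm(n)

# Pretend each block of 100 rows is a chunk read from disk
chunks <- split(flights, rep(1:10, each = n / 10))

coef_chunked <- chunked_lm(ArrDelay ~ DayOfWeek + DepHour, chunks)
coef_full <- coef(lm(ArrDelay ~ DayOfWeek + DepHour, data = flights))
print(all.equal(coef_chunked, coef_full, tolerance = 1e-6))
```

Because the per-chunk sums are independent of one another, this kind of computation also lends itself to being spread across cores. One caveat with the sketch: every chunk needs to carry the same factor levels, which `split()` takes care of here.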
One aspect of the announcement that really seems to have generated attention is how you can use Revolution R Enterprise with Hadoop to process super-massive data sets. (As I write this, before the press release is even on the wires, this article by Dave Rosenberg at CNET has already been retweeted over 100 times.) Hadoop and RevoScaleR complement each other well: like a freight train, Hadoop can do the heavy lifting of preprocessing a distributed data set to get it ready for statistical analysis, and then, like a race car, RevoScaleR fits the statistical model. We’ll be coming out very soon with a white paper authored by Saptarshi Guha (author of the Rhipe integration between Hadoop and R) demonstrating how he used Hadoop to extract individual conversations from packet-level VOIP data, and then used RevoScaleR to perform a regression analysis on those calls. We’ll have more information about that analysis here on the blog in the next couple of weeks.
Revolution Analytics: Revolutionary New Levels of Performance and Scalability for Big Data Analysis