In yesterday's webinar, “New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis”, Sue Ranney demonstrated the features of the RevoScaleR big data analysis package included with Revolution R Enterprise. In the webinar, she showed how to use the rxImport function to import big data sets from SAS, SPSS or ODBC, how to use the rxDataStep function to pre-process the data using R functions, and how to scale the analysis from a desktop to an entire cluster without changing code, simply by setting a new compute context.
Sue also presented a novel data analysis, based on the US birth-data files mentioned in Joe Adler's R in a Nutshell:
The natality files are gigantic: they're approximately 3.1 Gb uncompressed [and that's for just one year of data — ed]. That's a little larger than R can easily process …
(You can download the data files from the CDC.) Sue showed how to read all 22 years of data (about 70 GB of raw data) into R with RevoScaleR (yielding a 16 GB XDF file, after selecting the relevant columns in the data step), and then fit a linear model to look at the difference in male/female birth rates for different declared races.
I had no idea that there was a significant difference between boy/girl birth rates at the population level, let alone between the various sub-populations. I guess that's why I'm not a demographer, but I sure found it interesting. (If you'd like to see a more in-depth presentation about this analysis, check out Sue's presentation at useR! 2011.)
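To get a feel for why an analysis like this can run on data far larger than memory, here is a minimal sketch of the general idea of processing records one chunk at a time. It is not RevoScaleR code and not Sue's actual method: it is plain Python used only for illustration, the `(race, sex)` record layout and the function names are hypothetical, and the handful of records are made up. It relies on one simple fact: if the only predictor is a categorical variable such as declared race and the response is a 0/1 indicator for a male birth, the fitted values of a linear model are just the per-group male shares, so running counts are all that needs to be kept between chunks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (declared_race, sex), with sex either "M" or "F".
Record = Tuple[str, str]


def accumulate_counts(chunks: Iterable[List[Record]]) -> Dict[str, Dict[str, int]]:
    """Tally male and female births per race, one chunk at a time.

    Only the running counts stay in memory, never the full data set.
    """
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for chunk in chunks:
        for race, sex in chunk:
            counts[race][sex] += 1
    return {race: dict(c) for race, c in counts.items()}


def male_share(counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Proportion of male births within each race group."""
    return {race: c["M"] / (c["M"] + c["F"]) for race, c in counts.items()}


# Made-up toy data standing in for chunks read from a large file.
chunks = [
    [("A", "M"), ("A", "F"), ("B", "M")],
    [("A", "M"), ("B", "F"), ("B", "F")],
]

counts = accumulate_counts(chunks)
shares = male_share(counts)
for race in sorted(shares):
    print(race, round(shares[race], 3))
```

A model with additional terms (year of birth, say) would need more than simple counts, but the same pattern of accumulating a small summary per chunk still applies.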
Sue's slides from the webinar presentation are below, and you can also download a full replay of the presentation from the Revolution Analytics webinar archives.
Revolution Analytics Webinars: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis