Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)

August 17, 2011

The RevoScaleR package isn’t open source, but it is free for academic users.

Our ability to collect and store data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package, part of Revolution R Enterprise, provides data management and data analysis tools. It uses multiple cores and should scale.

Scalability

What is scalability? It ranges from a small in-memory data.frame to multi-terabyte data sets distributed across space and even time. The key to solving this problem is being able to process more data than can fit into memory at any one time: data is processed in chunks.

Two main problems: capacity (the data does not fit into memory) and speed (the computation takes too long). Most commonly used statistical software tools can’t handle large data, and we still think in terms of “small data sets”.

High performance analytics = HPC + Data

  • HPC is CPU centric: lots of processing on small amounts of data.
  • HPA is data centric: less processing per amount of data, so it needs efficient threading and data management. Key to this is data chunking.

Revolution’s approach to this problem is a set of R functions (written in C++). They try to keep things familiar: the analysis tools work on both small and large problems, and the outputs are standard R objects. Sample code for logistic regression looks very similar to the standard R functions. To run the logistic regression on a cluster, you just change the “compute context” – a simple function call (see the sketch below).

External-memory applications allow automatic parallelisation: they split a job into tasks that operate on separate blocks of data. Parallel algorithms split the task into separate jobs that can be run together – I think.
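
Here is a minimal sketch of that pattern. rxLogit() and rxSetComputeContext() are RevoScaleR’s documented functions, but the variable names, file name, and cluster context below are illustrative assumptions, and exact arguments may differ by version.

    # Minimal sketch, assuming RevoScaleR is installed and that an
    # "airline.xdf" file containing the illustrative variables exists.
    library(RevoScaleR)

    # Fit a logistic regression chunk by chunk over an XDF file.
    # The call looks much like glm(), but the data never needs to fit in RAM.
    fit <- rxLogit(ArrDel15 ~ DayOfWeek, data = "airline.xdf")
    summary(fit)

    # To run the same model on a cluster, only the compute context changes:
    # ctx <- RxHpcServer(...)      # hypothetical cluster context object
    # rxSetComputeContext(ctx)
    # fit <- rxLogit(ArrDel15 ~ DayOfWeek, data = "airline.xdf")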

Example

  • Initialization task: total = 0, count = 0;
  • Process data tasks: for each block x, compute total = sum(x) and count = length(x);
  • Update results: combine the per-block totals and counts;
  • Process results: e.g. the mean is total / count (see the sketch below).
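
A minimal base-R sketch of this pattern, assuming a text file with one numeric value per line after a header row (the file layout and block size are assumptions). It runs serially here; in a parallel run, each block’s total and count would be computed independently and combined at the end.

    # Minimal sketch: compute a mean over a file too big for memory by
    # reading and processing one block of rows at a time.
    chunked_mean <- function(file, block_size = 1e6) {
      con <- file(file, open = "r")
      on.exit(close(con))
      invisible(readLines(con, n = 1))          # skip the header row
      total <- 0                                # initialization task
      count <- 0
      repeat {
        lines <- readLines(con, n = block_size) # one block of rows
        if (length(lines) == 0) break
        x <- as.numeric(lines)
        total <- total + sum(x, na.rm = TRUE)   # process-data task + update
        count <- count + sum(!is.na(x))
      }
      total / count                             # process results
    }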

ScaleR

ScaleR can process data from a variety of formats. It uses its own optimized format (XDF) that is suitable for chunking. The XDF format:

  • stores data in blocks of rows;
  • puts the header at the end of the file;
  • allows sequential reads;
  • is essentially unlimited in size;
  • uses disk space efficiently.
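
As a sketch of that workflow, assuming the documented rxImport() and rxSummary() functions (file names and the formula are illustrative):

    # Minimal sketch: convert a CSV to XDF once, then analyse it in chunks.
    library(RevoScaleR)

    rxImport(inData = "airline.csv",   # source text file (assumed name)
             outFile = "airline.xdf",  # blocked, binary XDF output
             overwrite = TRUE)

    # Later analyses stream over the XDF blocks instead of loading
    # everything into memory:
    rxSummary(~ ArrDelay + DayOfWeek, data = "airline.xdf")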

Airline example: the results seem impressive and scale well; compared to SAS, it seems to do very well.
