Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)

[This article was first published on Why? » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The RevoScaleR package isn’t open source, but it is free for academic users.

Collect and storing data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package is part of the revolution R Enterprise. This package provides data management and data analysis. Uses multiple cores and should scale.

Scalability

What is scalability – from small in-memory data.frame to multi-terabyte data sets distributed across space and even time.  Key to solving this problem is being able to process more data than can fit into the memory at a single time. Data is processed in chunks.

Two main problems: capacity (memory problems) and speed (too slow). Most commonly used statistical software tools can’t handle large data. We still think in terms of “small data sets”.

High performance analytics = HPC + Data

  • HPC is CPU centric. Lot’s of processing on small amounts of data.
  • HPA is data centric. Less processing per amount of data. Needs efficient threading and data management. Key to this is data chunking
Revolutions approach this problem by having a set of R functions (written in C++). Try to keep things familiar. Analysis tools should work on small and large problems. The outputs should be standard R objects. Sample code for logistic regression looks very similar to standard R functions. To run the logistic function on a cluster, just change the “compute context” – a simple function call.
External memory applications allow automatic parallelisation. They split a job into tasks that operate on separate blocks data. Parallel algorithms split the task into separate jobs that can be run together – I think.

Example

  • Initialization task: total = 0, count = 0;
  • Process data tasks: for each block of x, total =sum(x), count = length(x);
  • Update results: combine total and count;
  • Process results.

ScaleR

ScaleR can process data from a variety of formats. It uses it’s own optimized format (XDF) that is suitable for chunking. XDF format:

  • data is stored in blocks of rows
  • header is at the end
  • allows sequential reds
  • essentially unlimited in size
  • Efficient desk space usage.
Airline example: Results seem impressive and scale well. Compared to SAS it seems to do very well.

To leave a comment for the author, please follow the link and comment on their blog: Why? » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)