Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)

August 17, 2011

The RevoScaleR package isn’t open source, but it is free for academic users.

Our ability to collect and store data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package, part of Revolution R Enterprise, provides data management and data analysis tools. It uses multiple cores and should scale.

Scalability

What is scalability? It ranges from a small in-memory data.frame to multi-terabyte data sets distributed across space and even time. The key to solving this problem is being able to process more data than can fit into memory at any one time: data is processed in chunks.

Two main problems: capacity (the data does not fit into memory) and speed (the computation takes too long). Most commonly used statistical software tools can’t handle large data, and we still think in terms of “small data sets”.

High performance analytics = HPC + Data

  • HPC is CPU centric: lots of processing on small amounts of data.
  • HPA is data centric: less processing per amount of data, so it needs efficient threading and data management. Key to this is data chunking.

Revolution’s approach to this problem is a set of R functions (written in C++). They try to keep things familiar: the analysis tools work on both small and large problems, and the outputs are standard R objects. Sample code for logistic regression looks very similar to the standard R functions. To run the logistic regression on a cluster, you just change the “compute context” – a simple function call (see the sketch below).

External-memory applications allow automatic parallelisation: they split a job into tasks that operate on separate blocks of data. Parallel algorithms split the task into separate jobs that can be run together – I think.
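
Here is a minimal sketch of that pattern. rxLogit() and rxSetComputeContext() are RevoScaleR’s documented functions, but the variable names, file name, and cluster context below are illustrative assumptions, and exact arguments may differ by version.

    # Minimal sketch, assuming RevoScaleR is installed and that an
    # "airline.xdf" file containing the illustrative variables exists.
    library(RevoScaleR)

    # Fit a logistic regression chunk by chunk over an XDF file.
    # The call looks much like glm(), but the data never needs to fit in RAM.
    fit <- rxLogit(ArrDel15 ~ DayOfWeek, data = "airline.xdf")
    summary(fit)

    # To run the same model on a cluster, only the compute context changes:
    # ctx <- RxHpcServer(...)      # hypothetical cluster context object
    # rxSetComputeContext(ctx)
    # fit <- rxLogit(ArrDel15 ~ DayOfWeek, data = "airline.xdf")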

Example

  • Initialization task: total = 0, count = 0;
  • Process data tasks: for each block x, compute total = sum(x) and count = length(x);
  • Update results: combine the per-block totals and counts;
  • Process results: e.g. the mean is total / count (see the sketch below).
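
A minimal base-R sketch of this pattern, assuming a text file with one numeric value per line after a header row (the file layout and block size are assumptions). It runs serially here; in a parallel run, each block’s total and count would be computed independently and combined at the end.

    # Minimal sketch: compute a mean over a file too big for memory by
    # reading and processing one block of rows at a time.
    chunked_mean <- function(file, block_size = 1e6) {
      con <- file(file, open = "r")
      on.exit(close(con))
      invisible(readLines(con, n = 1))          # skip the header row
      total <- 0                                # initialization task
      count <- 0
      repeat {
        lines <- readLines(con, n = block_size) # one block of rows
        if (length(lines) == 0) break
        x <- as.numeric(lines)
        total <- total + sum(x, na.rm = TRUE)   # process-data task + update
        count <- count + sum(!is.na(x))
      }
      total / count                             # process results
    }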

ScaleR

ScaleR can process data from a variety of formats. It uses its own optimized format (XDF) that is suitable for chunking. The XDF format:

  • stores data in blocks of rows;
  • puts the header at the end of the file;
  • allows sequential reads;
  • is essentially unlimited in size;
  • uses disk space efficiently.
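
As a sketch of that workflow, assuming the documented rxImport() and rxSummary() functions (file names and the formula are illustrative):

    # Minimal sketch: convert a CSV to XDF once, then analyse it in chunks.
    library(RevoScaleR)

    rxImport(inData = "airline.csv",   # source text file (assumed name)
             outFile = "airline.xdf",  # blocked, binary XDF output
             overwrite = TRUE)

    # Later analyses stream over the XDF blocks instead of loading
    # everything into memory:
    rxSummary(~ ArrDelay + DayOfWeek, data = "airline.xdf")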

Airline example: the results seem impressive and scale well; compared to SAS, it seems to do very well.
