The RevoScaleR package isn’t open source, but it is free for academic users.
Collecting and storing data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package is part of Revolution R Enterprise. It provides tools for data management and data analysis, uses multiple cores, and should scale.
Scalability
What is scalability? It means handling everything from a small in-memory data.frame to multi-terabyte data sets distributed across space and even time. The key to solving this problem is being able to process more data than can fit into memory at a single time, so the data is processed in chunks.
Two main problems: capacity (memory problems) and speed (too slow). Most commonly used statistical software tools can’t handle large data. We still think in terms of “small data sets”.
High performance analytics = HPC + Data
- HPC (high performance computing) is CPU centric. Lots of processing on small amounts of data.
- HPA (high performance analytics) is data centric. Less processing per amount of data. Needs efficient threading and data management. Key to this is data chunking.
Example
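- Initialization task: total = 0, count = 0;
- Process data tasks: for each block of x, compute the block's own total = sum(x) and count = length(x);
- Update results: add each block's total and count to the running total and count;
- Process results: for example, mean = total / count.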
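The four steps above can be sketched in base R. This is only an illustration of the chunking idea on an in-memory vector, not RevoScaleR code: the function name chunked_mean and the block_size argument are made up for this example, and the final result is assumed to be the mean.

```r
# Minimal sketch of a chunked mean (hypothetical helper, base R only).
chunked_mean <- function(x, block_size = 1000L) {
  if (length(x) == 0L) return(NaN)  # nothing to process

  # Initialization task
  total <- 0
  count <- 0

  # Process data task: look at one block of x at a time
  starts <- seq(1L, length(x), by = block_size)
  for (start in starts) {
    end <- min(start + block_size - 1L, length(x))
    block <- x[start:end]
    block_total <- sum(block)
    block_count <- length(block)

    # Update results: fold the block's partial results into the running state
    total <- total + block_total
    count <- count + block_count
  }

  # Process results
  total / count
}

x <- as.numeric(1:10)
chunked_mean(x, block_size = 3L)  # 5.5, the same as mean(x)
```

Only total and count are kept between blocks, so the memory needed does not grow with the size of the data.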
ScaleR
ScaleR can process data from a variety of formats. It uses its own optimized format (XDF), which is suitable for chunking. XDF format:
- data is stored in blocks of rows
- header is at the end
- allows sequential reads
- essentially unlimited in size
- efficient disk space usage
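To see why row blocks and sequential reads matter, here is a rough base R sketch that reads a file a few rows at a time through an open connection. It uses a plain CSV file as a stand-in, not the XDF format or any RevoScaleR function, and the helper name block_mean_from_file, the column name x, and the block size are all assumptions made for the example.

```r
# Write a tiny CSV to stand in for a large on-disk data set (made-up data).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = as.numeric(1:10)), path, row.names = FALSE)

# Hypothetical helper: mean of one column, reading rows_per_block rows at a time.
block_mean_from_file <- function(path, column, rows_per_block = 4L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once; the connection then stays positioned at the data
  header <- gsub('"', "", strsplit(readLines(con, n = 1L), ",")[[1]])

  total <- 0
  count <- 0
  repeat {
    # Sequential read of the next block of rows
    lines <- readLines(con, n = rows_per_block)
    if (length(lines) == 0L) break
    block <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(block[[column]])
    count <- count + nrow(block)
  }
  total / count
}

block_mean_from_file(path, "x", rows_per_block = 4L)  # 5.5
```

At any moment only one block of rows is held in memory, which is the property the XDF layout is described as being built around.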