Michael Kane on Bigmemory

[This article was first published on Josh Paulson's Blog » R, and kindly contributed to R-bloggers.]

Handling big data sets has always been a concern for R users. Once a data set exceeds about 50% of RAM it is considered "massive" and can become impossible to work with on a standard machine. The bigmemory project, by Michael Kane and Jay Emerson, is one approach to dealing with this class of data set. Last Monday, December 13th, the New England R Users Group warmly welcomed Michael Kane to talk about bigmemory and R.

Bigmemory is one package (of five in the bigmemory project) designed to extend R to better handle large data sets. Its core data structures are written in C++ and let R users create matrix-like objects (called big.matrix). These big.matrix objects are compatible with standard R matrices, allowing them to be used in many places a standard matrix can. The backing store for a big.matrix is a memory-mapped file, allowing it to take on sizes much larger than available RAM.
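As a minimal sketch of the idea, the code below creates a small file-backed big.matrix and indexes it like an ordinary matrix. The dimensions and file names here are illustrative, not from the talk:

```r
library(bigmemory)

# Create a 1000 x 3 file-backed big.matrix: the data live in a
# memory-mapped file on disk, not in R's heap.
x <- big.matrix(nrow = 1000, ncol = 3, type = "double", init = 0,
                backingfile = "example.bin",
                descriptorfile = "example.desc")

x[, 1] <- rnorm(1000)   # standard matrix-style indexing works
m <- mean(x[, 1])
```

Because the object is file-backed, it can be far larger than RAM, and other R processes can attach to the same backing file.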

Mike discussed the use of bigmemory on the well-known Airline on-time data, which includes over 20 years of data on roughly 120 million commercial US airline flights. The data set is roughly 12 GB in size, and given that the documentation for read.table() suggests keeping data sets to roughly 10%-20% of RAM, it is nearly impossible to work with on a standard machine. However, bigmemory allows you to read in and analyze the data without problems.

Mike also showed how bigmemory can be used with the MapReduce (or split-apply-combine) method to greatly reduce the time required by many statistical calculations. For example, to determine whether older planes suffer greater delays, you need to know how old each of the roughly 13,000 planes is. On a standard single-core system, this calculation is estimated to require nearly 9 hours; even running in parallel on 4 cores, it can take nearly 2 hours. However, using bigmemory and the split-apply-combine method, the computation takes a little over one minute!
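The pattern itself is simple; the sketch below shows it on an ordinary data frame for brevity. With bigmemory, the same grouped row indices would be used to subset a shared big.matrix, often in parallel with foreach. The column names and values here are hypothetical:

```r
# Toy flight data: one row per flight, hypothetical columns.
flights <- data.frame(TailNum  = c("N1", "N1", "N2", "N2", "N2"),
                      ArrDelay = c(10, 20, 2, 4, 30))

# Split: group the row indices by plane (cheap: no data are copied).
by.plane <- split(seq_len(nrow(flights)), flights$TailNum)

# Apply: compute the mean delay for each plane's rows.
# Combine: sapply collects the results into one named vector.
delays <- sapply(by.plane, function(rows) mean(flights$ArrDelay[rows]))
delays
# N1: 15, N2: 12
```

Splitting on indices rather than on the data is what makes this cheap at scale: each worker touches only its own slice of the memory-mapped matrix.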

The bigmemory project recently won the 2010 John M. Chambers Statistical Software Award, which was presented to Mike at the Joint Statistical Meetings in August 2010.


