Handling big data sets has always been a concern for R users. Once a data set grows beyond about 50% of RAM, it is considered “massive” and can become practically impossible to work with on a standard machine. The bigmemory project, by Michael Kane and Jay Emerson, is one approach to dealing with this class of data set. Last Monday, December 13th, the New England R Users Group warmly welcomed Michael Kane to talk about bigmemory and R.
bigmemory is one of five packages in the bigmemory project, and it is designed to extend R to better handle large data sets. The core data structures are written in C++ and allow R users to create matrix-like data objects (called big.matrix). These big.matrix objects support standard matrix-style indexing, so they can be used in much the same way as a standard R matrix. The backing store for a big.matrix can be a memory-mapped file, allowing it to take on sizes much larger than available RAM.
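To make the file-backed idea concrete, here is a minimal sketch, assuming the bigmemory package is installed. The file names and the temporary directory are hypothetical, chosen only for illustration, and the matrix is tiny so the example runs quickly.

```r
library(bigmemory)

# Hypothetical working directory and file names, for illustration only.
backing_dir <- tempfile("bigmemory-demo-")
dir.create(backing_dir)

# Create a 1000 x 3 matrix of doubles whose data lives in a file on disk.
x <- filebacked.big.matrix(
  nrow = 1000, ncol = 3, type = "double", init = 0,
  backingfile = "example.bin",
  descriptorfile = "example.desc",
  backingpath = backing_dir
)

# Read and write with ordinary matrix indexing.
x[1:5, 1] <- c(10, 20, 30, 40, 50)
col_mean <- mean(x[, 1])
print(col_mean)

# The descriptor file lets another R session attach to the same data.
y <- attach.big.matrix(file.path(backing_dir, "example.desc"))
print(y[1:5, 1])
```

Because the data sits in the backing file rather than in R's heap, the same pattern should apply to matrices far larger than this toy one.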
Mike discussed the use of bigmemory on the well-known Airline on-time data, which covers more than 20 years and roughly 120 million commercial US airline flights. The data set is roughly 12 GB in size, and since the recommended maximum data size for read.table() is 10%-20% of RAM, it is nearly impossible to work with on a standard machine. With bigmemory, however, you can read in and analyze the data without problems.
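As a rough sketch of what reading such a file might look like, the example below writes a tiny stand-in CSV and loads it with read.big.matrix(). The values, file names, and the choice of the Year, Month, and ArrDelay columns are assumptions made for illustration; this is not the code shown in the talk.

```r
library(bigmemory)

# Write a tiny stand-in CSV (hypothetical values; the real files are far larger).
work_dir <- tempfile("airline-demo-")
dir.create(work_dir)
csv_path <- file.path(work_dir, "flights.csv")
write.csv(
  data.frame(Year = c(1990L, 1995L, 2001L, 2005L),
             Month = c(11L, 3L, 6L, 1L),
             ArrDelay = c(5L, -2L, 30L, 12L)),
  csv_path, row.names = FALSE, quote = FALSE
)

# Parse the CSV once into a file-backed big.matrix of integers.
flights <- read.big.matrix(
  csv_path, header = TRUE, type = "integer",
  backingfile = "flights.bin",
  descriptorfile = "flights.desc",
  backingpath = work_dir
)

# Pull out only the column that is needed, then summarize it.
mean_delay <- mean(flights[, "ArrDelay"])
print(mean_delay)
```

A big.matrix holds a single numeric type, so character columns in a real data set would need to be recoded as numbers before loading.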
Mike also showed how bigmemory can be used with the MapReduce (or split-apply-combine) method to greatly reduce the time required by many statistical calculations. For example, to determine whether older planes suffer greater delays, you need to know how old each of the 13,000 planes is. Running on a standard single-core system, this calculation is estimated to take nearly 9 hours. Even running in parallel on 4 cores, it can take nearly 2 hours. However, using bigmemory and the split-apply-combine method, the computation takes a little over one minute!
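The split-apply-combine pattern itself can be sketched in a few lines of base R. The toy data frame below is hypothetical, and using the earliest month a plane appears as a stand-in for its age is an assumption for illustration; the talk's actual code worked on a big.matrix and is not reproduced here.

```r
# Toy stand-in for the airline data: one row per flight (hypothetical values).
flights <- data.frame(
  plane = c("N1", "N2", "N1", "N3", "N2", "N3"),
  year  = c(1995, 2001, 1990, 2005, 1999, 2007),
  month = c(3, 6, 11, 1, 12, 2)
)

# Split: group the row indices by plane in a single pass over the data.
rows_by_plane <- split(seq_len(nrow(flights)), flights$plane)

# Apply: for each plane, find the earliest month in which it appears.
first_month <- sapply(rows_by_plane, function(idx) {
  min(flights$year[idx] * 12 + flights$month[idx])
})

# Combine: sapply has already collected the results into a named vector.
print(first_month)
```

The saving comes from the split step: the rows for each plane are located once up front, instead of rescanning the whole data set for every plane.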
The bigmemory project was recently awarded the 2010 John M. Chambers Statistical Software Award, which was presented to Mike at the 2010 Joint Statistical Meetings held in August.