Tips on Computing with Big Data in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Joseph Rickert
The Revolution R Enterprise 7.0 Getting started Guide makes a distinction between High Performance Computing (HPC) which is CPU centric, focusing on using many cores to perform lots of processing on small amounts of data, and High Performance Analytics (HPA), data centric computing that concentrates on feeding data to cores, disk I/O, data locality, efficient threading, and data management in RAM. The following collection of tips for computing with big data is an abbreviated version of the Guide’s discussion of the HPC and HPA considerations underlying the design of Revolution R Enterprise 7.0 and RevoScaleR, Revolution’s R package for HPA computing.
1 – Upgrade your hardware
It doesn’t hurt to state the obvious: bigger is better. In general, memory is the most important consideration. Getting more cores can also help, but only up to a point since R itself can generally only use one core at a time internally. Moreover, for many data analysis problems the bottlenecks are disk I/O and the speed of RAM, making it difficult to efficiently use more than 4 or 8 cores on commodity hardware.
2 – Upgrade your software
R allows its core math libraries to be replaced. Doing so can provide a very noticeable performance boost to any function that make use of computational linear algebra algorithms. Revolution R Enterprise links in the Intel Math Kernel Libraries.
3 – Minimize copies of the data
R does quite a bit of automatic copying. For example, when a data frame is passed into a function a copy of the data is made if the data frame is modified, and putting a data frame into a list also automatically causes a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. Memory management is important.
4 – Process data in chunks
Processing data a chunk at a time is the key to being able to scale computations without increasing memory requirements. External memory algorithms load a manageable amount of data into RAM, perform some intermediate calculations, load the next chunk and keep going until all of the data has been processed. Then, the final result is computed from the set of intermediate results. There are several CRAN packages including biglm, bigmemory, ff and ffbase that either implement external memory algorithms or help with writing them. Revolution R Enterprise’s RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the computational resources to run its algorithms in parallel.
5 – Compute in parallel across cores or nodes
Using all of the available cores and nodes is key to scaling computations to really big data. However, since data analysis algorithms tend to be I/O bound when data cannot fit into memory, the use of multiple hard drives can be even more important than the use of multiple cores. The CRAN package foreach provides easy-to-use tools for executing R functions in parallel on both on a single computer and across multiple computers. The foreach() function is particularly useful for “embarrassingly parallel” computations that do not involve communication among different tasks.
The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithm’s (PEMA’s). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters.)
6 – Take advantage of integers
In R, the two choices for “continuous” data are numeric, an 8 byte (double) floating point number and integer, a 4 byte integer. There are circumstances where storing and processing integer data can provide the dual advantages using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating point values in the range from 0 to 1,000, converting to integers and tabulating will bound the median or any other quantile to within two adjacent integers. Then interpolation can get you even closer approximation.
7 – Store data efficiently
You will want to store big data so that it can efficiently accessed from disk. The use of appropriate data types can save both storage space and access time. Take advantage of integers and, when you can, store data in 32-bit floats not 64-bit doubles. A 32-bit float can represent 7 decimal digits of precision, which is more than enough for most data, and it takes up half the space of doubles. Save the 64-bit doubles for computations.
8 – Only read the data needed
Even though a data set may have many thousands of variables, typically not all of them are being analyzed at one time. By reading from disk just the actual variables and observations you will use in analysis, you can speed up the analysis considerably.
9 – Avoid loops when transforming data
Loops in R can be very slow compared with R’s core vector operations which are typically written in C, C++ or Fortran, compiled languages that execute much quicker than the R interpreter.
10 – Use C, C++, or Fortran for critical functions
One R’s great strengths is its ability to integrate easily with other languages, including C, C++, and Fortran. You can pass R data objects to other languages, do some computations, and return the results in R data objects. The CRAN package Rcpp,for example, makes it easy to call C and C++ code from R.
11 – Process data transformations in batches
When working with small data sets, it is common to perform data transformations one at a time. For instance, one line of code might create a new variable, and subsequent lines perform additional transformations with each transformation requiring a pass through the data. To avoid the overhead of making multiple passes over a large data set write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR’s rxDataStep() function is designed for one pass processing by permitting multiple data transformations to be performed on each chunk.
12 – User row-oriented data transformations where possible
When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.
13 – Handle categorical variables efficiently and with care
Working with categorical or factor variables in big data sets can be challenging. For starters, not all of the factor levels may be represented in a single chunk of data. Using R’s factor() function in a transformation on a chunk of data without explicitly specifying all of the levels that are present in the entire data set might cause you to end up with incompatible factor levels from chunk to chunk. Also, building models with factors having hundreds of levels may cause hundreds of dummy variables to be created that really eat up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and do not generally explicitly create dummy variables to represent factors.
14 – Be aware of 0utput with the same number of rows as your input
Most analysis functions return a relatively small object of results that can easily be handled in memory. Occasionally, however, output will have the same number of rows as the data: when computing predictions and residuals for example. In order for this to scale, you will want the output written out to a file rather than kept in memory.
15 – Think Twice Before Sorting
Sorting is by nature a time-intensive operation. Do what you can to avoid sorting a large data set. Use functions that compute estimates of medians and quantiles and look for implementations of popular algorithms that avoid sorting. For example, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather that with the raw data itself.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.