At the Strata conference in New York today, Steve Yun (Principal Predictive Modeler at Allstate's Research and Planning Center) described the various ways he tackled the problem of fitting a generalized linear model to 150M records of insurance data. He evaluated several approaches:
- Proc GENMOD in SAS
- Installing a Hadoop cluster
- Using open-source R (both on the full data set, and on using sampling)
- Running the data through Revolution R Enterprise
Steve described each of the approaches as follows.
Approach 1 is current practice. (Allstate is a big user of SAS, but there's a growing contingent of R users.) Proc GENMOD takes around 5 hours to return results for a Poisson model with 150 million observations and 70 degrees of freedom. "It's difficult to be productive on a tight schedule if it takes over 5 hours to fit one candidate models!", Steve said.
Approach 2: It was hoped that installing a Hadoop cluster and running the model there would improve performace. According to Steve, "a lot of plumbing was required: this involved coding the matrix equations for iteratively-reweighted least squares as a map-reduce task (using the rmr package), and manually coding the factor variables as indicator columns in the design matrix. Unfortunately, each iteration took abour 1.5 hours, with 5-10 iterations required to convergence. (Even then, there were problems with singularites in the design matrix.)
Approach 3: Perhaps installing R on a server with lots of RAM would help. (Because open-source R runs in-memory, you need RAM in the order of several times the size of the data to make it work.) Alas, not even a 250Gb server was sufficient: even after waiting three days, the data couldn't even be loaded. Sampling the data down into 10 partitions was more successful, and allowed for the use of the glmnet package and L1 regularization to automate the variable selection process. But each glmnet fit on a partition still took over 30 minutes, and Steve said it would be difficult for managers to accept a process that involved sampling.
Approach 4: Steve turned to Revolution Analytics' Joe Rickert to evaluate how long the same model would take using the big-data RevoScaleR package in Revolution R Enterprise. Joe loaded the data onto a 5-node cluster (20 cores total), and used the distributed rxGlm function, which was able to process the data in 5.7 minutes. Joe demonstrated this process live during the session.
So in summary, here's how the four approaches fared:
Approach |
Platform |
Time to fit |
1: SAS |
16-core Sun Server |
5 hours |
2: rmr / map-reduce |
10-node (8 cores / node) Hadoop |
> 10 hours |
3: Open source R |
250 GB Server |
Impossible (> 3 days) |
4: RevoScaleR |
5-node (4 cores / node) LSF |
5.7 minutes |
That's quite a difference! So what have we learned:
- SAS works, but is slow.
- It's possible to program the model in Hadoop, but it's even slower.
- The data is too big for open-source R, even on a very large server.
- Revolution R Enterprise gets the same results as SAS, but about 50x faster.
Steve and Joe's slides and video of the presentation, Start Small Before Going Big, will be available on the Strata website in due course.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...