By Joseph Rickert
In a recent blog post, David Smith reported on a talk that Steve Yun and I gave at STRATA in NYC about building and benchmarking Poisson GLM models on various platforms. The results presented showed that the rxGlm function from Revolution Analytics’ RevoScaleR package running on a five-node cluster outperformed a MapReduce/Hadoop implementation as well as an implementation of legacy software running on a large server. An alert R user posted the following comment on the blog:
As a poisson regression was used, it would be nice to also see as a benchmark the computational speed when using the biglm package in open source R? Just import your csv in sqlite and run biglm to obtain your poisson regression. Biglm also loads in data in R in chunks in order to update the model so that looks more similar to the RevoScaleR setup then just running plain glm in R.
This seemed like a reasonable, simple enough experiment, so we tried it. The benchmark results presented at STRATA were run on a 145 million record file, but as a first step I thought I would try it on a 14 million record subset that I already had loaded on my PC, a quad-core Dell with an i7 processor and 8 GB of RAM. It took almost an hour to build the SQLite database:
# Make a SQLite database out of the csv file
library(sqldf)
sqldf("attach AdataT2SQL as new")
file <- file.path(getwd(), "AdataT2.csv")
read.csv.sql(file,
             sql = "create table main.AT2_10Pct as select * from file",
             dbname = "AdataT2SQL", header = TRUE)
. . . and then just a couple of lines of code to set up the connections and run the model.
# BIGGLM
library(biglm)
library(RSQLite)   # provides dbConnect(), dbDriver(), dbGetQuery()
#--------------------
# Set up the database connections
tablename <- "main.AT2_10Pct"
DB <- file.path(getwd(), "AdataT2SQL")
conn <- dbConnect(dbDriver("SQLite"), dbname = DB)
modelvars <- all.vars(formula)   # formula for the benchmark model, defined earlier
query <- paste("select ", paste(modelvars, collapse = ", "), " from ", tablename)
#--------------------
# Run bigglm
gc()
system.time(model <- bigglm(formula = formula,
                            data = dbGetQuery(conn, query),
                            family = poisson(),
                            chunksize = 10000, maxit = 10))
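As an aside, bigglm can also be fed data incrementally rather than through a single dbGetQuery call: its data argument accepts a function of one argument, reset, that returns the next chunk of rows (or NULL when exhausted) and restarts when called with reset = TRUE. The sketch below shows that pattern under the same connection and query as above; the helper name make_fetcher is my own, not part of biglm.

```r
# A hedged sketch of chunked fetching for bigglm, assuming conn and query
# are set up as in the code above. make_fetcher is a hypothetical helper.
make_fetcher <- function(conn, query, chunksize = 10000) {
  res <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(res)) dbClearResult(res)
      res <<- dbSendQuery(conn, query)   # (re)start the result set
      return(NULL)
    }
    chunk <- dbFetch(res, n = chunksize) # pull the next chunksize rows
    if (nrow(chunk) == 0) NULL else chunk
  }
}
# model <- bigglm(formula, data = make_fetcher(conn, query), family = poisson())
```

This keeps only one chunk in memory at a time, which is closer in spirit to how RevoScaleR streams over an .XDF file.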
Unfortunately, the model didn’t run to completion. The error messages returned were of the form:
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#   contrasts can be applied only to factors with 2 or more levels
# In addition: There were 50 or more warnings (use warnings() to see the first 50)
# Timing stopped at: 186.39 80.81 470.33
warnings()
# 1: In model.matrix.default(tt, mf) : variable 'V1' converted to a factor
# 2: In model.matrix.default(tt, mf) : variable 'V2' converted to a factor
This error suggests that, while chunking through the data, bigglm came across a variable that needed to be converted into a factor. But since only one value of that variable appeared in the chunk that was in memory, bigglm threw an error.
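The failure mode is easy to reproduce in miniature. The snippet below uses made-up data, not the benchmark data set: a character variable that happens to take a single value within one chunk yields a one-level factor, and R's contrasts machinery refuses to build a design matrix from it.

```r
# Hypothetical illustration: a chunk in which a character variable takes
# only one value produces a single-level factor, triggering the same error.
set.seed(1)
chunk <- data.frame(y  = rpois(100, 2),
                    V1 = rep("A", 100),   # only one level visible in this chunk
                    stringsAsFactors = TRUE)
try(glm(y ~ V1, data = chunk, family = poisson()))
# Error in `contrasts<-`(`*tmp*`, ...) :
#   contrasts can be applied only to factors with 2 or more levels
```

A full in-memory glm would never see this, because it scans all the data before building factors; a chunking algorithm only sees one slice at a time.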
In general, factors present a significant challenge for external memory algorithms. Not only might an algorithm fail to create factor variables, even when the algorithm runs there may be unanticipated consequences that cause big trouble downstream. For example, variations in text can cause attempts at automatic factor conversion to make several versions of the same variable. This, in turn, may make it impossible to merge files, or cause an attempt to predict results on a hold out data set to fail because the factor levels are different. Even more insidiously, when hundreds of variables are involved in a model, an analyst might not notice a few bogus factor levels.
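To make the text-variation hazard concrete, here is a small made-up example: a stray trailing space and a case difference produce factor levels that look alike but do not match between a training set and a holdout set.

```r
# Hypothetical data-entry variants create bogus, mismatched factor levels.
train   <- factor(c("NY", "CA", "NY "))   # "NY " has a trailing space
holdout <- factor(c("ny", "CA"))          # "ny" differs only in case
levels(train)     # "CA" "NY" "NY " -- two versions of the same state
levels(holdout)   # "CA" "ny"       -- no "NY" level at all
```

With 139 variables, a handful of such phantom levels can slip through entirely unnoticed.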
bigglm does not provide a mechanism for setting factor levels on the fly. In my opinion, far from being a fault, this was an intelligent design choice. rxGlm, RevoScaleR’s function for building GLM models, does provide some capability to work with factors on the fly, but this is not recommended practice: too many things can go wrong. The recommended approach is to use RevoScaleR’s rxFactors function on data stored in RevoScaleR’s native .XDF file format. rxFactors gives the user very fine control of factor variables: factor levels can be set, sorted, created and merged.
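In base R, the closest safe equivalent is to declare the complete set of levels up front, so that every chunk produces the same design matrix regardless of which values it happens to contain. The level names below are hypothetical:

```r
# Fix the full set of levels in advance (known from a prior pass over the
# data), so a chunk containing a single value still yields all levels.
chunk <- data.frame(state = c("NY", "NY"))   # only one value observed here
chunk$state <- factor(chunk$state, levels = c("CA", "NY", "TX"))
levels(chunk$state)   # "CA" "NY" "TX" -- all levels survive
```

This is essentially what rxFactors automates, and what must be done by hand when feeding chunks to bigglm.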
The analogous course of action with bigglm would be to set up the factor variables properly in the database. Whenever I have database problems, my go-to guy is my colleague Steve Weller. Steve loaded the data into a MySQL database installed on a quad-core PC with 8 GB of RAM running Windows 2008 Server R2 Standard. He manually added new indicator variables to the database corresponding to the factor levels in the original model, and built a model that was almost statistically equivalent to the original (we never quite got the contrasts right) but good enough to benchmark. It took bigglm about 27 minutes to run working off the MySQL database. By comparison, rxGlm completed in less than a minute on Steve’s test machine. We have not yet tried to run bigglm on the entire 145 million record dataset. It would be nice to know whether bigglm scales linearly with the number of records. If it does, that would bring bigglm in at about 4.5 hours to process the entire data set, considerably longer than the 54 minutes it took to process the large data set with RevoScaleR on my PC.
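A rough sketch of the indicator-variable approach Steve used follows. The table, column, and level names here are hypothetical (the real benchmark schema is proprietary), and conn is assumed to be an open DBI connection to the MySQL database:

```r
# Hedged sketch: add 0/1 indicator columns in SQL so bigglm works with
# numeric predictors and never has to build factor contrasts itself.
library(DBI)
dbExecute(conn, "ALTER TABLE benchmark_data ADD COLUMN state_NY INTEGER")
dbExecute(conn,
  "UPDATE benchmark_data
      SET state_NY = CASE WHEN state = 'NY' THEN 1 ELSE 0 END")
# ...repeat for the remaining levels, omitting one as the baseline,
# then model on state_NY (and its siblings) instead of factor(state).
```

Note that one level must be left out as the reference category; getting those contrasts exactly right across 40 independent variables is where our reconstruction fell slightly short of the original model.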
We would like to hear from R users who have built bigglm models on large data sets. Unfortunately, I cannot make the proprietary data set used in the benchmark available. However, it should not be too difficult to find a suitable publicly available substitute. Our benchmark data set had 145,814,000 rows and 139 variables, including integer, numeric, character and factor variables, and there were 40 independent variables in the original model. If you try, be prepared to spend a little time on the project. It is not likely to be as easy as the beguiling phrase in the comment to the blog post (“Just import your csv in sqlite and run biglm…”) would indicate.