Benchmarking bigglm

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

By Joseph Rickert

In a recent blog post, David Smith reported on a talk that
Steve Yun and I gave at STRATA in NYC about building and benchmarking Poisson
GLM models on various platforms. The results showed that the rxGlm
function from Revolution Analytics’ RevoScaleR package, running on a five-node
cluster, outperformed both a MapReduce/Hadoop implementation and an
implementation in legacy software running on a large server. An alert R user
posted the following comment on the blog:

As a poisson regression was used,
it would be nice to also see as a benchmark the computational speed when using
the biglm package in open source R? Just import your csv in sqlite and run
biglm to obtain your poisson regression. Biglm also loads in data in R in
chunks in order to update the model so that looks more similar to the
RevoScaleR setup then just running plain glm in R.

This seemed like a reasonable, simple enough experiment, so
we tried it. The benchmark results presented at STRATA were run on a
145-million-record file, but as a first step I thought I would try it on
a 14-million-record subset that I already had loaded on my PC, a quad-core Dell
with an i7 processor and 8 GB of RAM. It took almost an hour to build the
SQLite database:

# Make a SQLite database out of the csv file
library(sqldf)   # sqldf also loads DBI and RSQLite
sqldf("attach AdataT2SQL as new")
file <- file.path(getwd(), "AdataT2.csv")
read.csv.sql(file,
             sql = "create table main.AT2_10Pct as select * from file",
             dbname = "AdataT2SQL", header = TRUE)

. . . and then just a couple of lines of code to set up the
connections and run the model.

# Set up the database connections
library(DBI)
library(RSQLite)
library(biglm)
tablename <- "main.AT2_10Pct"
DB <- file.path(getwd(), "AdataT2SQL")
conn <- dbConnect(dbDriver("SQLite"), dbname = DB)
modelvars <- all.vars(formula)   # formula is the Poisson model formula, defined elsewhere
query <- paste("select", paste(modelvars, collapse = ", "), "from", tablename)
# Run bigglm; note that dbGetQuery() pulls the full result set into
# memory, which bigglm then processes in 10,000-row chunks
system.time(model <- bigglm(formula = formula,
                            data = dbGetQuery(conn, query),
                            family = poisson(), chunksize = 10000, maxit = 10))
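One thing worth noting: `dbGetQuery` materializes the entire query result in memory before bigglm begins chunking through it. The biglm package also documents a data-as-function interface, where `data` is a function that returns successive chunks (and NULL when exhausted). A minimal sketch of such a reader, here written over an in-memory data frame for illustration — the function name and the commented bigglm call are my own, not code from the benchmark:

```r
# Sketch of biglm's documented data-as-function interface: bigglm calls
# data(reset = FALSE) for each successive chunk, expects NULL when the
# data are exhausted, and calls data(reset = TRUE) to start over.
make_chunk_reader <- function(df, chunksize = 10000) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) { pos <<- 0; return(NULL) }
    if (pos >= nrow(df)) return(NULL)                 # no more chunks
    idx <- seq(pos + 1, min(pos + chunksize, nrow(df)))
    pos <<- max(idx)
    df[idx, , drop = FALSE]                           # next chunk
  }
}
# Hypothetical usage (assumes `formula` and a data frame `big_df`):
# model <- bigglm(formula, data = make_chunk_reader(big_df, 10000),
#                 family = poisson())
```

The same pattern can wrap `dbSendQuery`/`dbFetch` so that chunks come straight from the database rather than from a data frame already in RAM.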

Unfortunately, the model didn’t run to completion.  The error messages returned were of the form:

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#contrasts can be applied only to factors with 2 or more levels
#In addition: There were 50 or more warnings (use warnings() to see the first 50)
#Timing stopped at: 186.39 80.81 470.33
#1: In model.matrix.default(tt, mf) : variable 'V1' converted to a factor
#2: In model.matrix.default(tt, mf) : variable 'V2' converted to a factor

This error suggests that, while chunking through the data,
bigglm came across a variable that needed to be converted into a factor. But
since there was only one value for that variable in the chunk in memory, bigglm
threw an error.
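The failure mode is easy to reproduce in a few lines of base R (a toy chunk, not the benchmark data): when a categorical variable takes only one value in the chunk, the resulting one-level factor has no contrasts, and model fitting fails.

```r
# A toy "chunk" in which the variable V1 happens to take only one value
chunk <- data.frame(y = rpois(5, 2), V1 = factor(rep("A", 5)))

# glm() fails exactly as bigglm did on the real data:
# "contrasts can be applied only to factors with 2 or more levels"
res <- try(glm(y ~ V1, data = chunk, family = poisson()), silent = TRUE)
inherits(res, "try-error")  # TRUE
```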

In general, factors present a significant challenge for external-memory
algorithms. Not only might an algorithm fail to create factor variables;
even when the algorithm runs there may be unanticipated consequences that
cause big trouble downstream. For example, variations in text (“NY” vs.
“ny”, say) can cause automatic factor conversion to create several levels
for the same value. This,
in turn, may make it impossible to merge files, or cause an attempt to predict
results on a hold out data set to fail because the factor levels are different.
Even more insidiously, when hundreds of variables are involved in a model, an analyst
might not notice a few bogus factor levels.
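Both hazards can be demonstrated with a few lines of base R on toy data: a spelling variation silently becomes an extra factor level, and a holdout set containing an unseen level breaks prediction.

```r
# "NY" and "ny" become distinct levels of the same underlying value
train <- data.frame(y = c(1, 2, 3, 4),
                    state = factor(c("NY", "ny", "CA", "CA")))
nlevels(train$state)   # 3 levels, not the 2 the analyst intended

# A holdout set with a level never seen in training makes predict() fail
fit <- lm(y ~ state, data = train)
holdout <- data.frame(state = factor("TX"))
res <- try(predict(fit, newdata = holdout), silent = TRUE)
inherits(res, "try-error")   # TRUE: factor state has a new level
```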

bigglm does not provide a mechanism for setting factor
levels on the fly. In my opinion, far from being a fault, this was an
intelligent design choice. rxGlm, RevoScaleR’s function for building GLM models,
does provide some capability to work with factors on the fly. But, this is not recommended practice — too
many things can go wrong.  The
recommended way to do things is to use RevoScaleR’s rxFactors function on data
stored in RevoScaleR’s native .XDF file format.  rxFactors
provides the user with very fine control of factor variables. Factor levels can
be set, sorted, created and merged.
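rxFactors itself is part of the proprietary RevoScaleR package, so I can't show it here, but the discipline it enforces can be sketched in base R: declare the complete level set up front so every chunk (and any holdout set) encodes its factors identically, and normalize text before conversion so variant spellings merge into one level. The level names below are purely illustrative:

```r
# Declare the complete level set once, up front
all_levels <- c("CA", "NY", "TX")

# Two "chunks" that see different subsets of the data still agree
chunk1 <- factor(c("CA", "CA"), levels = all_levels)
chunk2 <- factor("TX",          levels = all_levels)
identical(levels(chunk1), levels(chunk2))   # TRUE

# Normalize text first so "NY" and "ny" collapse to a single level
cleaned <- factor(toupper(c("NY", "ny")), levels = all_levels)
nlevels(droplevels(cleaned))                # 1
```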

The analogous course of action with bigglm would be to set
up the factor variables properly in the database. Whenever I have database
problems, my go-to guy is my colleague Steve Weller. Steve loaded the data into
a MySQL database installed on a quad-core PC with 8 GB of RAM running Windows 2008
Server R2 Standard. He manually added new indicator variables to the database
corresponding to the factor levels in the original model, and built a model
that was almost statistically equivalent to the original model (we never quite
got the contrasts right) but good enough to benchmark.  It took bigglm about 27 minutes to run
working off the MySQL database. By comparison, rxGlm completed in less than a
minute on Steve’s test machine. We have not yet tried to run bigglm on the entire
145 million record dataset.  It would be
nice to know if bigglm scales linearly with the number of records.  If it does, that would bring bigglm in at
about 4.5 hours to process the entire data set, considerably longer than the 54
minutes it took to process the large data set with RevoScaleR on my PC.
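The 4.5-hour figure is straight linear extrapolation, assuming (as the table name AT2_10Pct suggests) that the subset is about a 10% sample of the full file:

```r
# Back-of-the-envelope: scale the 27-minute subset run to the full file
minutes_subset <- 27                     # bigglm time on the ~10% subset
scale_factor   <- 145814000 / 14581400   # full rows / subset rows = 10
hours_full     <- minutes_subset * scale_factor / 60
hours_full                               # 4.5
```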

We would be very interested to hear from R users who have built bigglm models
on large data sets. Unfortunately, I cannot make the proprietary data set used
in the benchmark available. However, it should not be too difficult to find a suitable
publicly-available substitute. Our benchmark data set had 145,814,000 rows and
139 variables. These included integer, numeric, character and factor variables.
There were 40 independent variables in the original model. If you try, be
prepared to spend a little time on the project. It is not likely to be as easy
as the beguiling phrase in the comment to the blog post (“Just import your csv in sqlite and run biglm…”) would indicate.
