**Revolutions**, and kindly contributed to R-bloggers)

*By Joseph Rickert*

In a recent blog post, David Smith reported on a talk that

Steve Yun and I gave at STRATA in NYC about building and benchmarking Poisson

GLM models on various platforms. The results presented showed that the rxGlm

function from Revolution Analytics’ RevoScaleR package running on a five node

cluster outperformed a Map Reduce/ Hadoop implementation as well as an

implementation of legacy software running on a large server. An alert R user

posted the following comment on the blog:

As a poisson regression was used,

it would be nice to also see as a benchmark the computational speed when using

the biglm package in open source R? Just import your csv in sqlite and run

biglm to obtain your poisson regression. Biglm also loads in data in R in

chunks in order to update the model so that looks more similar to the

RevoScaleR setup then just running plain glm in R.

This seemed like a reasonable, simple enough experiment. So

we tried it. The benchmark results presented at STRATA were done on a 145

million record file, but as a first step, I thought that I would try it on

a 14 million record subset that I already had loaded on my PC, a quad core Dell,

with i7 processors and 8GB of RAM. It

took almost an hour to build the SQLite data base:

. . . and then just a couple of lines of code to set up the

connections and run the model.

# BIGGLM library(biglm) #-------------------- # Set up the data base cinnections tablename <- "main.AT2_10Pct" DB <- file.path(getwd(),"AdataT2SQL") conn <- dbConnect(dbDriver("SQLite"),dbname=DB) modelvars <- all.vars(formula) query <- paste("select ", paste(modelvars, collapse = ", ")," from ", tablename) #-------------------- # Run bigglm gc() system.time(model <- bigglm (formula = formula, data = dbGetQuery(conn,query),family = poisson(),chunksize=10000,maxit=10))

Unfortunately, the model didn’t run to completion. The error messages returned were of the form:

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : #contrasts can be applied only to factors with 2 or more levels #In addition: There were 50 or more warnings (use warnings() to see the first 50) #Timing stopped at: 186.39 80.81 470.33 warnings() #1: In model.matrix.default(tt, mf) : variable 'V1' converted to a factor #2: In model.matrix.default(tt, mf) : variable 'V2' converted to a factor

This error suggests that while chunking through the data

bigglm came across a variable that should be converted into a factor. But,

since there was only value for the variable in the chunk that was in memory bigglm threw

an error.

In general, factors present a significant challenge for external

memory algorithms. Not only might an algorithm

fail to create factor variables, even when the algorithm runs there may be

unanticipated consequences that cause big trouble downstream. For example,

variations in text can cause attempts at automatic factor conversion to make

several versions of the same variable. This,

in turn, may make it impossible to merge files, or cause an attempt to predict

results on a hold out data set to fail because the factor levels are different.

Even more insidiously, when hundreds of variables are involved in a model, an analyst

might not notice a few bogus factor levels.

bigglm does not provide a mechanism for setting factor

levels on the fly. In my opinion, far from being a fault, this was an

intelligent design choice. rxGlm, RevoScaleR’s function for building GLM models,

does provide some capability to work with factors on the fly. But, this is not recommended practice — too

many things can go wrong. The

recommended way to do things is to use RevoScaleR’s rxFactors function on data stored

in RevoScaleR native .XDF file. rxFactors

provides the user with very fine control of factor variables. Factor levels can

be set, sorted, created and merged.

The analogous course of action with bigglm would be to set

up the factor variables properly in the data base. Whenever, I have database

problems, my go to guy is my colleague Steve Weller. Steve loaded the data into

a MySQL database installed on a quad-core PC with 8 GB of RAM running Windows 2008

Server R2 Standard. He manually added new indicator variables to the database

corresponding to the factor levels in the original model, and built a model

that was almost statistically equivalent to the original model (we never quite

got the contrasts right) but good enough to benchmark. It took bigglm about 27 minutes to run

working off the MySQL database. By comparison, rxGlm completed in less than a

minute on Steve’s test machine. We have not yet tried to run bigglm on the entire

145 million record dataset. It would be

nice to know if bigglm scales linearly with the number of records. If it does, that would bring bigglm in at

about 4.5 hours to process the entire data set, considerably longer than the 54

minutes it took to process the large data set with RevoScaleR on my PC.

It would be nice to hear from R users who have built bigglm models

on large data sets. Unfortunately, I cannot make the proprietary data set used

in the benchmark available. However, it should not be too difficult to find a suitable

publicly-available substitute. Our benchmark data set had 145,814,000 rows and

139 variables. These included integer, numeric, character and factor variables.

There were 40 independent variables in the original model. If you try, be

prepared to spend a little time on the project. It is not likely to be as easy

as the beguiling phrase in the comment to the blog post (*“Just import your csv in sqlite and run biglm…”*) would indicate.

**leave a comment**for the author, please follow the link and comment on his blog:

**Revolutions**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...