Big Data Logistic Regression with R and ODBC

December 7, 2010

(This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers)

Recently I've been doing a lot of work with predictive models using logistic regression.  Logistic regression is great for estimating the probability of each outcome of a binary dependent (target) variable, and R is a great tool for the task.  I often use the base function glm to develop a model.  Yet there are times when, because of hardware or software memory restrictions, the usual glm function is not enough to get the job done.
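For reference, here is a minimal sketch of the usual in-memory approach with glm.  The data are simulated purely for illustration, and the column names targetvar and myvars are placeholders rather than anything from a real table.

```r
# Simulated data: a binary outcome driven by one numeric predictor.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
dat <- data.frame(targetvar = y, myvars = x)

# Fit an ordinary in-memory logistic regression.
fit_glm <- glm(targetvar ~ myvars, data = dat, family = binomial())
print(coef(fit_glm))

# Predicted probability that targetvar == 1 for a few new observations.
newdat <- data.frame(myvars = c(-1, 0, 1))
p_hat <- predict(fit_glm, newdata = newdat, type = "response")
print(p_hat)
```

This works well as long as the data frame and the intermediate model matrices fit comfortably in memory.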

A great alternative for running the usual logistic regression analyses on big data is the biglm package.  Its bigglm function fits the same model but processes the data in "chunks", so R only has to do calculations on one small piece of the data at a time and does not need a large memory allocation for the model fit.  biglm also has an interesting option: besides imported data frames and text files, it can work with data that comes from a database connection.  This is where the helpful RODBC package comes in.
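To make the chunking idea concrete, here is a small sketch that feeds bigglm one chunk at a time and compares the result with glm.  The data are simulated, and make_chunk_source is a hypothetical helper written for this illustration, not part of biglm.  It relies on bigglm accepting, as its data argument, a function of reset that returns the next chunk or NULL when the data run out.

```r
library(biglm)

# Simulated data for illustration only.
set.seed(1)
n <- 10000
dat <- data.frame(myvars = rnorm(n))
dat$targetvar <- rbinom(n, size = 1, prob = plogis(0.3 - 0.8 * dat$myvars))

# Hypothetical helper: serve a data frame in pieces of chunk_rows rows.
make_chunk_source <- function(df, chunk_rows) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0            # bigglm rereads the data on every iteration
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)   # no more data in this pass
    idx <- seq(pos + 1, min(pos + chunk_rows, nrow(df)))
    pos <<- max(idx)
    df[idx, , drop = FALSE]
  }
}

fit_chunked <- bigglm(targetvar ~ myvars,
                      data = make_chunk_source(dat, chunk_rows = 500),
                      family = binomial(), maxit = 10)
fit_full <- glm(targetvar ~ myvars, data = dat, family = binomial())

# The two sets of estimates should agree closely.
print(cbind(chunked = coef(fit_chunked), full = coef(fit_full)))
```

Here the whole data frame is already in memory, so nothing is saved; the point is only to show that fitting chunk by chunk gives essentially the same estimates as the ordinary fit.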

I have been looking all over the R support lists and blogs in hopes of finding a good tutorial on using biglm with RODBC.  I was not successful, yet I was able to work out how to do it myself.




The first step is to establish an ODBC source for a database.  In this example I am using a Windows environment and connecting to MS SQL Server.  An ODBC data source must first be set up on the computer, which is usually done in the Windows Control Panel.  Once that is done, RODBC can be used to establish a connection.  My example uses an ODBC data source name (DSN) called "sqlserver".

library(RODBC)
# The DSN must be passed as a character string
myconn <- odbcConnect("sqlserver")

Now an ODBC connection object is established.  Queries can be submitted to the SQL Server via the sqlQuery function, which is what we will use as the data source.  The SQL can be a typical select statement.


sqlqry <- "select myvars, targetvar from mytable"

The next step is to use the bigglm function to perform the logistic regression.


library(biglm)
fit <- bigglm(targetvar ~ myvars, data=sqlQuery(myconn, sqlqry), family=binomial(), chunksize=100, maxit=10)
summary(fit)


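The data is pulled from the SQL Server via the sqlQuery function from the RODBC package.  sqlQuery returns a data frame, and bigglm then works through that data frame in chunks.  The chunksize value specifies the number of rows to process at a time.  The maxit value specifies the maximum number of Fisher scoring iterations.  Note that sqlQuery returns the full result set, so the query result itself still has to fit in memory; the chunking limits the memory needed for the model fit.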
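If even the query result is too large to hold in memory, one possible variation is to hand bigglm a function that fetches rows from the database a chunk at a time.  The sketch below is an assumption-laden illustration rather than a tested recipe: make_odbc_source is a hypothetical helper, it uses RODBC's odbcQuery to submit the query and sqlGetResults with max to fetch a limited number of rows, and it assumes the driver lets further calls to sqlGetResults continue through the same result set.

```r
# Hypothetical helper (not part of RODBC or biglm): wrap a query as a
# chunk-yielding function of the form bigglm() accepts for `data`.
make_odbc_source <- function(channel, query, chunk_rows = 5000) {
  function(reset = FALSE) {
    if (reset) {
      # Re-submit the query so the next fetch starts from the first row.
      RODBC::odbcQuery(channel, query)
      return(NULL)
    }
    chunk <- RODBC::sqlGetResults(channel, max = chunk_rows)
    # Assumption: once the rows are exhausted the result is no longer a
    # non-empty data frame, so anything else is treated as "no more data".
    if (!is.data.frame(chunk) || nrow(chunk) == 0) return(NULL)
    chunk
  }
}

# Usage, assuming the myconn connection and sqlqry query from this post:
# library(biglm)
# fit <- bigglm(targetvar ~ myvars,
#               data = make_odbc_source(myconn, sqlqry, chunk_rows = 5000),
#               family = binomial(), maxit = 10)
# summary(fit)
```

Because bigglm rereads the data on every iteration, the query is re-run up to maxit times, which trades database work for memory.  Factor variables also need the same levels in every chunk, so it is safest to set the levels explicitly when using this approach.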

This big data method is not difficult once the SQL Server connection is set up.  You will notice that R runs into far fewer memory limitations while performing logistic regressions this way.  For more information on regression modeling I recommend getting Applied Logistic Regression (Wiley Series in Probability and Statistics).  This book has been essential reading for statistics and applied regression practitioners.  Other good resources are Logistic Regression Models (Chapman & Hall/CRC Texts in Statistical Science) and Modern Regression Techniques Using R: A Practical Guide.


