Recently I’ve been doing a lot of work with predictive models using logistic regression. Logistic regression is great for estimating the probability of a binary dependent (target) variable, and R is a great tool for the task. Often I will use the base function glm to develop a model. Yet there are times, due to hardware or software memory restrictions, when the usual glm function is not enough to get the job done.

A great alternative for performing logistic regression on big data is the biglm package. biglm performs the same regression optimization but processes the data in “chunks”, so R only ever holds a small subset of the data in memory rather than requiring a large allocation up front. biglm can also work not only with imported data frames and text files but with database connections, and this is where the helpful RODBC package comes to the rescue.

I have been looking all over the R support lists and blogs hoping to find a good tutorial on using biglm with RODBC. I was not successful, but I was able to work out how to do it myself.

The first step is to establish an ODBC source to a database. In this example I am using a Windows environment and connecting to a MS SQL Server. An ODBC source must first be set up on the computer, usually through the Windows Control Panel. Once that is done, RODBC can establish a connection. My example uses an ODBC data source name (DSN) called “sqlserver”.

library(RODBC)

myconn <- odbcConnect("sqlserver")

Now an ODBC connection object is established. Queries can be submitted to the SQL Server via the sqlQuery function, which is what we will use as the data source. The SQL script can be a typical select statement.

sqlqry <- "select myvars, targetvar from mytable"
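Before fitting anything, it can be worth a quick sanity check that the query returns what you expect. A minimal sketch, reusing the placeholder DSN, table, and column names from above (substitute your own):

```r
library(RODBC)

# Hypothetical DSN and query from the examples above
myconn <- odbcConnect("sqlserver")
sqlqry <- "select myvars, targetvar from mytable"

# Pull the result once and inspect it before handing it to bigglm
dat <- sqlQuery(myconn, sqlqry)
str(dat)    # column types as R sees them
head(dat)   # first few rows
nrow(dat)   # total rows returned
```

In particular, check that the target variable came back as 0/1 (or a two-level factor) and that numeric predictors were not imported as character columns.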

Next, use the bigglm function to perform the logistic regression.

library(biglm)

fit <- bigglm(targetvar ~ myvars, data=sqlQuery(myconn, sqlqry), family=binomial(), chunksize=100, maxit=10)

summary(fit)

The data are pulled from the SQL Server via the sqlQuery function from the RODBC package, and bigglm recognizes the result of sqlQuery as a data frame. The chunksize argument specifies the number of rows to process at a time, and maxit sets the maximum number of Fisher scoring iterations.
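Once the model is fitted, you can extract the coefficients and score new observations by hand. A minimal sketch, assuming the fitted object fit from above with a single predictor named myvars (newdat is a hypothetical data frame you would supply); plogis is the inverse logit, converting the linear predictor to a probability:

```r
# Coefficients from the fitted bigglm object
b <- coef(fit)

# Hypothetical new observations with the same predictor column
newdat <- data.frame(myvars = c(1.2, 3.4))

# Linear predictor, then predicted probability via the inverse logit
eta  <- b["(Intercept)"] + b["myvars"] * newdat$myvars
prob <- plogis(eta)
prob

# Close the ODBC connection when finished
odbcClose(myconn)
```

With more than one predictor, the same idea applies using a model matrix (model.matrix on newdat) multiplied by the coefficient vector.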
