A good alternative to running an ordinary logistic regression on big data is the biglm package. Its bigglm function fits the same model but processes the data in "chunks", so R only has to perform calculations on a small piece of the data at a time instead of allocating a large block of memory. Another interesting feature of biglm is that it can work not only with imported data frames and text files but also with database connections. This is where the RODBC package comes in.
I have been looking all over the R support lists and blogs in hopes of finding a good tutorial on using biglm with RODBC. I did not find one, but I was able to work out how to do it myself.
The first step is to establish an ODBC source for a database. In this example I am using a Windows environment and connecting to MS SQL Server. An ODBC data source must first be set up on the computer, which is usually done in the Windows Control Panel. Once that is done, RODBC can be used to establish a connection. My example uses an ODBC data source name called "sqlserver".
library(RODBC)
myconn <- odbcConnect("sqlserver")  # the DSN is passed as a character string
Now an ODBC connection object is established. Queries can be submitted to the SQL Server through the sqlQuery function, which is what we will use as the data source. The SQL can be an ordinary select statement.
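Before fitting anything, it can be worth checking that a query returns what you expect. The following is a minimal sketch, assuming the same "sqlserver" data source name and the placeholder table and column names used in this post. preview_query is a hypothetical helper of my own, not part of RODBC. It relies on the max argument that sqlQuery passes through to sqlGetResults to limit the number of rows fetched.

```r
library(RODBC)

# Hypothetical helper: fetch only the first few rows of a query so the
# column names and types can be checked before pulling the full data.
preview_query <- function(conn, query, n = 5) {
  res <- sqlQuery(conn, query, max = n)
  # With the default errors = TRUE, sqlQuery returns a character vector of
  # error messages (not a data frame) when the query fails.
  if (!is.data.frame(res)) {
    stop(paste(res, collapse = "\n"))
  }
  res
}

# Usage, assuming the "sqlserver" DSN exists on this machine:
# myconn <- odbcConnect("sqlserver")
# str(preview_query(myconn, "select myvars, targetvar from mytable"))
# odbcClose(myconn)  # release the connection when finished
```

Stopping on a non-data-frame result is a design choice here: it surfaces a bad query right away instead of handing an error string to the model fitting step.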
sqlqry <- "select myvars, targetvar from mytable"
The next step is to use the bigglm function to perform the logistic regression.
library(biglm)
fit <- bigglm(targetvar ~ myvars, data=sqlQuery(myconn, sqlqry), family=binomial(), chunksize=100, maxit=10)
The data is pulled from the SQL Server by the sqlQuery function from the RODBC package. sqlQuery returns the query result to R as a data frame, and bigglm then works through that data frame in chunks. The chunksize argument specifies the number of rows to process at a time. The maxit argument specifies the maximum number of Fisher scoring iterations.
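To see what the chunking does without needing a database, here is a minimal sketch on simulated data. Everything in it other than bigglm and glm is an assumption made for illustration: the simulated data frame sim, the hypothetical helper make_chunk_reader, and the coefficient values used to generate the data. It uses the form of bigglm in which data is a function that takes a single reset argument, returning the next chunk as a data frame or NULL when no rows are left.

```r
library(biglm)

# Simulated stand-in for the SQL table (values are made up for illustration).
set.seed(42)
n <- 2000
sim <- data.frame(myvars = rnorm(n))
sim$targetvar <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$myvars))

# Hypothetical helper: wrap a data frame in a reader function of the kind
# bigglm accepts. reset = TRUE rewinds to the first row; reset = FALSE
# returns the next chunk, or NULL once all rows have been handed out.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    idx <- seq(pos + 1, min(pos + chunksize, nrow(df)))
    pos <<- pos + length(idx)
    df[idx, , drop = FALSE]
  }
}

# Fit 100 rows at a time, then fit the same model all at once with glm.
fit_chunked <- bigglm(targetvar ~ myvars,
                      data = make_chunk_reader(sim, chunksize = 100),
                      family = binomial(), maxit = 10)
fit_full <- glm(targetvar ~ myvars, data = sim, family = binomial())

# Side-by-side coefficients from the chunked and the in-memory fit.
print(cbind(chunked = coef(fit_chunked), glm = coef(fit_full)))
```

The two columns should agree closely, which is the point: chunking changes how the data is fed to the fit, not the model being estimated. The biglm documentation also describes passing an RODBC connection object directly as the data argument, which may be worth a look when the query result is too large to hold as a single data frame.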
This big data method is not difficult once the SQL Server connection is set up. Because the model is fitted one chunk at a time, you should find that R runs into far fewer memory problems while performing logistic regressions. For more information on regression modeling I recommend getting Applied Logistic Regression (Wiley Series in Probability and Statistics). This book has been an essential reference for statistics and applied regression practitioners. Other good resources are Logistic Regression Models (Chapman & Hall/CRC Texts in Statistical Science) and Modern Regression Techniques Using R: A Practical Guide.