Stepwise Regression for Big Data with RevoScaleR

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Joseph Rickert

In a recent blog post, Revolution's Thomas Dinsmore announced stepwise regression for big data as a new feature of Revolution R Enterprise 6.2, which is scheduled for general availability later this month. Today, I would like to provide a simple example of stepwise regression with rxLinMod() (the RevoScaleR analog of lm()), using a 100,000-row subset of the Million Song data set. (Look here for a description of the variables contained in the file.) This file is big enough to make RevoScaleR functionality interesting, but small enough that it can also be processed with lm() and step().

The following code runs a stepwise regression with RevoScaleR's rxLinMod() and rxStepControl() functions.

# STEPWISE REGRESSION EXAMPLE WITH rxLinMod and rxStepControl
#
# Access the data
MSdir <- "C:/Users/Joseph/Documents/DATA/Million Song/MillionSong _XDF"
fileName <- "MS_ABVZ.xdf"
songs <- file.path(MSdir, fileName)

# Look at a summary of the data
rxGetInfo(songs, getVarInfo = TRUE)

# Specify the linear model
form <- formula(duration ~ artist.hotttnesss + artist.familiarity +
                  track.7digitalid + release.7digitalid + year + mode +
                  mode.confidence + key + key.confidence + time.signature +
                  time.signature.confidence + start.of.fade.out +
                  end.of.fade.in + tempo + loudness)

# Run the linear model
system.time(mod <- rxLinMod(formula = form,
                            data = songs,
                            blocksPerRead = 10000))

summary(mod)             # Look at a summary of the model

# Specify the scope for the stepwise regression
scope <- list(
    lower = ~ loudness,
    upper = ~ artist.hotttnesss + artist.familiarity + track.7digitalid +
              release.7digitalid + year + mode + mode.confidence + key +
              key.confidence + time.signature + time.signature.confidence +
              start.of.fade.out + end.of.fade.in + tempo + loudness)

# Set up the variable selection parameter
varsel <- rxStepControl(method = "stepwise", scope = scope)

# Run the stepwise regression
system.time(rxlm.step <- rxLinMod(form, data = songs,
                                  blocksPerRead = 100000,
                                  variableSelection = varsel,
                                  verbose = 1,
                                  dropMain = FALSE,
                                  coefLabelStyle = "R"))

Notice that the output from the function rxStepControl() is used to set the variableSelection parameter of rxLinMod().
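The predictor list is written out twice in the code above, once in form and once in the upper component of scope. As a small sketch in base R (not part of the original example), the two can be generated from a single character vector with reformulate(), which keeps them in sync. The names predictors, form_demo and scope_demo are hypothetical and introduced here only for illustration; no data file is needed to run this.

```r
# Hypothetical helper names (predictors, form_demo, scope_demo) used for
# illustration only; the variable names are the ones from the model above.
predictors <- c("artist.hotttnesss", "artist.familiarity", "track.7digitalid",
                "release.7digitalid", "year", "mode", "mode.confidence",
                "key", "key.confidence", "time.signature",
                "time.signature.confidence", "start.of.fade.out",
                "end.of.fade.in", "tempo", "loudness")

# Full model formula: duration ~ <all predictors>
form_demo <- reformulate(predictors, response = "duration")

# Scope built from the same vector: loudness is always kept (lower),
# and the search never goes beyond the full predictor set (upper)
scope_demo <- list(lower = ~ loudness,
                   upper = reformulate(predictors))

print(form_demo)
print(scope_demo$upper)
```

Objects built this way should be usable wherever form and scope are used above, since they are ordinary R formulas.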

The RevoScaleR code is very similar to code one would write using lm() and step():

# Code to turn file into a data frame and run with lm and step
# Read the data from a .xdf file into a data frame
MSdf <- rxXdfToDataFrame(songs, maxRowsByCols = 4000000)
dim(MSdf)

# Fit the full model with lm()
system.time(rlm.mod <- lm(form, data = MSdf))
summary(rlm.mod)

# Run the stepwise regression with step()
system.time(rlm.step <- step(rlm.mod, direction = "both", scope = scope, trace = 1))
#   user  system elapsed
#  38.56    4.12   17.89
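For readers without access to the song file, the role of the scope list in step() can be tried on a small scale. The following is a minimal sketch that uses R's built-in mtcars data frame as a stand-in (an assumption made for illustration, not the data used in this post). In step(), terms in the lower formula are always kept in the model, and the search never adds terms outside the upper formula.

```r
# Stand-in data: the built-in mtcars data frame (not the Million Song data)
full_form <- mpg ~ wt + hp + qsec + drat + disp
full_mod  <- lm(full_form, data = mtcars)

# wt must stay in every candidate model; the remaining terms may be dropped
# or re-added as the search proceeds
scope_small <- list(lower = ~ wt,
                    upper = ~ wt + hp + qsec + drat + disp)

# direction = "both" allows terms to be dropped and added back;
# trace = 0 suppresses the step-by-step printout
step_mod <- step(full_mod, direction = "both", scope = scope_small, trace = 0)

# Terms retained by the stepwise search
kept <- attr(terms(step_mod), "term.labels")
print(formula(step_mod))
print(kept)
```

Whatever subset the search settles on, wt remains in the selected model because it appears in the lower component of the scope.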

The output from the RevoScaleR stepwise regression is included in the downloadable file Output and is similar to what lm() and step() produce. Notice, however, that step() took nearly 18 seconds of elapsed time to run, while the entire stepwise regression took only 0.16 seconds with rxLinMod(). We expect that, in general, computation time for rxLinMod() with rxStepControl() will increase linearly with the number of observations.
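One way to see why linear scaling in the number of observations is plausible: least-squares coefficients can be computed from the cross-product matrices X'X and X'y, and these can be accumulated one block of rows at a time in a single pass over the data. The sketch below illustrates that general idea in base R on simulated data. It is an assumption-laden illustration, not a description of RevoScaleR's internal implementation, and all object names in it are hypothetical.

```r
# Simulated data (hypothetical; stands in for a file read block by block)
set.seed(1)
n   <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# Split the row indices into blocks of 1,000 rows
blocks <- split(seq_len(n), ceiling(seq_len(n) / 1000))

# Accumulate X'X and X'y block by block; each row is visited exactly once
XtX <- matrix(0, 3, 3)
Xty <- matrix(0, 3, 1)
for (idx in blocks) {
  X   <- cbind(1, dat$x1[idx], dat$x2[idx])   # intercept + two predictors
  XtX <- XtX + crossprod(X)                   # adds t(X) %*% X
  Xty <- Xty + crossprod(X, dat$y[idx])       # adds t(X) %*% y
}

# Solve the normal equations and compare with lm() on the full data frame
beta_chunked <- drop(solve(XtX, Xty))
beta_lm      <- unname(coef(lm(y ~ x1 + x2, data = dat)))
print(all.equal(beta_chunked, beta_lm))
```

In this scheme the work per block is fixed for a given set of predictors, so the total cost of the pass grows in proportion to the number of rows.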
