# Stepwise Regression for Big Data with RevoScaleR

April 11, 2013
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

In a recent blog post, Revolution's Thomas Dinsmore announced stepwise regression for big data as a new feature of Revolution R Enterprise 6.2, which is scheduled for general availability later this month. Today, I would like to provide a simple example of doing stepwise regression with rxLinMod() (the RevoScaleR analog of lm()), using a 100,000-row subset of the Million Song data set. (Look here for a description of the variables contained in the file.) This file is big enough to make RevoScaleR functionality interesting, but small enough that it can also be processed with lm() and step().

The following code runs a stepwise regression with RevoScaleR's rxLinMod() and rxStepControl() functions.

# STEPWISE REGRESSION EXAMPLE WITH rxLinMod and rxStepControl
#
# Access the data
MSdir <- "C:/Users/Joseph/Documents/DATA/Million Song/MillionSong _XDF"
fileName <- "MS_ABVZ.xdf"
songs <- file.path(MSdir,fileName)
# Look at a summary of the data
rxGetInfo(songs,getVarInfo=TRUE)
# Specify the linear model
form <- formula(duration ~ artist.hotttnesss +
artist.familiarity + track.7digitalid + release.7digitalid  +
year + mode + mode.confidence + key + key.confidence + time.signature +
# Run the linear model
system.time(mod <- rxLinMod(formula = form,
data = songs,

summary(mod)             # Look at a summary of the model
# Specify the scope for the stepwise regression
scope <- list(
lower = ~ loudness,
upper = ~ artist.hotttnesss + artist.familiarity + track.7digitalid +
release.7digitalid  + year + mode + mode.confidence + key +
key.confidence + time.signature + time.signature.confidence +

# Set up the variable selection parameter
varsel <- rxStepControl(method = "stepwise", scope = scope)
# Run the stepwise regression
system.time(rxlm.step <- rxLinMod(form, data = songs,
                                  variableSelection = varsel,
                                  verbose = 1,
                                  dropMain = FALSE,
                                  coefLabelStyle = "R"))

Created by Pretty R at inside-R.org

Notice that the output from the function rxStepControl() is used to set the variableSelection parameter of rxLinMod().

The RevoScaleR code is very similar to code one would write using lm() and step():

# Code to turn file into a data frame and run with lm and step
# Read the data from a .xdf file into a data frame
MSdf <- rxXdfToDataFrame(songs, maxRowsByCols=4000000)
dim(MSdf)
#
system.time(rlm.mod <- lm(form, data = MSdf))
summary(rlm.mod)

system.time(rlm.step <- step(rlm.mod, direction = "both", scope = scope, trace = 1))
#   user  system elapsed
#  38.56    4.12   17.89

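For readers who do not have the Million Song file at hand, the lm() and step() pattern above can be tried on a built-in data set. The following is only a minimal sketch: mtcars stands in for the song data, and the names full_form, toy_scope, toy_mod and toy_step, as well as the choice of predictors, are hypothetical and not taken from the original analysis. It shows how the lower component of the scope keeps a term in the model while the upper component bounds what step() may add.

```r
# Illustrative sketch only: mtcars stands in for the song data, and the
# predictors below are an arbitrary (hypothetical) choice.
full_form <- mpg ~ wt + hp + qsec + drat      # starting model

toy_scope <- list(
  lower = ~ wt,                               # wt must always stay in the model
  upper = ~ wt + hp + qsec + drat + disp      # largest model step() may consider
)

toy_mod <- lm(full_form, data = mtcars)

# direction = "both" lets step() add and drop terms; trace = 0 silences the log
toy_step <- step(toy_mod, direction = "both", scope = toy_scope, trace = 0)

formula(toy_step)                             # the model step() settled on
attr(terms(toy_step), "term.labels")          # terms retained after selection
```

Whatever terms are selected, wt remains in the final model because it appears in the lower scope, just as loudness does in the scope defined for the song data.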

The output from the RevoScaleR stepwise regression is included in the file Output and is similar to what is produced by lm() and step(). Notice, however, that step() took nearly 18 seconds of elapsed time, while the entire stepwise regression took only 0.16 seconds with rxLinMod(). We expect that, in general, computation time for rxLinMod() with rxStepControl() will increase linearly with the number of observations.
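One way linear scaling in the number of observations can arise is sketched below in base R. It assumes, and this is an assumption rather than something stated in this post, that a least-squares fit is built from cross-product matrices accumulated chunk by chunk, so that each row is read once and the remaining work depends only on the number of predictors. The simulated data and all names (X, y, chunk_id, XtX, Xty) are hypothetical.

```r
# Sketch under stated assumptions: least squares from cross-products that are
# accumulated one chunk of rows at a time (simulated data, hypothetical names).
set.seed(42)
n <- 10000
p <- 5
X <- cbind(1, matrix(rnorm(n * p), n, p))     # design matrix with intercept column
beta_true <- c(2, 1, -1, 0.5, 0, 0)
y <- drop(X %*% beta_true + rnorm(n))

chunk_id <- ceiling(seq_len(n) / 1000)        # 10 chunks of 1,000 rows each
XtX <- matrix(0, ncol(X), ncol(X))            # running sum of X'X
Xty <- numeric(ncol(X))                       # running sum of X'y

for (k in unique(chunk_id)) {
  rows <- chunk_id == k
  Xk <- X[rows, , drop = FALSE]
  XtX <- XtX + crossprod(Xk)                  # add this chunk's X'X
  Xty <- Xty + drop(crossprod(Xk, y[rows]))   # add this chunk's X'y
}

beta_chunked <- solve(XtX, Xty)               # solve the normal equations
beta_lm <- unname(coef(lm(y ~ X - 1)))        # reference fit on all rows at once

all.equal(beta_chunked, beta_lm)
```

The loop touches each observation exactly once, so its cost grows in proportion to n, while the final solve works on a small matrix whose size depends only on the number of columns.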