by Joseph Rickert
In a recent blog post, Revolution's Thomas Dinsmore announced stepwise regression for big data as a new feature of Revolution R Enterprise 6.2, which is scheduled for general availability later this month. Today, I would like to provide a simple example of doing stepwise regression with rxLinMod() (the RevoScaleR analog of lm()), using a 100,000-row subset of the Million Song data set. (Look here for a description of the variables contained in the file.) This file is big enough to make RevoScaleR functionality interesting, but small enough that it can also be processed with lm() and step().
The following code runs a stepwise regression with RevoScaleR's rxLinMod() and rxStepControl() functions.
# STEPWISE REGRESSION EXAMPLE WITH rxLinMod and rxStepControl
#
# Access the data
MSdir <- "C:/Users/Joseph/Documents/DATA/Million Song/MillionSong _XDF"
fileName <- "MS_ABVZ.xdf"
songs <- file.path(MSdir,fileName)

# Look at a summary of the data
rxGetInfo(songs,getVarInfo=TRUE)

# Specify the linear model
form <- formula(duration ~ artist.hotttnesss + artist.familiarity +
                  track.7digitalid + release.7digitalid + year +
                  mode + mode.confidence + key + key.confidence +
                  time.signature + time.signature.confidence +
                  start.of.fade.out + end.of.fade.in + tempo + loudness)

# Run the linear model
system.time(mod <- rxLinMod(formula = form, data = songs, blocksPerRead=10000))
summary(mod)   # Look at a summary of the model

# Specify the scope for the stepwise regression
scope <- list(
  lower = ~ loudness,
  upper = ~ artist.hotttnesss + artist.familiarity +
            track.7digitalid + release.7digitalid + year +
            mode + mode.confidence + key + key.confidence +
            time.signature + time.signature.confidence +
            start.of.fade.out + end.of.fade.in + tempo + loudness)

# Set up the variable selection parameter
varsel <- rxStepControl(method = "stepwise", scope = scope)

# Run the stepwise regression
system.time(rxlm.step <- rxLinMod(form, data = songs, blocksPerRead=100000,
                                  variableSelection = varsel,
                                  verbose = 1, dropMain = FALSE,
                                  coefLabelStyle = "R"))
Notice that the output from the function rxStepControl() is used to set the variableSelection parameter of rxLinMod().
The RevoScaleR code is very similar to code one would write using lm() and step():
# Code to turn file into a data frame and run with lm and step
# Read the data from a .xdf file into a data frame
MSdf <- rxXdfToDataFrame(songs, maxRowsByCols=4000000)
dim(MSdf)
#
system.time(rlm.mod <- lm(form, data = MSdf))
summary(rlm.mod)
system.time(rlm.step <- step(rlm.mod, direction = "both", scope = scope, trace = 1))
#   user  system elapsed
#  38.56    4.12   17.89
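For readers who do not have the Million Song file at hand, the same lm() and step() pattern can be tried on a data set that ships with R. The following is only a minimal sketch: the mtcars data, the choice of predictors, and the names demo_form, demo_fit, demo_scope and demo_step are illustrative assumptions and have nothing to do with the song data or with RevoScaleR.

```r
# Illustrative only: mtcars is a small data set included with base R.
demo_form <- mpg ~ wt + hp + disp + qsec + drat
demo_fit  <- lm(demo_form, data = mtcars)

# lower: terms that must stay in the model; upper: the largest model allowed.
demo_scope <- list(lower = ~ wt,
                   upper = ~ wt + hp + disp + qsec + drat)

# Stepwise selection in both directions, starting from the full model.
# trace = 0 suppresses the step-by-step log; set trace = 1 to see it.
demo_step <- step(demo_fit, direction = "both", scope = demo_scope, trace = 0)

print(formula(demo_step))     # the selected model
print(extractAIC(demo_step))  # equivalent degrees of freedom and AIC
```

Because wt appears in the lower scope, step() never drops it, which mirrors the role loudness plays in the scope used for the song data.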
The output from the RevoScaleR stepwise regression is included in the downloadable file Output and is similar to what is produced by lm() and step(). Notice, however, that step() took nearly 18 seconds to run, while the entire stepwise regression took only 0.16 seconds with rxLinMod(). We expect that, in general, computation time for rxLinMod() with rxStepControl() will increase linearly with the number of observations.
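One way to see why linear scaling in the number of observations is plausible: a least-squares fit can be computed from the cross-products X'X and X'y, and these can be accumulated one block of rows at a time, so each row is touched once. The sketch below illustrates that idea in base R on simulated data. It is an assumption made for illustration, not a description of how rxLinMod() is actually implemented, and every name in it (sim, chunk_size, xtx, xty, chunked_coef) is hypothetical.

```r
# Simulated data; nothing here comes from the Million Song file.
set.seed(42)
n <- 10000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

# Accumulate X'X and X'y one chunk of rows at a time.
chunk_size <- 1000
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (s in seq(1, n, by = chunk_size)) {
  rows <- s:min(s + chunk_size - 1, n)
  X <- cbind(1, sim$x1[rows], sim$x2[rows])   # intercept column plus predictors
  xtx <- xtx + crossprod(X)                   # adds this chunk's X'X
  xty <- xty + crossprod(X, sim$y[rows])      # adds this chunk's X'y
}

# Solve the normal equations and compare with lm() on the full data frame.
chunked_coef <- drop(solve(xtx, xty))
names(chunked_coef) <- c("(Intercept)", "x1", "x2")
lm_coef <- coef(lm(y ~ x1 + x2, data = sim))
print(rbind(chunked = chunked_coef, lm = lm_coef))
```

The work inside the loop is proportional to the number of rows in each chunk, while the final solve depends only on the number of predictors, so doubling the rows roughly doubles the accumulation cost in this sketch.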