Variable selection using automatic methods

May 22, 2010

(This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers)

When a data set has only a small number of variables, we can easily take a manual approach to identifying a good set of variables and the form they take in our statistical model. With a large number of potentially important variables, manual variable selection soon becomes time-consuming. In that case we may consider automatic subset selection tools to remove some of the burden of the task.

There is some disagreement about whether automated selection is desirable at all, but this post focuses on the mechanics of doing it rather than on that debate.

The R package leaps has a function regsubsets that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.

In a previous post we used data on CPU performance (the cpus data set in the MASS package) to illustrate the variable selection process, and we use the same data here. We load the required packages:

> require(leaps)
> require(MASS)

First we look for the best subset of each size up to a chosen maximum, say four variables for illustrative purposes (the nvmax argument). The formula specifies the largest possible model, which in this example has six variables:

> reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)

A summary for the output from this function is shown here:

> summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, nvmax = 4)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "  
2  ( 1 ) " "  " "  "*"  "*"  " "   " "  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"

The function regsubsets identifies the variables mmin, mmax, cach and chmax as the best four.
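The printed table is convenient to read, but the same information can be pulled out of the summary object programmatically. The following is a minimal sketch, assuming the leaps and MASS packages are installed; the names s1 and best4 are introduced here for illustration and are not part of the original post.

```r
library(leaps)
library(MASS)   # provides the cpus data set

reg1 <- regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
                   data = cpus, nvmax = 4)
s1 <- summary(reg1)

# Logical matrix: one row per subset size, one column per term
s1$which

# Names of the variables in the best four-variable subset
best4 <- setdiff(names(which(s1$which[4, ])), "(Intercept)")
best4

# Criteria that can help compare the different subset sizes
data.frame(size = 1:4, adjr2 = s1$adjr2, cp = s1$cp, bic = s1$bic)

# Coefficients of the best four-variable model
coef(reg1, 4)
```

Working from s1$which rather than the printed asterisks makes it easier to pass the selected variables on to lm or another modelling function.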

Alternatively we could perform a backwards elimination and the function will indicate the best subset of a particular size, from one to six variables in this example:

> reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "backward")
> summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = "backward")
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  "*"  " "  " "  " "   " "  
2  ( 1 ) " "  "*"  " "  " "  " "   "*"  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"  
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"  
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"
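To make the mechanics of backward elimination concrete, here is a rough base R sketch of the idea: start from the full model and repeatedly drop the variable whose removal increases the residual sum of squares the least. This is an illustration only and not how leaps implements the search; the helper rss and the object path are hypothetical names. On the cpus data it should trace the same sequence of subsets as the table above.

```r
library(MASS)   # provides the cpus data set

vars <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")

# Residual sum of squares for a linear model of perf on the given variables
rss <- function(v) {
  fit <- lm(reformulate(v, response = "perf"), data = cpus)
  sum(resid(fit)^2)
}

current <- vars
path <- list(current)           # path[[1]] is the full six-variable model
while (length(current) > 1) {
  # RSS obtained after dropping each remaining variable in turn
  rss_drop <- sapply(current, function(v) rss(setdiff(current, v)))
  # Remove the variable whose removal increases the RSS the least
  current <- setdiff(current, names(which.min(rss_drop)))
  path <- c(path, list(current))
}

path   # subsets of size 6, 5, ..., 1
```

Because each step only considers dropping one variable from the current subset, the subsets are nested, which is why backward elimination can miss the best subset of a given size.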

For this example the four-variable subset is the same as the one found by the best subsets approach. The third approach is forward selection:

> reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "forward")
> summary(reg3)

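For this data set, which has only six candidate variables, best subsets and backward elimination select the same four-variable model. The output above does show some divergence at smaller sizes: the best single variable is mmax, whereas backward elimination ends with mmin.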
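Rather than comparing the printed tables by eye, the selections can be compared directly. The sketch below fits all three methods and reports, for each subset size, whether they choose the same variables. It assumes the leaps and MASS packages are installed; the names sel and agree are introduced here for illustration.

```r
library(leaps)
library(MASS)   # provides the cpus data set

vars <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")
f <- reformulate(vars, response = "perf")

methods <- c("exhaustive", "backward", "forward")
sel <- lapply(methods, function(m) {
  fit <- regsubsets(f, data = cpus, nvmax = 6, method = m)
  # Selection matrix without the intercept, columns in a fixed order
  summary(fit)$which[, vars]
})
names(sel) <- methods

# For each subset size, do all three methods pick the same variables?
agree <- sapply(seq_along(vars), function(k) {
  identical(sel$exhaustive[k, ], sel$backward[k, ]) &&
    identical(sel$exhaustive[k, ], sel$forward[k, ])
})
data.frame(size = seq_along(vars), all_agree = agree)
```

The six-variable row always agrees, since every method then includes all of the variables; the interesting rows are the smaller sizes.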
