(This article was first published on **Software for Exploratory Data Analysis and Statistical Modelling**, and kindly contributed to R-bloggers)

When a data set has only a small number of variables we can easily identify a good set of variables, and the form they should take in our statistical model, by hand. In other situations we may have a large number of potentially important variables, and manual variable selection soon becomes a time-consuming effort. In this case we may consider using automatic subset selection tools to remove some of the burden of the task.

It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.

The **R** package **leaps** has a function **regsubsets** that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.
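To make the idea of a *best* subset concrete, here is a minimal brute-force sketch in base **R**. It assumes the **cpus** data from **MASS** (the data set used in the rest of this post) and takes *best* to mean the smallest residual sum of squares among models of the same size. The helper name `best_of_size` is made up for illustration, and **regsubsets** searches far more efficiently than this loop does.

```r
library(MASS)  # provides the cpus data set

predictors <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")

# For a given subset size, fit every combination of predictors and
# keep the one with the smallest residual sum of squares.
# (best_of_size is a hypothetical helper, not part of leaps.)
best_of_size <- function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  rss <- sapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = "perf"), data = cpus)
    sum(resid(fit)^2)
  })
  combos[[which.min(rss)]]
}

# Best subset of each size from one to four variables
best_subsets <- lapply(1:4, best_of_size)
best_subsets
```

With six candidate variables this fits only a few dozen models, but the number of combinations grows quickly as variables are added, which is why a dedicated search routine is worth having.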

In a previous post we used data on CPU performance (the **cpus** data frame in **MASS**) to illustrate the variable selection process, and we use the same data here. We load the required packages:

    > require(leaps)
    > require(MASS)

First up we consider selecting the *best* subset of each size up to a chosen maximum, say four variables for illustrative purposes (the **nvmax** argument). We specify the largest possible model, which in this example has six variables:

    > reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, nvmax = 4)

A summary for the output from this function is shown here:

    > summary(reg1)
    Subset selection object
    Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin +
        chmax, data = cpus, nvmax = 4)
    6 Variables  (and intercept)
          Forced in Forced out
    syct      FALSE      FALSE
    mmin      FALSE      FALSE
    mmax      FALSE      FALSE
    cach      FALSE      FALSE
    chmin     FALSE      FALSE
    chmax     FALSE      FALSE
    1 subsets of each size up to 4
    Selection Algorithm: exhaustive
             syct mmin mmax cach chmin chmax
    1  ( 1 ) " "  " "  "*"  " "  " "   " "
    2  ( 1 ) " "  " "  "*"  "*"  " "   " "
    3  ( 1 ) " "  "*"  "*"  " "  " "   "*"
    4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"

The function **regsubsets** identifies the variables **mmin**, **mmax**, **cach** and **chmax** as the *best* four.
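The printed table is convenient to read, but the same information can be pulled out of the summary object for further work. The sketch below refits the model from this section and shows the components one might look at first. The name `reg1_summary` is just a label chosen here.

```r
library(leaps)
library(MASS)

reg1 <- regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
                   data = cpus, nvmax = 4)
reg1_summary <- summary(reg1)

# Logical matrix: one row per subset size, one column per term
reg1_summary$which

# Criteria that can be used to compare the subset sizes with each other
data.frame(size  = 1:4,
           adjr2 = reg1_summary$adjr2,
           cp    = reg1_summary$cp,
           bic   = reg1_summary$bic)

# Coefficients of the best four-variable model
coef(reg1, 4)
```

The search itself only ranks models of the same size, so criteria such as adjusted R-squared, Mallows' Cp or BIC are what one would typically use to choose between sizes.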

Alternatively we could perform a backwards elimination and the function will indicate the *best* subset of a particular size, from one to six variables in this example:

    > reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "backward")
    > summary(reg2)
    Subset selection object
    Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin +
        chmax, data = cpus, method = "backward")
    6 Variables  (and intercept)
          Forced in Forced out
    syct      FALSE      FALSE
    mmin      FALSE      FALSE
    mmax      FALSE      FALSE
    cach      FALSE      FALSE
    chmin     FALSE      FALSE
    chmax     FALSE      FALSE
    1 subsets of each size up to 6
    Selection Algorithm: backward
             syct mmin mmax cach chmin chmax
    1  ( 1 ) " "  "*"  " "  " "  " "   " "
    2  ( 1 ) " "  "*"  " "  " "  " "   "*"
    3  ( 1 ) " "  "*"  "*"  " "  " "   "*"
    4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"
    5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"
    6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"

The subset of four variables is the same for this example as with the *best* subsets approach. The third approach is forward selection:

    > reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "forward")
    > summary(reg3)

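For this data set, as there are only six variables, the different methods arrive at the same subset of four variables. The best subsets and backward elimination output above does differ for the one- and two-variable subsets, and divergence of this kind becomes more likely as the number of candidate variables grows.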
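Rather than comparing the printed tables by eye, the subsets chosen by each method can be lined up programmatically. This is a minimal sketch: `selected_vars` is a hypothetical helper written for this post, and it relies on the `which` matrix returned by **summary** having one row per subset size, which is the case when a single subset is kept for each size.

```r
library(leaps)
library(MASS)

full_formula <- perf ~ syct + mmin + mmax + cach + chmin + chmax

# Return the names of the variables in the subset of a given size
# (selected_vars is a hypothetical helper, not part of leaps)
selected_vars <- function(method, size) {
  fit <- regsubsets(full_formula, data = cpus, method = method)
  included <- summary(fit)$which[size, ]
  setdiff(names(included)[included], "(Intercept)")
}

methods <- c("exhaustive", "forward", "backward")

# Named list with the four-variable subset chosen by each method
chosen <- sapply(methods, selected_vars, size = 4, simplify = FALSE)
chosen
```

Changing `size` to 1 or 2 shows where the search strategies start to part company.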
