Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at allrelevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.
The caret package provides a very flexible framework for the analysis as we shall see, but first we set up the artificial test data set as in the previous article.
## Featurebc.R  Compare Boruta and caret feature selection ## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/) run.name < "featurebc" library("caret") ## Load early to get the warnings out of the way: library("randomForest") library("ipred") library("gbm") set.seed(1) ## Set up artificial test data for our analysis n.var < 20 n.obs < 200 x < data.frame(V = matrix(rnorm(n.var*n.obs), n.obs, n.var)) n.dep < floor(n.var/5) cat( "Number of dependent variables is", n.dep, "\n") m < diag(n.dep:1) ## These are our four test targets y.1 < factor( ifelse( x$V.1 >= 0, 'A', 'B' ) ) y.2 < ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" ) y.2 < factor(y.2) y.3 < factor(rowSums(x[, 1:n.dep] >= 0)) y.4 < factor(rowSums(x[, 1:n.dep] >= 0) %% 2)
The flexibility of the caret package is to a large extent implemented by using control objects. Here we specify to use the randomForest
classification algorithm (which is also what Boruta uses) and if the multicore package is available then we use that for extra perfomance (you can also use MPI etc – see the documentation):
control < rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE, returnResamp = "final", number = 50) if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) { control$workers < multicore:::detectCores() control$computeFunction < mclapply control$computeArgs < list(mc.preschedule = FALSE, mc.set.seed = FALSE) }
We will consider from one to six features (using the sizes
variable) and then we simply let it lose:
sizes < 1:6 ## Use randomForest for prediction profile.1 < rfe(x, y.1, sizes = sizes, rfeControl = control) cat( "rf : Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 < rfe(x, y.2, sizes = sizes, rfeControl = control) cat( "rf : Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 < rfe(x, y.3, sizes = sizes, rfeControl = control) cat( "rf : Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 < rfe(x, y.4, sizes = sizes, rfeControl = control) cat( "rf : Profile 4 predictors:", predictors(profile.4), fill = TRUE )
The results are:
rf : Profile 1 predictors: V.1 V.16 V.6 rf : Profile 2 predictors: V.1 V.2 rf : Profile 3 predictors: V.4 V.1 V.2 rf : Profile 4 predictors: V.10 V.11 V.7
If you recall the feature selection with Boruta article, then the results there were
 Profile 1:
V.1, (V.16, V.17)
 Profile 2:
V.1, V.2, V,3, (V.8, V.9, V.4)
 Profile 3:
V.1, V.4, V.3, V.2, (V.7, V.6)
 Profile 4:
V.10, (V.11, V.13)
To show the flexibility of caret, we can run the analysis with another of the builtin classifiers:
## Use ipred::ipredbag for prediction control$functions < treebagFuncs profile.1 < rfe(x, y.1, sizes = sizes, rfeControl = control) cat( "treebag: Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 < rfe(x, y.2, sizes = sizes, rfeControl = control) cat( "treebag: Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 < rfe(x, y.3, sizes = sizes, rfeControl = control) cat( "treebag: Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 < rfe(x, y.4, sizes = sizes, rfeControl = control) cat( "treebag: Profile 4 predictors:", predictors(profile.4), fill = TRUE )
This gives:
treebag: Profile 1 predictors: V.1 V.16 treebag: Profile 2 predictors: V.2 V.1 treebag: Profile 3 predictors: V.1 V.3 V.2 treebag: Profile 4 predictors: V.10 V.11 V.1 V.7 V.13
And of course, if you have your own favourite model class that is not already implemented, then you can easily do that yourself. We like gbm
from the package of the same name, which is kind of silly to use here because it provides variable importance automatically as part of the fitting process, but may still be useful. It needs numeric predictors so we do:
## Use gbm for prediction y.1 < as.numeric(y.1)1 y.2 < as.numeric(y.2)1 y.3 < as.numeric(y.3)1 y.4 < as.numeric(y.4)1 gbmFuncs < treebagFuncs gbmFuncs$fit < function (x, y, first, last, ...) { library("gbm") n.levels < length(unique(y)) if ( n.levels == 2 ) { distribution = "bernoulli" } else { distribution = "gaussian" } gbm.fit(x, y, distribution = distribution, ...) } gbmFuncs$pred < function (object, x) { n.trees < suppressWarnings(gbm.perf(object, plot.it = FALSE, method = "OOB")) if ( n.trees <= 0 ) n.trees < object$n.trees predict(object, x, n.trees = n.trees, type = "link") } control$functions < gbmFuncs n.trees < 1e2 # Default value for gbm is 100 profile.1 < rfe(x, y.1, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 < rfe(x, y.2, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 < rfe(x, y.3, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 < rfe(x, y.4, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 4 predictors:", predictors(profile.4), fill = TRUE )
And we get the results below:
gbm : Profile 1 predictors: V.1 V.10 V.11 V.12 V.13 gbm : Profile 2 predictors: V.1 V.2 gbm : Profile 3 predictors: V.4 V.1 V.2 V.3 V.7 gbm : Profile 4 predictors: V.11 V.10 V.1 V.6 V.7 V.18
It is all good and very flexible, for sure, but I can’t really say it is better than the Boruta approach for these simple examples.
Jump to comments.
You may also like these posts:

Benchmarking feature selection with Boruta and caret
Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.

Feature selection: Allrelevant selection with the Boruta package
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimaloptimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the allrelevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of allrelevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…

R code for Chapter 1 of NonLife Insurance Pricing with GLM
Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is NonLife Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.

R code for Chapter 2 of NonLife Insurance Pricing with GLM
We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as NonLife Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK  US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the underappreciated read.fwf function.
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...