Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us.
Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches.
For our tests we will use an artificially constructed trivial data sets that the automated process should have no problems with (but we will be disappointed later on this expectation, as we will see). The data set has an equal number of normal and uniform random variables with mean 0 and variance 1 of which 20% are used for the target variable. There are 10 time as many observations as variables. We create a function to set this up:
make.data < function (n.var, m.rand = 5, m.obs = 10) { n.col < n.var * m.rand n.obs < n.col * m.obs * 2 x < data.frame(N = matrix(rnorm(n = n.col*n.obs), nrow = n.obs, ncol = n.col), U = matrix(runif(n = n.col*n.obs, min = sqrt(3), max = sqrt(3)), n.obs, n.col)) deps.n < 1:n.var deps.u < (1+n.col):(n.var+n.col) y < rowSums(as.matrix(x[, c(deps.n, deps.u)])) x < cbind(x, Y = factor(y >= 0, labels=c("N", "P"))) attr(x, "vars") < names(x)[c(deps.n, deps.u)] return(x) }
The Boruta package
Then we run a test using the Boruta package for different sizes:
#!/usr/bin/Rscript ## bench.R  benchmark Boruta package ## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/) run.name < "bench1" library("Boruta") set.seed(1) sizes < c(1:10, 10*(2:10), 100*(2:10), 1e3*(2:10)) n.sizes < length(sizes) bench < data.frame(n.vars = sizes, elapsed = NA, right = NA, wrong = NA) file.name < paste(run.name, "RData", sep = ".") for (n in 1:length(sizes)) { size < sizes[n] cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size)) tries < max(3, round(10/size, 0)) n.right < 0 n.wrong < 0 elapsed < 0 for (try in 1:tries) { cat(triestry, ".", sep = "") x < make.data(size) x.vars < attr(x, "vars") elapsed < elapsed + system.time({b < Boruta(x[,NCOL(x)], x[,NCOL(x)])} )["elapsed"] b.vars < names(b$finalDecision)[b$finalDecision!="Rejected"] n.right < n.right + length(intersect(b.vars, x.vars)) n.wrong < n.wrong + length(setdiff(b.vars, x.vars)) } elapsed < elapsed / tries cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "") n.right < n.right / tries n.wrong < n.wrong / tries bench[n, ] < c(size, elapsed, n.right, n.wrong) save(bench, file = file.name, ascii = FALSE, compress = FALSE) } print(bench)
As it turned out, our expectations for the size of data set we could handle were wildly optimistic and we killed the process at size 30. We add to the data set a field with the total number of variables in the x
data set and plot the results.
load(file = "bench1.RData") bench < na.omit(bench) bench$n.elem < bench$n.var^2 * 1e3 plot(elapsed ~ n.elem, data = bench, type = "b", main = "Feature selections with Boruta", sub = "Elapsed time versus number of data elements", log = "xy", xlab = "Elements in data set", ylab = "Elapsed time (seconds)")
A quick check using summary(lm(log(elapsed) ~ log(n.elem), data = bench))
shows us a linear scaling with the number of elements (slope is 1.01 with standard error 0.025 and adjusted R² 0.993). The algorithm selects all the right features up to n.vars = 10
when it starts to miss some of them:
n.vars  right  wrong 

1  2.00000  1.1000000 
2  4.00000  1.2000000 
3  6.00000  1.6666667 
4  8.00000  1.3333333 
5  10.00000  1.6666667 
6  12.00000  1.3333333 
7  14.00000  1.0000000 
8  16.00000  1.3333333 
9  18.00000  0.6666667 
10  20.00000  0.3333333 
20  39.33333  0.0000000 
30  56.33333  0.0000000 
A higher accuracy in the feature selection for the larger problems could presumably be achieved by adjusting the maxRuns
and perhaps confidence
parameters on the Boruta
call.
In summary, the Boruta package performs well up to about 20 features out of 100 (n.vars = 10
) which runs in about 11 minutes on my machine. If we changed the technical implementation to support multicore, MPI, and other parallel frameworks, then the out of the box settings would be useful up to n.vars
of 20 or 30 (4060 features out of 200300) which an 8core machine should be able to complete in 20 minutes or so.
This is still a lot less than the size of data sets we normally work with. (Our usual benchmark is 15,000 variables and 50,000 observations.)
The caret package
One of the nice features of the caret package is that is supports most parallel processing frameworks out of the box, but for comparison with the previous analysis we will (somewhat unfairly) not use that here. The setup is then quite simple, using the same make.data
function as before.
#!/usr/bin/Rscript ## bench.R  benchmark caret package ## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/) run.name < "bench2" library("caret") library("randomForest") set.seed(1) control < rfeControl(functions = rfFuncs, verbose = FALSE, returnResamp = "final") ## if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) { ## control$workers < multicore:::detectCores() ## control$computeFunction < mclapply ## control$computeArgs < list(mc.preschedule = FALSE, mc.set.seed = FALSE) ## } our.sizes < c(2:10, 10*(2:10), 100*(2:10), 1e3*(2:10)) n.sizes < length(our.sizes) bench < data.frame(n.vars = our.sizes, elapsed = NA, right = NA, wrong = NA) file.name < paste(run.name, "RData", sep = ".") for (n in 1:length(our.sizes)) { size < our.sizes[n] cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size)) tries < max(3, round(10/size, 0)) n.right < 0 n.wrong < 0 elapsed < 0 for (try in 1:tries) { cat(triestry, ".", sep = "") x < make.data(size) x.vars < attr(x, "vars") elapsed < elapsed + system.time({p < rfe(x[,NCOL(x)], x[,NCOL(x)], sizes = 1:(2*size), rfeControl = control)} )["elapsed"] p.vars < predictors(p) n.right < n.right + length(intersect(p.vars, x.vars)) n.wrong < n.wrong + length(setdiff(p.vars, x.vars)) } elapsed < elapsed / tries cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "") n.right < n.right / tries n.wrong < n.wrong / tries bench[n, ] < c(size, elapsed, n.right, n.wrong) save(bench, file = file.name, ascii = FALSE, compress = FALSE) } print(bench)
This uses the randomForest
classifier from the package of the same name. To use the ipredbagg
bagging classifier from Andrea Peters and Torsten Hothorn's ipred: Improved Predictors package we simply change the control
object to:
control < rfeControl(functions = treebagFuncs, verbose = FALSE, returnResamp = "final")
As usual, we were widely optimistic in our guesses for the size of problems we could handle, and had to abort the run.
n.vars  right  wrong 

2  3.20000  3.200000 
3  5.00000  0.000000 
4  7.00000  0.000000 
5  9.00000  0.000000 
6  11.00000  0.000000 
7  13.00000  0.000000 
8  14.66667  0.000000 
9  16.66667  0.000000 
10  19.00000  0.000000 
20  38.66667  1.333333 
30  54.00000  86.000000 
n.vars  right  wrong 

2  3.00000  0 
3  5.00000  0 
4  7.00000  0 
5  9.00000  0 
6  10.33333  0 
7  13.00000  0 
8  14.33333  0 
9  16.00000  0 
10  18.66667  0 
20  35.33333  0 
30  54.33333  0 
40  69.66667  0 
Remember that the right number of significant features are 2 * n.vars
and we see that the caret package apparently always miss one feature in its selection, which is very odd and possibly a bug. It is less likely to select the wrong features than Boruta, but that could be partially due to "Tentative" data in Boruta. Timingwise, performance is a little worse in the nonparallel situation but realistically of course a lot better than Boruta depending on the number of cores on your processor or nodes in your cluster.
Neither approach is suitable out of the box for the sizes of data sets that we normally work with.
Jump to comments.
You may also like these posts:

Feature selection: Allrelevant selection with the Boruta package
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimaloptimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the allrelevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of allrelevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…

Feature selection: Using the caret package
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at allrelevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE100 constituent companies and find that the relation still holds four years later and across a continent.

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …

Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show yo…
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...