Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us.

Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches.

For our tests we will use an artificially constructed trivial data sets that the automated process should have no problems with (but we will be disappointed later on this expectation, as we will see). The data set has an equal number of normal and uniform random variables with mean 0 and variance 1 of which 20% are used for the target variable. There are 10 time as many observations as variables. We create a function to set this up:

```make.data <- function (n.var, m.rand = 5, m.obs = 10) {
n.col <- n.var * m.rand
n.obs <- n.col * m.obs * 2
x <- data.frame(N = matrix(rnorm(n = n.col*n.obs),
nrow = n.obs, ncol = n.col),
U = matrix(runif(n = n.col*n.obs,
min = -sqrt(3), max = sqrt(3)), n.obs, n.col))
deps.n <- 1:n.var
deps.u <- (1+n.col):(n.var+n.col)
y <- rowSums(as.matrix(x[, c(deps.n, deps.u)]))
x <- cbind(x, Y = factor(y >= 0, labels=c("N", "P")))
attr(x, "vars") <- names(x)[c(deps.n, deps.u)]
return(x)
}
```

The Boruta package

Then we run a test using the Boruta package for different sizes:

```#!/usr/bin/Rscript
## bench.R - benchmark Boruta package
run.name <- "bench-1"
library("Boruta")

set.seed(1)

sizes <- c(1:10, 10*(2:10), 100*(2:10), 1e3*(2:10))
n.sizes <- length(sizes)
bench <- data.frame(n.vars = sizes, elapsed = NA, right = NA, wrong = NA)
file.name <- paste(run.name, "RData", sep = ".")

for (n in 1:length(sizes)) {
size <- sizes[n]
cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))
tries <- max(3, round(10/size, 0))
n.right <- 0
n.wrong <- 0
elapsed <- 0
for (try in 1:tries) {
cat(tries-try, ".", sep = "")
x <- make.data(size)
x.vars <- attr(x, "vars")
elapsed <- elapsed +
system.time({b <- Boruta(x[,-NCOL(x)], x[,NCOL(x)])}
)["elapsed"]
b.vars <- names(b\$finalDecision)[b\$finalDecision!="Rejected"]
n.right <- n.right + length(intersect(b.vars, x.vars))
n.wrong <- n.wrong + length(setdiff(b.vars, x.vars))
}
elapsed <- elapsed / tries
cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")
n.right <- n.right / tries
n.wrong <- n.wrong / tries
bench[n, ] <- c(size, elapsed, n.right, n.wrong)
save(bench, file = file.name, ascii = FALSE, compress = FALSE)
}

print(bench)
```

As it turned out, our expectations for the size of data set we could handle were wildly optimistic and we killed the process at size 30. We add to the data set a field with the total number of variables in the `x` data set and plot the results.

```load(file = "bench-1.RData")
bench <- na.omit(bench)
bench\$n.elem <- bench\$n.var^2 * 1e3
plot(elapsed ~ n.elem, data = bench, type = "b",
main = "Feature selections with Boruta",
sub = "Elapsed time versus number of data elements",
log = "xy",
xlab = "Elements in data set", ylab = "Elapsed time (seconds)")
```

Benchmarking results for feature selection with Boruta package shows linear scaling (slope is 1.01 with standard error 0.025 and adjusted R² 0.993)

A quick check using `summary(lm(log(elapsed) ~ log(n.elem), data = bench))` shows us a linear scaling with the number of elements (slope is 1.01 with standard error 0.025 and adjusted R² 0.993). The algorithm selects all the right features up to `n.vars = 10` when it starts to miss some of them:

Benchmark results for Boruta package
n.varsrightwrong
12.000001.1000000
24.000001.2000000
36.000001.6666667
48.000001.3333333
510.000001.6666667
612.000001.3333333
714.000001.0000000
816.000001.3333333
918.000000.6666667
1020.000000.3333333
2039.333330.0000000
3056.333330.0000000

A higher accuracy in the feature selection for the larger problems could presumably be achieved by adjusting the `maxRuns` and perhaps `confidence` parameters on the `Boruta` call.

In summary, the Boruta package performs well up to about 20 features out of 100 (`n.vars = 10`) which runs in about 11 minutes on my machine. If we changed the technical implementation to support multicore, MPI, and other parallel frameworks, then the out of the box settings would be useful up to `n.vars` of 20 or 30 (40-60 features out of 200-300) which an 8-core machine should be able to complete in 20 minutes or so.

This is still a lot less than the size of data sets we normally work with. (Our usual benchmark is 15,000 variables and 50,000 observations.)

The caret package

One of the nice features of the caret package is that is supports most parallel processing frameworks out of the box, but for comparison with the previous analysis we will (somewhat unfairly) not use that here. The setup is then quite simple, using the same `make.data` function as before.

```#!/usr/bin/Rscript
## bench.R - benchmark caret package
run.name <- "bench-2"
library("caret")
library("randomForest")
set.seed(1)

control <- rfeControl(functions = rfFuncs, verbose = FALSE,
returnResamp = "final")

## if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {
##     control\$workers <- multicore:::detectCores()
##     control\$computeFunction <- mclapply
##     control\$computeArgs <- list(mc.preschedule = FALSE, mc.set.seed = FALSE)
## }

our.sizes <- c(2:10, 10*(2:10), 100*(2:10), 1e3*(2:10))
n.sizes <- length(our.sizes)
bench <- data.frame(n.vars = our.sizes, elapsed = NA, right = NA, wrong = NA)
file.name <- paste(run.name, "RData", sep = ".")

for (n in 1:length(our.sizes)) {
size <- our.sizes[n]
cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))
tries <- max(3, round(10/size, 0))
n.right <- 0
n.wrong <- 0
elapsed <- 0
for (try in 1:tries) {
cat(tries-try, ".", sep = "")
x <- make.data(size)
x.vars <- attr(x, "vars")
elapsed <- elapsed +
system.time({p <- rfe(x[,-NCOL(x)], x[,NCOL(x)],
sizes = 1:(2*size), rfeControl = control)}
)["elapsed"]
p.vars <- predictors(p)
n.right <- n.right + length(intersect(p.vars, x.vars))
n.wrong <- n.wrong + length(setdiff(p.vars, x.vars))
}
elapsed <- elapsed / tries
cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")
n.right <- n.right / tries
n.wrong <- n.wrong / tries
bench[n, ] <- c(size, elapsed, n.right, n.wrong)
save(bench, file = file.name, ascii = FALSE, compress = FALSE)
}

print(bench)
```

This uses the `randomForest` classifier from the package of the same name. To use the `ipredbagg` bagging classifier from Andrea Peters and Torsten Hothorn's ipred: Improved Predictors package we simply change the `control` object to:

```control <- rfeControl(functions = treebagFuncs, verbose = FALSE,
returnResamp = "final")
```

As usual, we were widely optimistic in our guesses for the size of problems we could handle, and had to abort the run.

Benchmarking results for feature selection with caret package using randomForest classifier (slope is 1.17 with standard error 0.024 and adjusted R² 0.996)

Benchmarking results for feature selection with caret package using treebag classifier shows non-power behaviour (nevertheless, a linear log-log fit gives a slope of 1.12 with standard error 0.067 and adjusted R² 0.96)

Benchmark results for caret package using randomForest classifier
n.varsrightwrong
23.200003.200000
35.000000.000000
47.000000.000000
59.000000.000000
611.000000.000000
713.000000.000000
814.666670.000000
916.666670.000000
1019.000000.000000
2038.666671.333333
3054.0000086.000000
Benchmark results for caret package using ipredbagg classifier
n.varsrightwrong
23.000000
35.000000
47.000000
59.000000
610.333330
713.000000
814.333330
916.000000
1018.666670
2035.333330
3054.333330
4069.666670

Remember that the right number of significant features are `2 * n.vars` and we see that the caret package apparently always miss one feature in its selection, which is very odd and possibly a bug. It is less likely to select the wrong features than Boruta, but that could be partially due to "Tentative" data in Boruta. Timing-wise, performance is a little worse in the non-parallel situation but realistically of course a lot better than Boruta depending on the number of cores on your processor or nodes in your cluster.

Neither approach is suitable out of the box for the sizes of data sets that we normally work with.