Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I have calculated Banzhaf power index. I required generation of all subsets of a given set. The code given there was a bit complex and I have decided to write a simple function calculating it. As an example of its application I reproduce Figure 3.5 from Hastie et al. (2009).
The figure shows RSS for all possible linear regressions for prostate cancer data on training subset. The standard approach for such a problem in R is to use leaps package, but I simply wanted to test my function generating all subsets of the set.

Here is the code with all.subsets function generating all subsets and its application to prostate cancer data:

library(plyr)
library(ggplot2)

all.subsets <- function(set) {
n <- length(set)
bin <- expand.grid(rlply(n, c(F, T)))
mlply(bin, function() { set[c()] })
}

file.url <- “http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data”
varlist <- all.subsets(names(data.set)[2:9])

get.reg <- function(vars) {
if (length(vars) == 0) {
vars = “1”
}
vars.form <- paste(“lpsa ~”, paste(vars, collapse = ” + “))
lm(vars.form, data = data.set, subset = train)
}

models <- llply(varlist, get.reg)
c(RSS = sum(x\$residuals ^ 2), k = length(x\$coeff)) })

min.models <- ddply(models.RSS, .(k), function(x) {
xlab = “Subset Size k”, ylab = “ResidualSum-of-Squares”,) +
geom_point(data = min.models, aes(x = k, y = RSS), colour = “red”) +
geom_line(data = min.models, aes(x = k, y = RSS), colour = “red”) +
theme_bw()

And here is the plot it generates: