Comparing model selection methods

December 2, 2011

The standard textbook analysis of model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate the in-sample, conditional or expected test error. However, another interesting question is how these methods compare in their ability to select the true model.
To test this I generate data following the process y = x1 + ε and compare two linear models. In the first, only the intercept and the coefficient on x1 are estimated; in the second, a pure noise variable x2 is added as an extra regressor.

I decided to compare 5-fold cross-validation, adjusted R-squared and a validation sample (with a 70/30 split) by how often they select the true model, i.e. the one without x2. I ran the test for sample sizes of 10, 100, 1,000 and 10,000. Here is my code:

library(boot)
set.seed(1)

decision <- function(n) {
      # simulate data from the true process y = x1 + noise; x2 is pure noise
      x1 <- runif(n)
      x2 <- runif(n)
      y <- x1 + rnorm(n)
      # 70/30 split used later by the validation sample method
      train.size <- round(0.7 * n)
      data.set <- data.frame(x1, x2, y)
      train.data <- data.set[1:train.size, ]
      validation.data <- data.set[(train.size + 1):n, ]
      # model 1 is the true model; model 2 adds the noise variable x2
      formulas <- list(y ~ x1, y ~ x1 + x2)
      cv <- arsq <- valid <- list()
      for (i in 1:2) {
            # 5-fold cross-validation estimate of prediction error
            cv[[i]] <- cv.glm(data.set, glm(formulas[[i]], data = data.set),
                              K = 5)$delta[1]
            # adjusted R-squared computed on the full sample
            arsq[[i]] <- summary(lm(formulas[[i]],
                                    data = data.set))$adj.r.squared
            # mean squared prediction error on the validation sample
            valid.lm <- lm(formulas[[i]], data = train.data)
            valid[[i]] <- mean((predict(valid.lm, validation.data)
                           - validation.data$y)^2)
      }
      # For CV and the validation sample lower error is better, so TRUE means
      # the true model (model 1) wins; for adjusted R-squared higher is better,
      # so TRUE means the model with x2 wins.
      return(c(cv[[1]] < cv[[2]],
               arsq[[1]] < arsq[[2]],
               valid[[1]] < valid[[2]]))
}

correct <- function(n) {
      # share of TRUE decisions over 2000 simulated data sets of size n
      rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID")
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

           10    100   1000  10000
CV     0.7260 0.6430 0.6585 0.6405
adjR2  0.3430 0.3265 0.3425 0.3035
VALID  0.6195 0.6275 0.6220 0.6240

Cross-validation performs best, although its accuracy deteriorates with sample size. The validation sample approach is a bit worse and its performance hardly changes with sample size. The adjusted R-squared row has to be read the other way round: because higher adjusted R-squared is better, the code records how often the model with x2 wins, so values of about 0.33 mean that adjusted R-squared prefers the true model roughly two thirds of the time. This matches the standard result that adding a regressor increases adjusted R-squared exactly when its |t| statistic exceeds 1, which for a pure noise variable happens with probability of about 0.32.
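
As a quick check of the |t| > 1 property (not part of the original analysis; just a minimal sketch reusing the same simulated design), one can compare the change in adjusted R-squared with the t statistic of x2 on a single simulated data set:

# Minimal check (assumes the same data-generating process as above):
# adding x2 increases adjusted R-squared exactly when its |t| statistic
# exceeds 1, so the two flags below always agree.
set.seed(2)
n <- 100
x1 <- runif(n); x2 <- runif(n); y <- x1 + rnorm(n)
fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)
adj.up <- summary(fit2)$adj.r.squared > summary(fit1)$adj.r.squared
t.x2 <- coef(summary(fit2))["x2", "t value"]
c(adjR2.increased = adj.up, abs.t.above.1 = abs(t.x2) > 1)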

What I find interesting is that increasing the sample size does not lead to sure selection of the true model by the cross-validation and validation sample methods. Admittedly, for large samples the estimated coefficient on x2 is near 0, so the mistake is not very costly, but I did not expect such results.
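
To illustrate this last point, here is a small sketch (again not from the original post, and reusing the same data-generating process) that fits the larger model on one big simulated sample; the estimated coefficient on x2 is typically within a few hundredths of zero:

# Illustration (assumes the same data-generating process as above): with a
# large sample the estimated coefficient on the noise variable x2 is very
# close to zero, so choosing the larger model costs little in prediction terms.
set.seed(3)
n <- 10000
x1 <- runif(n); x2 <- runif(n); y <- x1 + rnorm(n)
coef(lm(y ~ x1 + x2))["x2"]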
