Comparing model selection methods


The standard textbook analysis of different model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate the in-sample, conditional or expected test error. However, another interesting question is how well they are able to select the true model.
To test this I generate data following the process y = x1 + ε and compare two linear models. In the first one only the intercept and the coefficient of x1 are estimated; in the second a pure-noise variable x2 is added.

I decided to compare 5-fold Cross-Validation, Adjusted R-squared and a Validation Sample (with a 70/30 split) by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test for initial sample sizes of 10, 100, 1 000 and 10 000. Here goes my code:

library(boot)
set.seed(1)

decision <- function(n) {
      # simulate one data set from the true process y = x1 + noise;
      # x2 is pure noise and does not enter the true model
      x1 <- runif(n)
      x2 <- runif(n)
      y <- x1 + rnorm(n)
      train.size <- round(0.7 * n)
      data.set <- data.frame(x1, x2, y)
      train.data <- data.set[1 : train.size,]
      validation.data <- data.set[(train.size + 1) : n,]
      formulas <- list(y ~ x1, y ~ x1 + x2)
      cv <- arsq <- valid <- list()
      for (i in 1:2) {
            # 5-fold cross-validation estimate of prediction error
            cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]),
                              K = 5)$delta[1]
            # adjusted R-squared computed on the full sample
            arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
            # mean squared prediction error on the 30% validation sample
            valid.lm <- lm(formulas[[i]], data = train.data)
            valid[[i]] <- mean((predict(valid.lm, validation.data) -
                                validation.data$y)^2)
      }
      return(c(cv[[1]] < cv[[2]],
               arsq[[1]] < arsq[[2]],
               valid[[1]] < valid[[2]]))
}

correct <- function(n) {
      # average the decision vector over 2000 simulated data sets
      rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID")
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

           10    100   1000  10000
CV     0.7260 0.6430 0.6585 0.6405
adjR2  0.3430 0.3265 0.3425 0.3035
VALID  0.6195 0.6275 0.6220 0.6240

Cross-Validation performs best, but its performance deteriorates as the sample size grows. The Validation Sample approach does not change with sample size and is a bit worse than Cross-Validation. The Adjusted R-squared method selects the wrong model more often than the right one.
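If you want to eyeball these proportions, a minimal plotting sketch (assuming the results matrix produced by the code above) could look like this:

barplot(results, beside = TRUE, ylim = c(0, 1),
        legend.text = rownames(results),
        xlab = "sample size", ylab = "proportion of runs")
abline(h = 0.5, lty = 2)  # coin-flip reference line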

What I find most interesting is that increasing the sample size does not lead to sure selection of the true model by the Cross-Validation and Validation Sample methods. Admittedly, for large samples the estimated coefficient of the x2 variable is near 0, so the mistake is not very costly, but I did not expect such results.
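To illustrate how small that coefficient typically is, here is a quick check on one large simulated sample (a sketch, not part of the experiment above); with x2 uniform on (0, 1) and unit error variance its standard error is roughly sqrt(12 / n):

set.seed(2)
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + rnorm(n)
# estimate, standard error, t value and p-value for the noise variable x2;
# sqrt(12 / n) is about 0.035 for n = 10 000
coef(summary(lm(y ~ x1 + x2)))["x2", ]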
