
The standard textbook analysis of model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate in-sample, conditional or expected test error. Another interesting question, however, is how they compare in their ability to select the true model.
To test this, I generated data following the process y = x1 + ε, where ε is standard normal noise, and compared two linear models. In the first, only the intercept and the coefficient on x1 are estimated; in the second, a random noise variable x2 is added.

I compared 5-fold cross-validation, adjusted R-squared and a validation sample (with a 70/30 split) by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test for initial sample sizes of 10, 100, 1 000 and 10 000, with 2000 replications each. Here is my code:

library(boot)
set.seed(1)

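decision <- function(n) {
    # Simulate data: y depends only on x1; x2 is pure noise
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1 + rnorm(n)
    train.size <- round(0.7 * n)
    data.set <- data.frame(x1, x2, y)
    train.data <- data.set[1 : train.size,]
    validation.data <- data.set[(train.size + 1) : n,]
    formulas <- list(y ~ x1, y ~ x1 + x2)
    cv <- arsq <- valid <- list()
    for (i in 1:2) {
        # 5-fold CV prediction error; delta[1] is the raw estimate
        # (delta[2] would be the bias-adjusted one)
        cv[[i]] <- cv.glm(data.set, glm(formulas[[i]], data = data.set),
                          K = 5)$delta[1]
        arsq[[i]] <- summary(lm(formulas[[i]], data = data.set))$adj.r.squared
        # Validation sample: fit on the first 70%, MSE on the remaining 30%
        valid.lm <- lm(formulas[[i]], data = train.data)
        valid[[i]] <- mean((predict(valid.lm, validation.data) -
                            validation.data$y)^2)
    }
    # TRUE when the true model (without x2) is selected:
    # lower error is better for CV and VALID, higher adjusted R-squared is better
    return(c(CV    = cv[[1]] < cv[[2]],
             ARSQ  = arsq[[1]] > arsq[[2]],
             VALID = valid[[1]] < valid[[2]]))
}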

correct <- function(n) {
    # Share of 2000 replications in which each method picks the true model
    rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

           10    100   1000  10000
CV     0.7260 0.6430 0.6585 0.6405
VALID  0.6195 0.6275 0.6220 0.6240

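Cross-validation performs best, but its success rate drops from 0.73 at n = 10 to about 0.64-0.66 for the larger samples. The validation sample approach is a bit worse than cross-validation, and its performance stays at about 0.62 regardless of sample size. In my runs the adjusted R-squared method selected the wrong model more often than the right one; note that a higher adjusted R-squared indicates the preferred model, so this figure depends on the direction of the arsq comparison in the code.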
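
One way to sanity-check the adjusted R-squared figure is to use a standard property of nested linear models: adding a single variable raises adjusted R-squared exactly when the absolute t-statistic of that variable exceeds 1. The sketch below is not part of the original experiment; the function name one.draw, the seed, the sample size of 100 and the 500 replications are all arbitrary choices made for illustration. It counts how often the larger model is preferred and checks that this coincides with |t| > 1 for x2.

```r
# Sketch (assumed names and settings, not from the original experiment)
set.seed(42)

one.draw <- function(n) {
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1 + rnorm(n)
    small <- summary(lm(y ~ x1))
    large <- summary(lm(y ~ x1 + x2))
    c(prefers.large = large$adj.r.squared > small$adj.r.squared,
      abs.t.above.1 = abs(large$coefficients["x2", "t value"]) > 1)
}

draws <- replicate(500, one.draw(100))
# Share of replications in which both criteria give the same answer
agreement <- mean(draws["prefers.large", ] == draws["abs.t.above.1", ])
# Share of replications in which the larger (wrong) model is preferred
share.large <- mean(draws["prefers.large", ])
print(c(agreement = agreement, share.large = share.large))
```

Since x2 is pure noise, its t-statistic exceeds 1 in absolute value roughly a third of the time, so share.large should land in that neighbourhood if the comparison is oriented so that a higher adjusted R-squared wins.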

What I find interesting is that increasing the sample size does not lead to sure selection of the true model with either cross-validation or the validation sample. Admittedly, for large samples the estimated coefficient on the variable x2 is close to 0, so the cost of the mistake is small, but I did not expect such results.
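
The remark about the x2 coefficient can be illustrated with a short simulation. This is only a sketch under the same data generating process as above; the helper x2.coef, the seed and the choice of 200 replications per sample size are assumptions made here, not part of the original code. It reports the average absolute estimate of the x2 coefficient in the larger model for each sample size.

```r
# Sketch (assumed names and settings): size of the spurious x2 coefficient
set.seed(2)

x2.coef <- function(n) {
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1 + rnorm(n)
    # Estimated coefficient on the noise variable in the larger model
    unname(coef(lm(y ~ x1 + x2))["x2"])
}

sizes <- c(10, 100, 1000, 10000)
mean.abs.coef <- sapply(sizes, function(k) mean(abs(replicate(200, x2.coef(k)))))
names(mean.abs.coef) <- sizes
print(mean.abs.coef)
```

The averages should shrink as the sample grows, which is why picking the larger model costs little in prediction accuracy even though it is picked about as often.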