(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

The standard textbook analysis of model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate in-sample, conditional or expected test error. However, another interesting question is how they compare in their ability to select the true model.

To test this, I generated data following the process y = x1 + ε and compared two linear models. In the first, only the intercept and the parameter for x1 are estimated; in the second, a random noise variable x2 is added.
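For reference, the setup can be written out as follows. The distributions are read off the calls to runif and rnorm in the code below; the labels M_1 and M_2 and the β symbols are notation introduced here for illustration, not the author's.

```latex
\begin{aligned}
\text{data:}\quad & y = x_1 + \varepsilon, \qquad x_1, x_2 \sim U(0,1), \qquad \varepsilon \sim N(0,1) \\
M_1:\quad & y = \beta_0 + \beta_1 x_1 + \varepsilon \\
M_2:\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\end{aligned}
```

M_1 is the true model: the data are generated with β_0 = 0, β_1 = 1 and β_2 = 0, so a selection counts as correct when M_1 is preferred over M_2.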

I decided to compare 5-fold Cross-Validation, Adjusted R-squared and Validation Sample (with a 70/30 split) methods by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test with initial sample sizes of 10, 100, 1,000 and 10,000. Here goes my code:

    library(boot)
    set.seed(1)

    # One simulation run: TRUE/FALSE per method (CV, adjusted R2, validation)
    decision <- function(n) {
        x1 <- runif(n)
        x2 <- runif(n)                    # noise variable, unrelated to y
        y <- x1 + rnorm(n)
        train.size <- round(0.7 * n)      # 70/30 split
        data.set <- data.frame(x1, x2, y)
        train.data <- data.set[1:train.size, ]
        validation.data <- data.set[(train.size + 1):n, ]
        formulas <- list(y ~ x1, y ~ x1 + x2)
        cv <- arsq <- valid <- list()
        for (i in 1:2) {
            # 5-fold cross-validation estimate of prediction error
            cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
            # adjusted R-squared on the full sample
            arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
            # mean squared error on the validation sample
            valid.lm <- lm(formulas[[i]], data = train.data)
            valid[[i]] <- mean((predict(valid.lm, validation.data) -
                                validation.data$y)^2)
        }
        # NOTE: for cv and valid, lower is better, so "<" means model 1 wins.
        # For adjusted R-squared higher is better, so arsq[[1]] < arsq[[2]]
        # is TRUE when the model with x2 wins.
        return(c(cv[[1]] < cv[[2]],
                 arsq[[1]] < arsq[[2]],
                 valid[[1]] < valid[[2]]))
    }

    # Share of TRUE results per method over 2000 replications
    correct <- function(n) {
        rowMeans(replicate(2000, decision(n)))
    }

    n <- c(10, 100, 1000, 10000)
    results <- sapply(n, correct)
    rownames(results) <- c("CV", "adjR2", "VALID")
    colnames(results) <- n
    print(results)

The results are given below and are quite interesting:

              10    100   1000  10000
    CV    0.7260 0.6430 0.6585 0.6405
    adjR2 0.3430 0.3265 0.3425 0.3035
    VALID 0.6195 0.6275 0.6220 0.6240

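Cross-Validation performs best, *but* its success rate drops from about 0.73 at n = 10 to roughly 0.64 to 0.66 for larger samples. The Validation Sample approach does not change with sample size (about 0.62) and is a bit worse than Cross-Validation. The adjR2 row looks as if the Adjusted R-squared method selects the wrong model more often than the right one. Note, however, that the code records arsq[[1]] < arsq[[2]], and a higher adjusted R-squared is better. That row therefore counts how often the model with x2 wins, and the true model is selected in the remaining 66% to 70% of runs, which is in line with the other two methods.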
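The adjR2 figures of roughly 0.30 to 0.34 have a simple explanation. Adding a regressor raises adjusted R-squared exactly when the absolute t-statistic of the added coefficient exceeds 1. For a pure noise variable this happens with probability 2 * pnorm(-1) in large samples. The sketch below checks the equivalence on one simulated data set; the variable names (small, big, t.x2, big.wins, p.noise) are illustrative and not part of the original code.

```r
# Sketch: adjusted R-squared prefers the larger model exactly when |t| > 1
# for the added coefficient. All names here are illustrative.
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)          # pure noise regressor
y <- x1 + rnorm(n)

small <- summary(lm(y ~ x1))
big <- summary(lm(y ~ x1 + x2))

# t-statistic of the x2 coefficient in the larger model
t.x2 <- coef(big)["x2", "t value"]
# does the larger model have the higher adjusted R-squared?
big.wins <- big$adj.r.squared > small$adj.r.squared

# The two criteria agree for this data set (algebraically, for any data set)
stopifnot(big.wins == (abs(t.x2) > 1))

# Large-sample probability that a noise variable clears |t| > 1
p.noise <- 2 * pnorm(-1)
print(p.noise)
```

The printed probability is about 0.32, close to the values in the adjR2 row, which is what one would expect if that row counts the runs in which the model with x2 comes out ahead.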

What I find interesting is that increasing the sample size does not lead to sure selection of the true model for the Cross-Validation and Validation Sample methods. Admittedly, for large samples the estimate of the parameter for variable x2 is near 0, so the mistake is not very serious, but I did not expect such results.
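To illustrate why a wrong selection matters less as the sample grows, here is a minimal sketch that fits the larger model at each sample size and reports the estimate and standard error of the x2 coefficient. It assumes the same data-generating process as above; sizes and x2.fit are names introduced here for illustration.

```r
# Sketch: the x2 coefficient and its standard error at each sample size.
# Same data-generating process as in the post; names are illustrative.
set.seed(1)
sizes <- c(10, 100, 1000, 10000)

x2.fit <- sapply(sizes, function(n) {
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1 + rnorm(n)
    # row "x2" of the coefficient table: estimate and standard error
    coef(summary(lm(y ~ x1 + x2)))["x2", c("Estimate", "Std. Error")]
})
colnames(x2.fit) <- sizes
print(round(x2.fit, 4))
```

Since x2 is uniform on (0, 1) with variance 1/12 and the noise has unit variance, the standard error is roughly sqrt(12 / n), or about 0.035 at n = 10,000. The x2 term therefore adds very little to the predictions in large samples, even though the share of runs in which it is selected does not shrink.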
