(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

The standard textbook analysis of model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate in-sample, conditional, or expected test error. Another interesting question, however, is how they compare in their ability to select the true model.

To test this, I generated data following the process y = x1 + ε (in the code, x1 is drawn uniformly from [0, 1] and ε is standard normal) and compared two linear models. The first estimates only an intercept and a coefficient for x1; the second adds a random noise variable x2, which is unrelated to y.
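As a minimal sketch of this setup, the snippet below draws one sample and fits both candidate models. The seed, the sample size of 100, and the names `small` and `large` are illustrative assumptions, not part of the original code. It also shows why in-sample fit alone cannot be used to choose between the models: for nested least-squares fits, the residual sum of squares of the larger model can never exceed that of the smaller one.

```r
set.seed(42)             # arbitrary seed (assumption), for reproducibility only
n <- 100                 # illustrative sample size (assumption)

x1 <- runif(n)           # the only variable that drives y
x2 <- runif(n)           # pure noise, unrelated to y
y  <- x1 + rnorm(n)      # data generating process: y = x1 + epsilon

small <- lm(y ~ x1)      # true model: intercept and x1
large <- lm(y ~ x1 + x2) # over-specified model: adds the noise variable

# Residual sums of squares; for lm objects deviance() returns the RSS
rss <- c(small = deviance(small), large = deviance(large))
print(rss)
print(coef(large))
```

Because the in-sample comparison always favours (or ties with) the larger model, a selection method has to either penalise model size or evaluate the fit on data not used for estimation.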

I decided to compare 5-fold cross-validation, adjusted R-squared, and a validation sample (with a 70/30 split) by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test for initial sample sizes of 10, 100, 1,000 and 10,000, replicating each scenario 2000 times. Here is my code:

    library(boot)  # provides cv.glm
    set.seed(1)

    # One simulation run: compare the two models under each criterion
    decision <- function(n) {
        x1 <- runif(n)
        x2 <- runif(n)          # noise variable, unrelated to y
        y <- x1 + rnorm(n)
        train.size <- round(0.7 * n)
        data.set <- data.frame(x1, x2, y)
        train.data <- data.set[1:train.size, ]
        validation.data <- data.set[(train.size + 1):n, ]
        formulas <- list(y ~ x1, y ~ x1 + x2)
        cv <- arsq <- valid <- list()
        for (i in 1:2) {
            # 5-fold cross-validation estimate of prediction error
            cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
            # adjusted R-squared on the full sample
            arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
            # mean squared error on the 30% validation sample
            valid.lm <- lm(formulas[[i]], data = train.data)
            valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data$y)^2)
        }
        # cv and valid are errors (lower is better); arsq is a fit measure
        # (higher is better), so arsq[[1]] < arsq[[2]] is TRUE when the
        # model with x2 has the higher adjusted R-squared
        return(c(cv[[1]] < cv[[2]], arsq[[1]] < arsq[[2]], valid[[1]] < valid[[2]]))
    }

    # Share of 2000 replications in which each comparison is TRUE
    correct <- function(n) {
        rowMeans(replicate(2000, decision(n)))
    }

    n <- c(10, 100, 1000, 10000)
    results <- sapply(n, correct)
    rownames(results) <- c("CV", "adjR2", "VALID")
    colnames(results) <- n
    print(results)
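To make the cross-validation step less of a black box, here is a hand-written sketch of K-fold cross-validation for a linear model. The helper `kfold_mse`, the data frame `d`, and the sample size are hypothetical names and values introduced for illustration. The sketch follows the same idea as the `cv.glm(...)$delta[1]` call in the code above (average squared prediction error over held-out folds), but it is not a reimplementation of `cv.glm`, and its random fold assignment will differ.

```r
set.seed(1)
n <- 100                                  # illustrative sample size (assumption)
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- d$x1 + rnorm(n)

# Hypothetical helper: K-fold cross-validated mean squared error of an lm fit
kfold_mse <- function(formula, data, K = 5) {
  resp <- all.vars(formula)[1]            # name of the response variable
  # Randomly assign each row to one of K folds of (nearly) equal size
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (k in seq_len(K)) {
    held_out <- fold == k
    fit  <- lm(formula, data = data[!held_out, ])       # fit on the other folds
    pred <- predict(fit, newdata = data[held_out, ])    # predict the held-out fold
    sq_err[held_out] <- (data[[resp]][held_out] - pred)^2
  }
  mean(sq_err)                            # average over all held-out predictions
}

mse_small <- kfold_mse(y ~ x1, d)
mse_large <- kfold_mse(y ~ x1 + x2, d)
print(c(small = mse_small, large = mse_large))
```

The model with the lower cross-validated error is the one selected. Since each call draws its own folds, the two estimates carry some extra noise relative to a comparison on identical folds.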

The results are given below and are quite interesting:

Cross-validation performs best, *but* its performance deteriorates as the sample size grows. The validation sample approach does not change with sample size and is slightly worse than cross-validation. The adjusted R-squared method selects the wrong model more often than the right one.

What I find interesting is that increasing the sample size does not lead to certain selection of the true model under either cross-validation or the validation sample method. Admittedly, for large samples the estimated coefficient on the variable x2 is close to 0, so the mistake is not very costly, but I did not expect such results.
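The point about large samples can be checked with a short sketch. The seed and the object name `fit` are assumptions made for illustration. With n = 10,000, x2 uniform on [0, 1] (standard deviation about 0.289) and unit error variance, the standard error of the x2 coefficient is roughly 1 / (sqrt(n) × 0.289) ≈ 0.035, so the estimate should sit close to zero even when the larger model is selected.

```r
set.seed(1)              # arbitrary seed (assumption)
n <- 10000               # largest sample size used in the post

x1 <- runif(n)
x2 <- runif(n)           # noise variable
y  <- x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)   # the over-specified model

# Estimate and standard error of the coefficient on the noise variable
x2_row <- coef(summary(fit))["x2", c("Estimate", "Std. Error")]
print(x2_row)
```

In other words, selecting the model with x2 at this sample size adds a term whose estimated effect is small, which is why the selection error matters little for prediction.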
