(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

Recently I had a discussion with a student about the variability of results obtained from the cross-validation procedure. While the subject is well known, there are not many examples on the web showing it, so I have written a simple presentation of it.

Results from cross-validation are reported as standard by the rpart procedure (printcp and plotcp), and the optimal cp is selected for tree pruning. Many people I have talked to think that, because the same tree is obtained each time rpart is run on the same data set, the printcp and plotcp results do not change either. However, it should be remembered that the x-val relative error they return is based on random sampling and is not constant. Therefore, two runs of rpart might indicate different optimal values of cp.
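As a quick check of this claim, here is a minimal sketch that fits the same model under two different seeds and compares the results. The helper `fit_with_seed` and the seeds 1 and 2 are my own illustrative choices, not part of rpart; choosing cp by the minimum of the `xerror` column is one common rule, assumed here for simplicity.

```r
library(rpart)
library(Ecdat)
data(Participation)

# Hypothetical helper (not part of rpart): fit the tree under a given seed
# and return the fit together with the cp value that minimises xerror.
fit_with_seed <- function(seed) {
  set.seed(seed)
  fit <- rpart(lfp ~ ., data = Participation)
  best <- which.min(fit$cptable[, "xerror"])
  list(fit = fit, best.cp = fit$cptable[best, "CP"])
}

run1 <- fit_with_seed(1)
run2 <- fit_with_seed(2)

# The unpruned tree is the same in both runs ...
isTRUE(all.equal(run1$fit$frame, run2$fit$frame))

# ... but the cross-validated errors, and possibly the selected cp, may differ
cbind(run1 = run1$fit$cptable[, "xerror"],
      run2 = run2$fit$cptable[, "xerror"])
c(run1 = run1$best.cp, run2 = run2$best.cp)

# Pruning with the cp chosen in the first run
pruned <- prune(run1$fit, cp = run1$best.cp)
```

Whether the two selected cp values actually differ depends on the seeds, so a single pair of runs is only an illustration.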

Here is the code that illustrates this situation using the Participation data from the Ecdat package:

    library(Ecdat)
    library(rpart)
    data(Participation)
    set.seed(1)
    xerror <- t(replicate(8192, rpart(lfp ~ ., data = Participation)$cptable[, 4]))
    tree.size <- factor(rpart(lfp ~ ., data = Participation)$cptable[, 2] + 1)
    colnames(xerror) <- tree.size
    par(mfrow = c(1, 2))
    boxplot(xerror, xlab = "size of tree", ylab = "X-val Relative Error")
    plot(tree.size[apply(xerror, 1, which.min)], xlab = "size of tree", ylab = "# minimal")

The resulting plot is the following:

We can see that, using the x-val criterion, a tree of size 5 is selected in around 2/3 of cases, and size 6 is found best otherwise.
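The shares read off the plot can also be tabulated directly. The sketch below is a reduced version of the simulation: `n.rep` and `best.size` are names I introduce here, and 500 replications (instead of 8192) is an assumption made only to keep the run short, so the proportions will be rougher.

```r
library(rpart)
library(Ecdat)
data(Participation)

set.seed(1)
n.rep <- 500  # assumption: fewer replications than the 8192 above, for speed

# One row per replication, one column per row of the cp table
xerror <- t(replicate(n.rep,
                      rpart(lfp ~ ., data = Participation)$cptable[, "xerror"]))
tree.size <- rpart(lfp ~ ., data = Participation)$cptable[, "nsplit"] + 1

# Tree size with the smallest x-val error in each replication
best.size <- factor(tree.size[apply(xerror, 1, which.min)], levels = tree.size)

# Share of replications in which each size is selected
round(prop.table(table(best.size)), 3)
```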

The other issue is why there is no variability in the x-val error for tree size 1 and almost none for size 2. The answer is that for those tree sizes the split in every cross-validation fold is made on a nominal variable (for example, foreign for tree size 1) at the same cut-point, so all resulting trees are identical (the one outlier for tree size 2 is due to a single different split). For tree sizes 5 and 6, continuous variables (age and lnnlinc) enter the tree and the cut-points start moving, so the resulting trees differ between cross-validation runs.
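One rough way to probe this explanation is to look at which variables the full tree splits on, check their classes, and measure the spread of the x-val error for each tree size. This is only a sketch: `split.vars` is a name I introduce, the 200 replications are an arbitrary choice, and looking at the tree fitted on the full data is an indirect proxy for what happens inside the individual cross-validation folds.

```r
library(rpart)
library(Ecdat)
data(Participation)

set.seed(1)
fit <- rpart(lfp ~ ., data = Participation)

# Variables used for the splits of the full tree, in frame order
split.vars <- as.character(fit$frame$var[fit$frame$var != "<leaf>"])
split.vars

# Class of each splitting variable: "factor" marks a nominal predictor
sapply(Participation[unique(split.vars)], class)

# Spread of the x-val error by tree size over a modest number of runs
xerror <- t(replicate(200,
                      rpart(lfp ~ ., data = Participation)$cptable[, "xerror"]))
colnames(xerror) <- fit$cptable[, "nsplit"] + 1
round(apply(xerror, 2, sd), 4)
```

If the explanation holds, the standard deviations for the smallest tree sizes should be at or near zero, and larger for the sizes where continuous variables enter.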
