(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

Classification trees are known to be unstable with respect to training data: a small change in the training sample can produce a noticeably different tree. I recently read an article on the stability of classification trees by Briand et al. (2009), who propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example, based on test data, that shows the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:

1. Divide the data into training and test sets;
2. Draw a random subset of the training data and fit a logistic regression and a classification tree to it;
3. Apply both models to the test data to obtain predicted probabilities;
4. Repeat steps 2 and 3 many times;
5. For each observation in the test set, calculate the standard deviation of the predictions obtained from each class of model;
6. For both models, plot a kernel density estimate of the distribution of these standard deviations over the test set.

The code performing the above steps is as follows:

    library(party)   # ctree, treeresponse
    library(Ecdat)   # Participation data set
    data(Participation)

    # Step 1: shuffle the data and split it into test and training sets
    set.seed(1)
    shuffle <- Participation[sample(nrow(Participation)), ]
    test <- shuffle[1:300, ]
    train <- shuffle[301:nrow(Participation), ]

    # Steps 2-4: refit both models on random training subsets
    reps <- 1000
    p.tree <- p.log <- vector("list", reps)
    for (i in 1:reps) {
        train.sub <- train[sample(nrow(train))[1:300], ]
        mtree <- ctree(lfp ~ ., data = train.sub)
        mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
        # keep the predicted probability of the second class of lfp
        p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                              function(x) { x[2] })
        p.log[[i]] <- predict(mlog, newdata = test, type = "response")
    }

    # Steps 5-6: per-observation sd of predictions and its density
    plot(density(apply(do.call(rbind, p.log), 2, sd)),
         main = "", xlab = "sd")
    lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
    legend("topright", legend = c("logistic", "tree"),
           col = c("black", "red"), lty = 1)

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
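
The density plot can also be complemented with a numeric summary. The following is only a minimal sketch: `prediction_sd` is a hypothetical helper name that is not part of the original code, and the two lists of simulated predictions are toy stand-ins so that the snippet runs on its own with base R.

```r
# Hypothetical helper (not from the original post): per-observation standard
# deviation of predictions across repetitions.
# `preds` is a list of numeric vectors, one per repetition, all of equal length.
prediction_sd <- function(preds) {
  m <- do.call(rbind, preds)  # rows = repetitions, columns = test observations
  apply(m, 2, sd)
}

# Toy stand-in data (an assumption, used only to make the sketch self-contained):
# two "models" predicting 5 observations over 200 repetitions with different noise.
set.seed(1)
n.obs <- 5
reps <- 200
stable <- replicate(reps, 0.5 + rnorm(n.obs, sd = 0.01), simplify = FALSE)
unstable <- replicate(reps, 0.5 + rnorm(n.obs, sd = 0.10), simplify = FALSE)

sd.stable <- prediction_sd(stable)
sd.unstable <- prediction_sd(unstable)

# Numeric counterpart of the density plot
print(summary(sd.stable))
print(summary(sd.unstable))

# Share of observations for which the first model is the more stable one
print(mean(sd.stable < sd.unstable))
```

With the objects from the simulation above, the same helper could be called as `summary(prediction_sd(p.log))` and `summary(prediction_sd(p.tree))`.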
