# Stability of classification trees

This article was first published on **R snippets**, and kindly contributed to R-bloggers.


Classification trees are known to be unstable with respect to training data. I recently read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple, test-data-based example showing the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The procedure works as follows:

- Divide the data into training and test sets;
- Draw a random subset of the training data and fit a logistic regression and a classification tree on it;
- Apply both models to the test data to obtain predicted probabilities;
- Repeat steps 2 and 3 many times;
- For each observation in the test set, calculate the standard deviation of the predictions obtained from each class of model;
- For both models, plot a kernel density estimate of the distribution of these standard deviations over the test set.

The code performing the above steps is as follows:

```r
library(party)   # ctree(), treeresponse()
library(Ecdat)   # Participation data set
data(Participation)

set.seed(1)
# shuffle the data and split it into test and training sets
shuffle <- Participation[sample(nrow(Participation)), ]
test <- shuffle[1:300, ]
train <- shuffle[301:nrow(Participation), ]

reps <- 1000
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # draw a random subset of the training data
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # predicted probability of lfp == "yes" on the test set
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# per-observation standard deviations of the predictions
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)
```

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
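The same qualitative picture can be reproduced without the party and Ecdat packages. Below is a minimal sketch of the experiment on simulated logistic data, using rpart (which ships with R) as the tree learner; everything here (the simulated data, `reps <- 50`, the 300/900 split) is my own choice for a quick illustration, not the original setup.

```r
library(rpart)  # classification tree learner shipped with base R

set.seed(1)
# simulate data from a true logistic model so glm is correctly specified
n <- 1200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(x1 + x2), "yes", "no"))
d <- data.frame(y = y, x1 = x1, x2 = x2)

test <- d[1:300, ]
train <- d[301:n, ]

reps <- 50
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # random subset of the training data, as in the main script
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- rpart(y ~ ., data = train.sub)
    mlog <- glm(y ~ ., data = train.sub, family = binomial)
    # predicted probability of y == "yes" on the test set
    p.tree[[i]] <- predict(mtree, newdata = test, type = "prob")[, "yes"]
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# per-observation standard deviations of the predictions
sd.tree <- apply(do.call(rbind, p.tree), 2, sd)
sd.log <- apply(do.call(rbind, p.log), 2, sd)
round(c(tree = mean(sd.tree), logistic = mean(sd.log)), 3)
```

Comparing `mean(sd.tree)` with `mean(sd.log)` gives a single-number summary of the density plot; in this simulation the tree's per-observation standard deviations come out markedly larger than the logistic regression's.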
