[This article was first published on **R snippets**, and kindly contributed to R-bloggers].

Classification trees are known to be unstable with respect to training data. Recently I read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple test-data-based example showing the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:

- Divide the data into training and test data sets;
- Generate a random subset of the training data and build a logistic regression and a classification tree on it;
- Apply both models to the test data to obtain predicted probabilities;
- Repeat steps 2 and 3 many times;
- For each observation in the test data set, calculate the standard deviation of the obtained predictions for both classes of models;
- For both models, plot a kernel density estimate of the distribution of these standard deviations over the test data set.

The code performing the above steps is as follows:

```r
library(party)
library(Ecdat)
data(Participation)
set.seed(1)
shuffle <- Participation[sample(nrow(Participation)), ]
test <- shuffle[1:300, ]
train <- shuffle[301:nrow(Participation), ]
reps <- 1000
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)
```

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
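The same resampling measurement can also be summarized numerically rather than visually. Below is a minimal self-contained sketch of the idea on simulated data, so it runs without the Ecdat package; rpart stands in for party::ctree here, and the variable names (d, sub, sd.tree, sd.log) are my own, not from the original post. Comparing the mean per-observation standard deviations gives a single number for each model's instability:

```r
# Self-contained sketch: measure prediction instability under resampling,
# using simulated data; rpart substitutes for party::ctree.
library(rpart)
set.seed(1)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(rbinom(n, 1, plogis(d$x1 - d$x2)))
test  <- d[1:200, ]
train <- d[201:n, ]
reps <- 100
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # random subsample of the training data, as in the procedure above
    sub <- train[sample(nrow(train), 200), ]
    mt <- rpart(y ~ ., data = sub)
    ml <- glm(y ~ ., data = sub, family = binomial)
    # predict() on a classification rpart returns class probabilities;
    # column 2 is the probability of class "1"
    p.tree[[i]] <- predict(mt, newdata = test)[, 2]
    p.log[[i]]  <- predict(ml, newdata = test, type = "response")
}
# per-observation standard deviation of predictions across resamples
sd.tree <- apply(do.call(rbind, p.tree), 2, sd)
sd.log  <- apply(do.call(rbind, p.log), 2, sd)
c(logistic = mean(sd.log), tree = mean(sd.tree))
```

On runs like this one, the tree's mean standard deviation typically comes out several times larger than the logistic regression's, matching the density plot above.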
