Stability of classification trees

December 9, 2011

(This article was first published on R snippets, and kindly contributed to R-bloggers)

Classification trees are known to be unstable with respect to the training data: small changes in the training sample can produce a noticeably different tree. I recently read an article on the stability of classification trees by Briand et al. (2009), who propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example, based on test data, that shows the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The procedure works as follows:
  1. Divide the data into a training set and a test set;
  2. Draw a random subset of the training data and fit a logistic regression and a classification tree to it;
  3. Apply both models to the test data to obtain predicted probabilities;
  4. Repeat steps 2 and 3 many times;
  5. For each observation in the test set, calculate the standard deviation of the predictions obtained from each of the two models;
  6. For each model, plot a kernel density estimate of the distribution of these standard deviations over the test set.
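Step 5 is the heart of the comparison, so it may help to see it in isolation. The following is a minimal sketch on made-up numbers (the matrix `preds` and its values are purely illustrative and are not taken from the Participation data): each row holds the predictions from one repetition, each column corresponds to one test observation, and the standard deviation is taken column by column.

```r
# Hypothetical predicted probabilities:
# 3 repetitions (rows) x 3 test observations (columns)
preds <- rbind(c(0.20, 0.70, 0.50),
               c(0.25, 0.30, 0.50),
               c(0.15, 0.90, 0.50))

# One standard deviation per test observation (i.e. per column)
sd.per.obs <- apply(preds, 2, sd)
print(round(sd.per.obs, 3))
```

The second observation, whose prediction jumps between 0.30 and 0.90, gets the largest standard deviation, while the third, which is predicted identically in every repetition, gets zero.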
The code performing the above steps is as follows:

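library(party)   # ctree(), treeresponse()
library(Ecdat)   # Participation data set
data(Participation)
set.seed(1)

# Step 1: shuffle the rows, then hold out 300 observations for testing
shuffle <- Participation[sample(nrow(Participation)),]
test <- shuffle[1:300,]
train <- shuffle[301:nrow(Participation),]
reps <- 1000
p.tree <- p.log <- vector("list", reps)

for (i in 1:reps) {
  # Step 2: draw 300 training observations without replacement, fit both models
  train.sub <- train[sample(nrow(train))[1:300],]
  mtree <- ctree(lfp ~ ., data = train.sub)
  mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
  # Step 3: predicted probability of the second level of lfp
  # for each test observation
  p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                        function(x) { x[2] })
  p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# Steps 5 and 6: per-observation standard deviation across repetitions,
# shown as kernel density estimates for both models
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)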

The resulting plot compares the two densities. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
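The visual impression can be backed by numbers. Below is a minimal sketch, assuming prediction lists shaped like `p.log` and `p.tree` (one numeric vector per repetition). The helper name `sd.by.obs` and the simulated toy lists are illustrative assumptions, introduced only so that the block runs without party or Ecdat; the noise levels are arbitrary and say nothing about the real models.

```r
# Hypothetical helper: stack a list of prediction vectors (one per repetition)
# into a matrix and return the standard deviation for each test observation.
sd.by.obs <- function(p.list) {
  apply(do.call(rbind, p.list), 2, sd)
}

# Toy stand-ins for p.log and p.tree: 50 repetitions, 10 test observations.
# The two noise levels are arbitrary and only make the lists differ.
set.seed(1)
base.prob <- seq(0.3, 0.7, length.out = 10)
toy.log  <- replicate(50, base.prob + rnorm(10, sd = 0.01), simplify = FALSE)
toy.tree <- replicate(50, base.prob + rnorm(10, sd = 0.05), simplify = FALSE)

sd.log  <- sd.by.obs(toy.log)
sd.tree <- sd.by.obs(toy.tree)

# Summary statistics of the per-observation standard deviations
print(summary(sd.log))
print(summary(sd.tree))

# Share of test observations for which the second list varies more
print(mean(sd.tree > sd.log))
```

With the objects created by the script above, `sd.by.obs(p.log)` and `sd.by.obs(p.tree)` would return the two vectors whose densities are plotted.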
