
Classification trees are known to be unstable with respect to the training data. I recently read an article on the stability of classification trees by Briand et al. (2009), who propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example, based on test data, that shows the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:

1. Divide the data into training and test data sets;
2. Generate a random subset of the training data and use it to build a logistic regression and a classification tree;
3. Apply the models to the test data to obtain predicted probabilities;
4. Repeat steps 2 and 3 many times;
5. For each observation in the test data set, calculate the standard deviation of the predictions obtained, separately for each type of model;
6. For both models, plot a kernel density estimate of the distribution of these standard deviations over the test data set.

The code performing the above steps is as follows:

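library(party)   # ctree, treeresponse
library(Ecdat)   # Participation data set
data(Participation)
set.seed(1)
# Shuffle the rows, then hold out the first 300 as the test set
shuffle <- Participation[sample(nrow(Participation)),]
test <- shuffle[1:300,]
train <- shuffle[301:nrow(Participation),]
reps <- 1000
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # Fit both models on a random subset of 300 training observations
    train.sub <- train[sample(nrow(train))[1:300],]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # Keep only the probability of the second level of lfp,
    # which is the same class that glm's "response" prediction refers to
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}
# Rows of the bound matrices are repetitions, columns are test observations
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)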

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
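
The visual impression can also be backed by a few summary numbers. Below is a minimal sketch of one way to do that, assuming the lists p.log and p.tree from the code above are still in the workspace. The helper name sd_summary is my own invention for illustration and is not part of any package, and the toy input is synthetic, not a result from the Participation data.

```r
# Hypothetical helper (not from any package): computes the standard
# deviation of the predictions for each test observation across
# repetitions, then summarises those standard deviations.
# `preds` is assumed to be a list of equal-length numeric vectors,
# one vector of predicted probabilities per repetition.
sd_summary <- function(preds) {
  m <- do.call(rbind, preds)   # rows: repetitions, columns: observations
  s <- apply(m, 2, sd)         # one standard deviation per observation
  c(mean = mean(s), median = median(s), max = max(s))
}

# Synthetic toy input: 3 repetitions, 2 test observations.
# The first observation's prediction varies, the second is constant.
toy <- list(c(0.2, 0.5), c(0.4, 0.5), c(0.6, 0.5))
print(sd_summary(toy))

# With the objects created by the code above one would call:
# sd_summary(p.log)
# sd_summary(p.tree)
```

A lower mean or median for one model would indicate that its predictions change less from one training subset to another, which is the same comparison the density plot shows graphically.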