Stability of classification trees

Posted on December 9, 2011 by Bogumił Kamiński in R bloggers

[This article was first published on R snippets, and kindly contributed to R-bloggers].

Classification trees are known to be unstable with respect to training data. I recently read an article on the stability of classification trees by Briand et al. (2009), who propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example, based on test data, showing the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:

1. Divide the data into a training and a test data set;
2. Generate a random subset of the training data and use it to build a logistic regression and a classification tree;
3. Apply the models to the test data to obtain predicted probabilities;
4. Repeat steps 2 and 3 many times;
5. For each observation in the test data set, calculate the standard deviation of the obtained predictions for both classes of models;
6. For both models, plot the kernel density estimate of the distribution of these standard deviations in the test data set.
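
To make step 5 concrete before the full code, here is a minimal sketch on a made-up prediction matrix. The numbers and the name preds are illustrative assumptions, not results from the Participation data. Each row holds one repetition and each column one test observation, so the column-wise standard deviation measures how much the prediction for a given observation moves between fitted models.

```r
# Toy predictions: 3 repetitions (rows) x 2 test observations (columns).
# The values are invented for illustration only.
preds <- rbind(c(0.50, 0.20),
               c(0.50, 0.40),
               c(0.50, 0.60))

# Standard deviation of the predictions for each test observation
obs.sd <- apply(preds, 2, sd)
print(obs.sd)
```

The first observation gets the same prediction in every repetition, so its standard deviation is zero; the second one varies and gets a positive value.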

The code performing the above steps is as follows:

library(party)
library(Ecdat)

data(Participation)
set.seed(1)

# Shuffle the rows, then hold out 300 observations for testing
shuffle <- Participation[sample(nrow(Participation)),]
test <- shuffle[1:300,]
train <- shuffle[301:nrow(Participation),]

reps <- 1000
p.tree <- p.log <- vector("list", reps)

for (i in 1:reps) {
    # Random subset of 300 training observations
    train.sub <- train[sample(nrow(train))[1:300],]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # Predicted probability of the second class of lfp for each test observation
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# Per-observation standard deviation across repetitions, shown as densities
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)

Here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
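
The visual impression can also be backed by a few numbers. The sketch below defines a hypothetical helper, sd.summary (not part of the original code), which takes a list of prediction vectors shaped like p.log or p.tree and returns the quartiles of the per-observation standard deviations. It is shown on simulated data so that it runs on its own; the simulated noise levels are arbitrary assumptions and say nothing about the Participation results.

```r
# Hypothetical helper: summarise per-observation standard deviations
# for a list of prediction vectors (one vector per repetition).
sd.summary <- function(p.list) {
    obs.sd <- apply(do.call(rbind, p.list), 2, sd)
    quantile(obs.sd, probs = c(0.25, 0.5, 0.75))
}

# Simulated stand-ins for p.log and p.tree: 50 repetitions, 10 observations.
set.seed(1)
base <- runif(10, 0.2, 0.8)
toy.stable   <- replicate(50, base + rnorm(10, sd = 0.01), simplify = FALSE)
toy.unstable <- replicate(50, base + rnorm(10, sd = 0.10), simplify = FALSE)

print(sd.summary(toy.stable))
print(sd.summary(toy.unstable))
```

With the objects from the simulation above, sd.summary(p.log) and sd.summary(p.tree) would give the corresponding quartiles for the two models.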
