Classification trees are known to be unstable with respect to the training data. I recently read an article on the stability of classification trees by Briand et al. (2009), who propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example that uses predictions on test data to show the instability of classification trees. The procedure, which compares a classification tree against logistic regression, is as follows:
1. Divide the data into a training set and a test set;
2. Draw a random subset of the training data and fit a logistic regression and a classification tree to it;
3. Apply both models to the test data to obtain predicted probabilities;
4. Repeat steps 2 and 3 many times;
5. For each observation in the test set, calculate the standard deviation of the predictions obtained from each of the two model types;
6. For each model type, plot a kernel density estimate of the distribution of these standard deviations over the test set.
The code performing the above steps is as follows:
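As a stand-in for the listing, here is a minimal sketch of the procedure in plain Python. It is not the author's original code, and everything specific in it is an assumption made for illustration: the synthetic one-feature data set (`make_data`), the gradient-descent logistic fit (`fit_logistic`), the small Gini-based tree (`fit_tree`, depth 3, at least 5 observations per leaf), the sample sizes and the number of repetitions. It stops after computing the per-observation standard deviations and prints their median for each model instead of drawing the density plot.

```python
import math
import random
import statistics


def make_data(n, rng):
    """Synthetic data (an assumption): one feature x, P(y=1|x) = sigmoid(2x)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else 0
        data.append((x, y))
    return data


def fit_logistic(data, steps=150, lr=1.0):
    """One-feature logistic regression fitted by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            err = 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def gini(pos, n):
    """Gini impurity of a node with `pos` positives among `n` observations."""
    p = pos / n
    return 2.0 * p * (1.0 - p)


def fit_tree(data, depth=3, min_leaf=5):
    """Small classification tree: Gini splits, leaves predict class-1 frequency."""
    data = sorted(data)
    n = len(data)
    total = sum(y for _, y in data)
    best = None
    if depth > 0 and n >= 2 * min_leaf:
        left_pos = 0
        for i in range(1, n):
            left_pos += data[i - 1][1]
            if i < min_leaf or n - i < min_leaf or data[i - 1][0] == data[i][0]:
                continue
            score = i * gini(left_pos, i) + (n - i) * gini(total - left_pos, n - i)
            if best is None or score < best[0]:
                best = (score, i)
    if best is None:  # no admissible split: make a leaf
        p = total / n
        return lambda x: p
    i = best[1]
    cut = (data[i - 1][0] + data[i][0]) / 2.0
    left = fit_tree(data[:i], depth - 1, min_leaf)
    right = fit_tree(data[i:], depth - 1, min_leaf)
    return lambda x: left(x) if x < cut else right(x)


def prediction_sd(n_train=400, n_test=100, subset=200, reps=30, seed=1):
    """Per-observation standard deviation of test predictions for both models."""
    rng = random.Random(seed)
    train = make_data(n_train, rng)                   # training set
    test_x = [x for x, _ in make_data(n_test, rng)]   # test set (features only)
    preds = {"logit": [[] for _ in test_x], "tree": [[] for _ in test_x]}
    for _ in range(reps):                             # repeat many times
        sample = rng.sample(train, subset)            # random training subset
        models = {"logit": fit_logistic(sample), "tree": fit_tree(sample)}
        for name, model in models.items():            # predict on the test set
            for j, x in enumerate(test_x):
                preds[name][j].append(model(x))
    return {name: [statistics.stdev(p) for p in cols]
            for name, cols in preds.items()}


sd = prediction_sd()
for name, values in sd.items():
    print(name, "median SD:", round(statistics.median(values), 4))
```

The two lists in `sd` are what the final step would pass to a kernel density estimator. The exact numbers depend on the seed and on the synthetic data, so they should not be read as a reproduction of the comparison discussed in this post.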
And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.