I recently read a really interesting blog post about trying to predict who survived on the Titanic with a standard GLM (a logit model) and two tree-based, non-parametric classification methods (a random forest and a conditional inference tree). The post was featured on R-bloggers, and I think it's worth a closer look.
The basic idea was to figure out which of these three model types did the best job of predicting which passengers would survive on the Titanic based on their personal characteristics, such as sex, age, and the class of the ticket (first, second, or third). For each model, the blogger estimated the model on half the sample (the “training data”) and then predicted the probability of survival for the other half of the sample (the “test data”). Any passenger with a predicted probability of survival above 50% was classified as a predicted survivor. The blogger then determined what proportion of predicted survivors actually survived.
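To make the assessment procedure concrete, here is a minimal sketch of it in Python, written for illustration only. The function name `share_of_predicted_survivors_who_survived` and the six predictions and outcomes are made up; they are not the blogger's code or data.

```python
def share_of_predicted_survivors_who_survived(probs, survived, threshold=0.5):
    """Classify passengers with predicted probability above the threshold
    as survivors, then return the share of them who actually survived."""
    outcomes_of_predicted = [y for p, y in zip(probs, survived) if p > threshold]
    if not outcomes_of_predicted:
        return float("nan")  # nobody was classified as a survivor
    return sum(outcomes_of_predicted) / len(outcomes_of_predicted)


# Hypothetical test-set predictions and observed outcomes (1 = survived).
probs = [0.90, 0.80, 0.55, 0.40, 0.20, 0.10]
survived = [1, 1, 0, 1, 0, 0]

share = share_of_predicted_survivors_who_survived(probs, survived)
print(f"Share of predicted survivors who survived: {share:.2f}")
```

Note that the passenger at 0.55 is treated exactly like the passenger at 0.90 once the threshold is applied; the size of the predicted probability plays no further role.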
The result, copied from the original blog post:
I think this is a flawed way to assess the predictive power of the model. If a passenger is predicted to have a 50% probability of survival, we should expect this passenger to die half the time (in repeated samples); classifying the person as a “survivor,” as if the probability of survival were 100%, misinterprets the model's prediction. For example, suppose (counterfactually) that a model assigned half of a test data set of 200 passengers a 10% chance of survival and the other half a 60% chance of survival. The binary classification procedure with a 50% threshold expects there to be 100 survivors, all of them in the portion of the sample with a >50% predicted probability of survival. But it's more realistic to expect about 10 survivors in the 10% group and 60 survivors in the 60% group, for a total of 70 survivors. Only about 60% of the predicted “survivors” would actually survive, so even if the model's predictions were totally accurate, the binary classification method of assessment could easily make them look terrible.
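The arithmetic of that counterfactual example can be checked with a short Python sketch. It assumes the model's probabilities are exactly right and works with expected counts rather than simulated outcomes; the variable names are my own.

```python
# Counterfactual test set: 100 passengers at 10% and 100 at 60%.
groups = [
    {"n": 100, "p": 0.10},
    {"n": 100, "p": 0.60},
]
threshold = 0.5

# Survivors we should expect if the predicted probabilities are correct.
expected_survivors = sum(g["n"] * g["p"] for g in groups)

# Survivors the binary classification procedure implicitly expects.
classified_survivors = sum(g["n"] for g in groups if g["p"] > threshold)

# Expected number of classified "survivors" who really survive.
expected_hits = sum(g["n"] * g["p"] for g in groups if g["p"] > threshold)
share_correct = expected_hits / classified_survivors

print(f"Expected survivors overall: {expected_survivors:.0f}")
print(f"Passengers classified as survivors: {classified_survivors}")
print(f"Expected share of classified survivors who survive: {share_correct:.0%}")
```

So a model whose probabilities are perfectly accurate would still see only about 60% of its classified survivors survive, and it would miss the roughly 10 survivors in the low-probability group.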
Andrew Pierce and I just published a paper in Political Analysis making this argument. In that paper, we propose assessing a model's predictive accuracy by constructing a plot with the predicted probability of survival on the x-axis, and the empirical proportion of survivors with that predicted probability on the y-axis. The empirical proportion is computed by running a lowess regression of the binary (1/0) survival variable on the model's predicted probability, using the AIC to choose the optimal bandwidth, and then extracting the lowess fitted values. We created an R package to perform this process automatically (and to implement a bootstrapping approach to assessing uncertainty in the plot), but this package is designed for the assessment of in-sample fit only, so I have to construct the plots manually for this example. The code for everything I've done is here.
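For readers who want a feel for how such a plot is built, here is a rough Python sketch of the underlying idea. It is a deliberate simplification and not the method from the paper or the R package: it uses a local-constant (kernel-weighted average) smoother with tricube weights and a fixed, hand-picked bandwidth of 0.2, where the paper uses lowess with an AIC-selected bandwidth. The data are simulated from a perfectly calibrated hypothetical model, and all names are assumptions of mine.

```python
import random


def tricube(u):
    """Tricube kernel weight: positive for |u| < 1, zero otherwise."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def smoothed_proportion(probs, outcomes, grid, bandwidth=0.2):
    """Kernel-weighted share of survivors among passengers whose predicted
    probability lies near each grid value."""
    curve = []
    for g in grid:
        weights = [tricube((p - g) / bandwidth) for p in probs]
        total = sum(weights)
        if total == 0:
            curve.append(float("nan"))  # no observations near this value
        else:
            curve.append(sum(w * y for w, y in zip(weights, outcomes)) / total)
    return curve


# Simulated data: each outcome is drawn with exactly its predicted
# probability, so the smoothed curve should sit near the 45-degree line.
random.seed(0)
probs = [random.random() for _ in range(5000)]
outcomes = [1 if random.random() < p else 0 for p in probs]

grid = [i / 10 for i in range(1, 10)]
curve = smoothed_proportion(probs, outcomes, grid)
for g, c in zip(grid, curve):
    print(f"predicted {g:.1f} -> empirical {c:.2f}")
```

Plotting `curve` against `grid` alongside a 45-degree line gives the kind of picture described above: a well-calibrated model hugs the line, and systematic departures show where it over- or under-predicts.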
Here's the plot for the GLM model:
As you can see, the logit model actually does a very good job of predicting the probability of passenger survival in the test data. It slightly underpredicts the probability of death for passengers who are unlikely to die, and slightly overpredicts the probability of death for the other passengers. But the predictions for near-certain (Pr(survive) near 0) and nearly impossible (Pr(survive) near 1) deaths, which make up most of the data set, are quite accurate.
The random forest model does a perfectly reasonable job of predicting outcomes, but not markedly better:
The pattern of over- and under-predictions is very similar to that of the GLM model. In fact, if you plot the logit predictions against the random forest predictions…
You can see that there are comparatively few cases that the two models treat very differently. The primary systematic difference seems to be that the random forest model takes cases that the logit predicts to have a low but positive probability of survival, and assigns them zero probability of survival. I put in the dark vertical and horizontal lines to show which data points the binary classification procedure would deem “survivors” for each model; there are a few observations that are categorized differently by the two models (in the upper left and lower right quadrants of the plot), but most are categorized the same.
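The quadrant comparison can also be tallied directly. The Python sketch below counts how many cases fall on each side of the 50% threshold under two sets of predictions; the function name and the six pairs of predictions are invented for illustration and are not taken from the actual models.

```python
from collections import Counter


def quadrant_counts(probs_a, probs_b, threshold=0.5):
    """Count cases by whether each model puts them above the threshold.
    Keys are (model A says survivor, model B says survivor)."""
    counts = Counter()
    for a, b in zip(probs_a, probs_b):
        counts[(a > threshold, b > threshold)] += 1
    return counts


# Hypothetical predictions for the same six passengers.
logit_probs = [0.05, 0.15, 0.45, 0.55, 0.70, 0.95]
forest_probs = [0.00, 0.00, 0.60, 0.40, 0.80, 1.00]

counts = quadrant_counts(logit_probs, forest_probs)
disagreements = counts[(True, False)] + counts[(False, True)]
print(f"Both predict death:    {counts[(False, False)]}")
print(f"Both predict survival: {counts[(True, True)]}")
print(f"Models disagree:       {disagreements}")
```

The off-diagonal keys, `(True, False)` and `(False, True)`, correspond to the lower right and upper left quadrants when the first model is on the x-axis.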
Finally, the conditional inference tree model does classify things quite differently, but not in a way that substantially improves the performance of the model:
I've jittered the CTREE predictions a bit so that you can see the data density. The tree essentially creates five categories of predictions, but within those categories its predictions are not appreciably more accurate than the logit model's.
Comparing the GLM logit predictions to the ctree predictions…
…you see the categorizations more clearly. Of course, you can just look at the plot of the tree itself to see how these categories are created:
I have to admit, that is a pretty sweet plot.
In conclusion, comparatively ancient GLM methods do surprisingly well on this problem when compared to the CART methods. If anything, the CART methods apparently suppress a decent amount of heterogeneity in probability forecasts that the GLM uncovers. But all of the models have the same basic strengths and weaknesses in terms of predictive accuracy. And if the heterogeneity of the GLM predictions reflects signal and not noise (and my plots seem to suggest that it is signal), the GLM predictions might well be better for forecasting survival in individual cases.
Maybe some day, I will get around to creating a version of my R package that does out-of-sample forecasts! That way, I could get assessments of the uncertainty around the plot as well.