In preparing the following example for my forthcoming book, I was startled by how differently two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:
For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.
I then tried CART, using the rpart package. (Note: throughout the book, I try to stick to the default values of the arguments.) Here is the code:
> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871
OK, very similar.
Then I tried the well-known Letter Recognition data set from UCI. This too is a classification problem, with one class for each of the capital letters ‘A’ through ‘Z’, and 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).
However, rpart did poorly:
> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799
Of course, potential deficiencies in CART led the original developers of CART to the notion of random forests, so I gave that a try.
> library(randomForest)
> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875
Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).
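A note on how the rpart and randomForest figures above were computed: calling predict() on a randomForest fit without newdata returns the out-of-bag predictions, whereas predict() on an rpart fit without newdata returns fitted values for the training data. So the 0.96875 figure is an out-of-bag estimate, while the 0.4799 figure is a resubstitution rate, and the comparison if anything favors rpart. The sketch below illustrates the distinction on the built-in iris data, which is used only as a stand-in and is not part of the analysis above.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

# No newdata: predict() returns the out-of-bag predictions stored in the fit
oob_pred <- predict(rf)

# Training data passed explicitly: resubstitution predictions from the full forest
resub_pred <- predict(rf, newdata = iris)

# Accuracy under each kind of prediction
c(oob = mean(oob_pred == iris$Species),
  resub = mean(resub_pred == iris$Species))
```

The out-of-bag predictions are the same ones stored in rf$predicted, so no separate holdout set is needed to get an honest accuracy estimate from a random forest.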
I speculated that the cause might be that the response variables here (26 of them) are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it is an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.
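One way to probe the early-stopping idea is to relax rpart's complexity parameter cp, whose default of 0.01 in rpart.control() rejects splits that do not improve the overall fit by at least that factor. The following is only a sketch on simulated data, not the Letter Recognition data: the data frame sim, the band layout, and the value cp = 0.0001 are all assumptions chosen for illustration. Each of 26 classes occupies 10 disjoint bands of a single predictor, so class membership is non-monotonic in x.

```r
library(rpart)

set.seed(1)
n <- 20000
x <- runif(n)

# 260 equal-width bands of x, cycling through the 26 letters, so each letter
# owns 10 separate bands (hypothetical layout, for illustration only)
y <- factor(LETTERS[(floor(x * 260) %% 26) + 1], levels = LETTERS)
sim <- data.frame(x = x, y = y)

# Resubstitution accuracy, computed the same way as in the examples above
acc <- function(fit) mean(predict(fit, type = 'class') == sim$y)

# Default stopping rules (cp = 0.01)
fit_default <- rpart(y ~ ., data = sim, method = 'class')

# Much smaller cp, so the tree is allowed to keep splitting
fit_deep <- rpart(y ~ ., data = sim, method = 'class',
                  control = rpart.control(cp = 0.0001))

acc_default <- acc(fit_default)
acc_deep <- acc(fit_deep)
c(default = acc_default, deep = acc_deep)
```

Running the same comparison on the real data, with lr in place of sim and lettr in place of y, would show whether the default cp is what holds rpart back there. Since these are resubstitution accuracies, a deeper tree can only match or improve on the default one, so a gain by itself says nothing about performance on new data.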
On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:
> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552
Hmm… Not sure what to make of this, but it certainly is thought-provoking.
By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.