**Mad (Data) Scientist**, and kindly contributed to R-bloggers)

In preparing the following example for my forthcoming book, I was startled by how different two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the **rpart** package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:

```
> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871
```

OK, very similar.

Then I tried the well-known Letters Recognition data set from UCI. This too is a classification problem, one class for each of the capital letters ‘A’ through ‘Z’, with 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, **rpart** did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class') > rpypred <- predict(rplr,type='class') > mean(rpypred == lr$lettr) [1] 0.4799

Of course, potential deficiences in CART led the original developers of CART to the notion of random forests, so I gave that a try.

> rflr <- randomForest(lettr ~ .,data=lr) > rfypred <- predict(rflr) > mean(rfypred == lr$lettr) [1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the **knnest** function from my **regtools** package on CRAN).

I speculated that the cause might be that the response variables here (26 of them) are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it is an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.

On that hunch, I tried another implementation of CART, **ctree** from the **partykit** package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit) > ctout <- ctree(lettr ~ .,data=lr) > ctpred <- predict(ctout,lr) > mean(ctpred == lr$lettr) [1] 0.8552

Hmm…Not sure what to make of this, but it certainly is thought-provoking.

By the way, **partykit** includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.

**leave a comment**for the author, please follow the link and comment on their blog:

**Mad (Data) Scientist**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...