# Classification Trees

September 18, 2010
By

(This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers)

Decision trees are applied to situation where data is divided into groups rather than investigating a numerical response and its relationship to a set of descriptor variables. There are various implementations of classification trees in R and the some commonly used functions are rpart and tree.

Fast Tube by Casper

To illustrate the use of the tree function we will use a set of data from the UCI Machine Learning Repository where the objective of the study using this data was to predict the cellular localization sites of proteins.

The data provided on the website is shown here:

Sequence  mcv  gvh  lip chg  aac alm1 alm2 class
1  AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35    cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44    cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46    cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36    cp
5  ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35    cp
6 ALKH_ECOLI 0.67 0.39 0.48 0.5 0.36 0.38 0.46    cp

We can use the xtabs function to summarise the number of cases in each class.

> xtabs( ~ class, data = ecoli.df)
class
cp  im imL imS imU  om omL  pp
143  77   2   2  35  20   5  52

As noted in the comments the package that I used was the tree package:

> require(tree)

The complete classification tree using all variables is fitted to the data initially and then we will try to prune the tree to make it smaller.

> ecoli.tree1 = tree(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2,
data = ecoli.df)
> summary(ecoli.tree1)

Classification tree:
tree(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + alm2,
data = ecoli.df)
Variables actually used in tree construction:
[1] "alm1" "mcv"  "gvh"  "aac"  "alm2"
Number of terminal nodes:  10
Residual mean deviance:  0.7547 = 246 / 326
Misclassification error rate: 0.122 = 41 / 336

The tree function is used in a similar way to other modelling functions in R. The misclassification rate is shown as part of the summary of the tree. This tree can be plotted and annotated with these commands:

> plot(ecoli.tree1)
> text(ecoli.tree1, all = T)

To prune the tree we use cross-validation to identify the point to prune.

> cv.tree(ecoli.tree1)
$size [1] 10 9 8 7 6 5 4 3 2 1$dev
[1]  463.6820  457.4463  447.9824  441.8617  455.8318  478.9234  533.5856  586.2820  713.2992 1040.3878

$k [1] -Inf 12.16500 15.60004 19.21572 34.29868 41.10627 50.57044 64.05494 180.78800 355.67747$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"

This suggests a tree size of 6 and we can re-fit the tree:

> ecoli.tree2 = prune.misclass(ecoli.tree1, best = 6)
> summary(ecoli.tree2)

Classification tree:
snip.tree(tree = ecoli.tree1, nodes = c(4, 20, 7))
Variables actually used in tree construction:
[1] "alm1" "mcv"  "aac"  "gvh"
Number of terminal nodes:  6
Residual mean deviance:  0.9918 = 327.3 / 330
Misclassification error rate: 0.1548 = 52 / 336

The misclassification rate has increased but not substantially with the pruning of the tree.

Other useful resources are provided on the Supplementary Material page.

Data used in this post: Ecoli Data Set.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...