Classification Trees using the rpart function

September 21, 2010
By

(This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers)

In a previous post on classification trees we considered using the tree package to fit a classification tree to data divided into known classes. In this post we will look at the alternative function rpart that is available within the base R distribution.

Fast Tube
Fast Tube by Casper

A classification tree can be fitted using the rpart function using a similar syntax to the tree function. For the ecoli data set discussed in the previous post we would use:

> require(rpart)
> ecoli.df = read.csv("ecoli.txt")

followed by

> ecoli.rpart1 = rpart(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
  data = ecoli.df)

We would then consider whether the tree could be simplified by pruning and make use of the plotcp function:

> plotcp(ecoli.rpart1)

Once the amount of pruning has been determined from this graph or by looking at the output from the printcp function:

> printcp(ecoli.rpart1)
 
Classification tree:
rpart(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + 
    alm2, data = ecoli.df)
 
Variables actually used in tree construction:
[1] aac  alm1 gvh  mcv 
 
Root node error: 193/336 = 0.5744
 
n= 336 
 
        CP nsplit rel error  xerror     xstd
1 0.388601      0   1.00000 1.00000 0.046959
2 0.207254      1   0.61140 0.61658 0.045423
3 0.062176      2   0.40415 0.45596 0.041758
4 0.051813      3   0.34197 0.38342 0.039359
5 0.031088      4   0.29016 0.36269 0.038571
6 0.015544      5   0.25907 0.30570 0.036136
7 0.010000      6   0.24352 0.31088 0.036375

The prune function is used to simplify the tree based on a cp identified from the graph or printed output threshold.

> ecoli.rpart2 = prune(ecoli.rpart1, cp = 0.02)

The classification tree can be visualised with the plot function and then the text function adds labels to the graph:

> plot(ecoli.rpart2, uniform = TRUE)
> text(ecoli.rpart2, use.n = TRUE, cex = 0.75)

Other useful resources are provided on the Supplementary Material page.

To leave a comment for the author, please follow the link and comment on his blog: Software for Exploratory Data Analysis and Statistical Modelling.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , , , , ,

Comments are closed.