# ID3 Classification using data.tree

April 17, 2015
By

(This article was first published on ipub » R, and kindly contributed to R-bloggers)

While preparing this example, I asked my nine-year-old daughter, “Anaïs, imagine you have a basket full of mushrooms. How would you go about finding out which ones are poisonous?”
“Dad, why would I want to know?” she replies, rolling her eyes. “No matter if they are poisonous or not, they are dis-gust-ing.”
“OK, fair enough”, I sigh, “wrong daughter.”
“Naomi”, I turn to my seven year old, “how could you find out if a mushroom is poisonous?”
“Well”, she says. “I’d just eat it. If I die, it’s poisonous.”
“But, come on, there must be a better way! Think, please!”
“Think? Can’t you see that I’m playing? Ask mom if you must!” She groans. So do I.
“Good question, but – … – it doesn’t matter.”
“No Sir, it does matter!”
“I’d have to to know if they are organic, obviously.”
Obviously. I play along, “Well, whatever, assume I found them in the forest, so they are all organic.”
“Aha!”, she says triumphantly. “Then it’s easy! If they are organic, they simply must be good for you. Hence, they cannot be poisonous!”

# Introduction

This introduction provides a stylized example of the capabilities of the R package data.tree. The code for this post was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.

The code is based on data.tree 0.1.6

## What is ID3?

If you are not a blind believer in organic food, and if you are older than 9, there is a non-zero probability that you’d tackle this problem differently than my dear family members. Still, there are dozens of different methods to choose from, but chances are you’d come across a class of models that are called classification trees. These models let you classify observations (e.g. things, outcomes) according to the observations’ qualities, called features. Essentially, all of these models consist of creating a tree, where each node acts as a router. You insert your mushroom instance at the root of the tree, and then, depending on the mushroom’s features (size, points, color, etc.), you follow along a different path, until a leaf node spits out your mushroom’s class, i.e. whether it’s edibel or not.

You might have guessed already that there are two different steps involved in using such a model: training (i.e. constructing the tree), and predicting (i.e. using the tree to predict whether a given mushroom is poisonous). This vignette provides code to do both, using one of the very early algorithms to classify data according to discrete features: ID3.

We will not go into more details about ID3. You will find lots of documentation on this and more refined algorithms on the internet. For example, lecture notes that build on similar data can be found here.

Also, be assured that this example is by no means meant to be used in the real world. It is overly simplistic, and far too little data is used for training. Nor do we claim that this is a complete discussion of the ID3 algorithm, let alone classification models.

On the contrary, the only reason why we chose this example is to provide a simple to grasp application of the data.tree package, in order to demonstrate how easy it is to build hierarchical models with it.

# Feature Selection

As mentioned above, during the prediction step, each node routes our mushroom according to a feature. But how do we chose the feature? Should we first separate our set according to color or size? That is where classification models differ.

In ID3, we pick at each node the feature with the highest Information Gain. In a nutshell, this is the feature which splits the sample in the possibly purest subsets. For example, in the case of mushrooms, “dots” might be a more sensible feature than “organic”.

## Purity and Entropy

We define a subset to be completely pure if it contains only a single class. For example, if a subset contains only poisonous mushrooms, it is completely pure. In R, assuming that the last column contains the class (i.e. the category to be predicted), this can be written as:

IsPure <- function(data) {
length(unique(data[,ncol(data)])) == 1
}

The entropy is a measure of the purity of a dataset.

Entropy <- function( vls ) {
res <- vls/sum(vls) * log2(vls/sum(vls))
res[vls == 0] <- 0
-sum(res)
}

If a dataset is completely pure, then it has entropy 0:

Entropy(c(10, 0))
## [1] 0
Entropy(c(0, 10))
## [1] 0

If a set contains 5 poisonous and 5 edible mushrooms, the entropy becomes 1, as the purity is at its lowest:

Entropy(c(5, 5))
## [1] 1

We can plot the entropy as a function of the number of edible mushrooms in a set of, say, 100 mushrooms:

entropy <- function(edible) Entropy(c(edible, 100 - edible))
entropy <- Vectorize(entropy)
curve( entropy, from = 0, to = 100, xname = 'edible')

## Information Gain

Mathematically, the information gain IG is defined as:

[ IG(T,a) = H(T)-sum_{vin vals(a)}frac{|{textbf{x}in T|x_a=v}|}{|T|} cdot H({textbf{x}in T|x_a=v}) ]

In words, the information gain measures the difference between the entropy before the split, and the weighted sum of the entropies after the split:

So, let’s rewrite that in R:

InformationGain <- function( tble ) {
tble <- as.data.frame.matrix(tble)
entropyBefore <- Entropy(colSums(tble))
s <- rowSums(tble)
entropyAfter <- sum (s / sum(s) * apply(tble, MARGIN = 1, FUN = Entropy ))
informationGain <- entropyBefore - entropyAfter
return (informationGain)
}

For example, using the mushroom data set:

library(data.tree)
data(mushroom)
tble <- table(mushroom[,c('color', 'edibility')])
tble
##        edibility
## color   edible toxic
##   brown      2     0
##   green      1     0
##   red        1     1
InformationGain(tble)
## [1] 0.3219281
InformationGain(table(mushroom[,c('size', 'edibility')]))
## [1] 0.1709506
InformationGain(table(mushroom[,c('points', 'edibility')]))
## [1] 0.3219281

# The ID3 algorithm

## Pseudo code

We are all set for the ID3 training algorithm. We start with the entire training data, and with a root. Then:

1. if the data-set is pure (e.g. all toxic), then
1. construct a leaf having the name of the class (e.g. ‘toxic’)
2. else
1. choose the feature with the highest information gain (e.g. ‘color’)
2. for each value of that feature (e.g. ‘red’, ‘brown’, ‘green’)
1. take the subset of the data-set having that feature value
2. construct a child node having the name of that feature value (e.g. ‘red’)
3. call the algorithm recursively on the child node and the subset

## Implementation in R with the data.tree package

For the following implementation, we assume that the classifying features are in columns 1 to n-1, whereas the class (the edibility) is in the last column.

TrainID3 <- function(node, data) {

node$obsCount <- nrow(data) #if the data-set is pure (e.g. all toxic), then if (IsPure(data)) { #construct a leaf having the name of the pure feature (e.g. 'toxic') child <- node$AddChild(unique(data[,ncol(data)]))
node$feature <- tail(names(data), 1) child$obsCount <- nrow(data)
child$feature <- '' } else { #chose the feature with the highest information gain (e.g. 'color') ig <- sapply(colnames(data)[-ncol(data)], function(x) InformationGain( table(data[,x], data[,ncol(data)]) ) ) feature <- names(ig)[ig == max(ig)][1] node$feature <- feature

#take the subset of the data-set having that feature value
childObs <- split(data[,!(names(data) %in% feature)], data[,feature], drop = TRUE)

for(i in 1:length(childObs)) {
#construct a child having the name of that feature value (e.g. 'red')
child <- node$AddChild(names(childObs)[i]) #call the algorithm recursively on the child and the subset TrainID3(child, childObs[[i]]) } } } ## Training with data We are ready to run the function: tree <- Node$new("mushroom")
TrainID3(tree, mushroom)
print(tree, "feature", "obsCount")
##             levelName   feature obsCount
## 1  mushroom               color        5
## 2   ¦--brown          edibility        2
## 3   ¦   °--edible                      2
## 4   ¦--green          edibility        1
## 5   ¦   °--edible                      1
## 6   °--red                 size        2
## 7       ¦--large      edibility        1
## 8       ¦   °--edible                  1
## 9       °--small      edibility        1
## 10          °--toxic                   1

# Prediction

## The prediction method

Now, let’s predict some classes. For this, we need a predict function, which will route data through our tree:

Predict <- function(tree, features) {
if (tree$children[[1]]$isLeaf) return (tree$children[[1]]$name)
child <- tree$children[[features[[tree$feature]]]]
return ( Predict(child, features))
}

## Using the prediction method

And now we use it to predict:

Predict(tree, c(color = 'red',
size = 'large',
points = 'yes')
)
## [1] "edible"

Oops! Maybe organic wasn’t such a bad predictor, after all

The post ID3 Classification using data.tree appeared first on ipub.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...