Regression tree using Gini’s index

Posted on January 27, 2013 by arthur charpentier in Uncategorized | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In order to illustrate the construction of regression tree (using the CART methodology), consider the following simulated dataset,

> set.seed(1)
> n=200
> X1=runif(n)
> X2=runif(n)
> P=.8*(X1<.3)*(X2<.5)+
+   .2*(X1<.3)*(X2>.5)+
+   .8*(X1>.3)*(X1<.85)*(X2<.3)+
+   .2*(X1>.3)*(X1<.85)*(X2>.3)+
+   .8*(X1>.85)*(X2<.7)+
+   .2*(X1>.85)*(X2>.7) 
> Y=rbinom(n,size=1,P)  
> B=data.frame(Y,X1,X2)

with one dichotomos varible (the variable of interest, $Y$ ), and two continuous ones (the explanatory ones $X_1$ and $X_2$ ).

> tail(B)
    Y        X1        X2
195 0 0.2832325 0.1548510
196 0 0.5905732 0.3483021
197 0 0.1103606 0.6598210
198 0 0.8405070 0.3117724
199 0 0.3179637 0.3515734
200 1 0.7828513 0.1478457

The theoretical partition is the following

Here, the sample can be plotted below (be careful, the first variate is on the y-axis above, and the x-axis below) with blue dots when $Y$ equals one, and red dots when $Y$ is null,

> plot(X1,X2,col="white")
> points(X1[Y=="1"],X2[Y=="1"],col="blue",pch=19)
> points(X1[Y=="0"],X2[Y=="0"],col="red",pch=19)

In order to construct the tree, we need a partition critera. The most standard one is probably Gini’s index, which can be writen, when $X$ ‘s are splited in two classes, denoted here $\{A,B\}$

$L'image “https://i1.wp.com/perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-04.png?w=578” ne peut être affichée car elle contient des erreurs.$

or when $X$ ‘s are splited in three classes, denoted $\{A,B,C\}$
$https://i1.wp.com/perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-07.png?w=578$

etc. Here, $n_{x,y}$ are just counts of observations that belong to partition $x$ such that $Y$ takes value $y$ . But it is possible to consider other criteria, such as the chi-square distance,

$https://i2.wp.com/perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-01.png?w=578$

where, classically

$https://i1.wp.com/perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-02.png?w=578$

when we consider two classes (one knot) or, in the case of three classes (two knots)

$https://i0.wp.com/perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-05.png?w=578$

Here again, the idea is to maximize that distance: the idea is to discriminate, so we want samples as not independent as possible. To compute Gini’s index consider

> GINI=function(y,i){
+ T=table(y,i)
+ nx=apply(T,2,sum)
+ pxy=T/matrix(rep(nx,each=2),2,ncol(T))
+ vxy=pxy*(1-pxy)
+ zx=apply(vxy,2,sum)
+ n=sum(T)
+ -sum(nx/n*zx)
+ }

We simply construct the contingency table, and then, compute the quantity given above. Assume, first, that there is only one explanatory variable. We split the sample in two, with all possible spliting values $s$ , i.e.

$\{[x_{\min},s],[s,x_{\max}]\}$

Then, we compute Gini’s index, for all those values. The knot is the value that maximizes Gini’s index. Once we have our first knot, we keep it (call it, from now on $s^\star$ ). And we reiterate, by seeking the best second choice: given one knot, consider the value that splits the sample in three, and give the highest Gini’s index, Thus, we consider either the following partition

$\{[x_{\min},s],[s,s^\star],[s^\star,x_{\max}]\}$

or this one

$\{[x_{\min},s^\star],[s^\star,s],[s,x_{\max}]\}$

I.e. we cut either below, or above the previous knot. And we iterate. The code can be something like that,

> X=X2
> u=(sort(X)[2:n]+sort(X)[1:(n-1)])/2
> knot=NULL
> for(s in 1:4){
+ vgini=rep(NA,length(u))
+ for(i in 1:length(u)){
+ kn=c(knot,u[i])
+ F=function(x){sum(x<=kn)}
+ I=Vectorize(F)(X)
+ vgini[i]=GINI(Y,I)
+ }
+ plot(u,vgini)
+ k=which.max(vgini)
+ cat("knot",k,u[k],"\n")
+ knot=c(knot,u[k])
+ u=u[-k]
+ }
knot 69 0.3025479 
knot 133 0.5846202 
knot 72 0.3148172 
knot 111 0.4811517

At the first step, the value of Gini’s index was the following,

which was maximal around 0.3. Then, this value is considered as fixed. And we try to construct a partition in three parts (spliting either below or above 0.3). We get the following plot for Gini’s index (as a function of this second knot)

which is maximum when the split the sample around 0.6 (which becomes our second knot). Etc. Now, let us compare our code with the standard R function,

> tree(Y~X2,method="gini")
node), split, n, deviance, yval
      * denotes terminal node

 1) root 200 49.8800 0.4750  
   2) X2 < 0.302548 69 12.8100 0.7536 *
   3) X2 > 0.302548 131 28.8900 0.3282  
     6) X2 < 0.58462 65 16.1500 0.4615  
      12) X2 < 0.324591 7  0.8571 0.1429 *
      13) X2 > 0.324591 58 14.5000 0.5000 *
     7) X2 > 0.58462 66 10.4400 0.1970 *

We do obtain similar knots: the first one is 0.302 and the second one 0.584. So, constructing tree is not that difficult…

Now, what if we consider our two explanatory variables? The story remains the same, except that the partition is now a bit more complex to write. To find the first knot, we consider all values on the two components, and again, keep the one that maximizes Gini’s index,

> n=nrow(B)
> u1=(sort(X1)[2:n]+sort(X1)[1:(n-1)])/2
> u2=(sort(X2)[2:n]+sort(X2)[1:(n-1)])/2
> gini=matrix(NA,nrow(B)-1,2)
> for(i in 1:length(u1)){
+ I=(X1<u1[i])
+ gini[i,1]=GINI(Y,I)
+ I=(X2<u2[i])
+ gini[i,2]=GINI(Y,I)
+ }
> mg=max(gini)
> i=1+sum(mg==max(gini[,2]))
> par(mfrow = c(1, 2))
> plot(u1,gini[,1],ylim=range(gini),col="green",type="b",xlab="X1",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==1){points(u1[which.max(gini[,1])],mg,pch=19,col="red")
+          segments(u1[which.max(gini[,1])],mg,u1[which.max(gini[,1])],-100000)}
> plot(u2,gini[,2],ylim=range(gini),col="green",type="b",xlab="X2",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==2){points(u2[which.max(gini[,2])],mg,pch=19,col="red")
+          segments(u2[which.max(gini[,2])],mg,u2[which.max(gini[,2])],-100000)}
> u2[which.max(gini[,2])]
[1] 0.3025479

The graphs are the following: either we split on the first component (and we obtain the partition on the right, below),

or we split on the second one (and we get the following partition),

Here, it is optimal to split on the second variate, first. And actually, we get back to the one-dimensional case discussed previously: as expected, it is optimal to split around 0.3. This is confirmed with the code below,

> library(tree)
> arbre=tree(Y~X1+X2,data=B,method="gini")
> arbre$frame[1:4,]
     var   n       dev      yval splits.cutleft splits.cutright
1     X2 200 49.875000 0.4750000      <0.302548       >0.302548
2     X1  69 12.811594 0.7536232      <0.800113       >0.800113
4 <leaf>  57  8.877193 0.8070175                               
5 <leaf>  12  3.000000 0.5000000

For the second knot, four cases should be considered: spliting on the second variable (again), either above, or below the previous knot (see below on the left) or spliting on the first one. Then whe have wither a partition below or above the previous knot (see below on the right),

Etc. To visualize the tree, the code is the following

> plot(arbre)
> text(arbre)
> partition.tree(arbre)

Note that we can also visualize the partition. Nice, isn’t it?

To go further, the book Classification and Regression Trees by Leo Breiman (and co-authors) is awesome. Note that there are also interesting sections in the bible Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman (which can be downloaded from http://www.stanford.edu/~hastie/…)

Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.