Party with the First Tribe

Posted on October 22, 2015 by Joseph Rickert in R bloggers | 0 Comments

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

In a recent previous post, I wrote about support vector machines, the representative master algorithm of the 5th tribe of machine learning practitioners described by Pedro Domingos in his book, The Master Algorithm. Here we look into algorithms favored by the first tribe, the symbolists, who see learning as the process of inverse deduction. Pedro writes:

Another limitation of inverse deduction is that it's very computational intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. (p85)

The de facto standard for decision trees or “recursive partitioning” trees as they are known in the literature, is the CART algorithm by Breiman et al. (1984) implemented in R's rpart package. Stripped down to it’s essential structure, CART is a two stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other of the class variables. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm then proceeds in a recursive “greedy” fashion making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias – the algorithm favors features with many possible splits¹. Overfitting occurs because the algorithm has “no concept of statistical significance” ². While overfitting is usually handled with cross validation and pruning there doesn’t seem to be an easy way to deal with selection bias in the CART / rpart framework.

To address these issues Hothorn, Hornik and Zeileis introduced the party package into R about ten years ago which provides an implementation of conditional inference trees. (Unbiased Recursive Partitioning: A Conditional Inference Framework) Party’s ctree() function separates the selection of variables for splitting and the splitting process itself into two different steps and explicitly addresses bias selection by implementing statistical testing and a stopping procedure in the first step. Very roughly, the algorithm proceeds as follows:

Each node of the tree is represented by a set of weights. Then, for each covariate vector X, the algorithm tests the null hypothesis that the dependent variable Y is independent of X. If the hypothesis cannot be rejected then the algorithm stops. Otherwise, the covariate with the strongest association with Y is selected for splitting.
The algorithm performs a split and updates the weights describing the tree.
Steps 1 and 2 are repeated recursively with the new parameter settings.

The details, along with enough theory to use the ctree algorithm with some confidence, are presented in this accessible vignette: “party: A Laboratory for Recursive Partitioning. The following example contrasts the ctree() and rpart() algorithms.

We begin by dividing the segmationData data set that comes with the caret package into training and test sets and fitting a ctree() model to it using the default parameters. No attempt is made to optimize the model. Next, we use the model to predict values of the Class variable on the test data set and calculate the area under the ROC curve to be 0.8326.

# Script to compare ctree with rpart
library(party)
library(rpart)
library(caret)
library(pROC)
 
### Get the Data
# Load the data and construct indices to divide it into training and test data sets.  
 
data(segmentationData)  	# Load the segmentation data set
data <- segmentationData[,3:61]
data$Class <- ifelse(data$Class=="PS",1,0)
#
trainIndex <- createDataPartition(data$Class,p=.7,list=FALSE)
trainData <- data[trainIndex,]
testData  <- data[-trainIndex,]
#------------------------
set.seed(23)
# Fit Conditional Tree Model
ctree.fit <- ctree(Class ~ ., data=trainData)
ctree.fit	
plot(ctree.fit,main="ctree Model")
 
#Make predictions using the test data set
ctree.pred <- predict(ctree.fit,testData)
#Draw the ROC curve 
ctree.ROC <- roc(predictor=as.numeric(ctree.pred),
                 response=testData$Class)
ctree.ROC$auc
#Area under the curve: 0.8326
plot(ctree.ROC,main="ctree ROC")

Here are the text and graphical descriptions of the resulting tree.

1) FiberWidthCh1 <= 9.887543; criterion = 1, statistic = 383.388
  2) TotalIntenCh2 <= 42511; criterion = 1, statistic = 115.137
    3) TotalIntenCh1 <= 39428; criterion = 1, statistic = 20.295
      4)*  weights = 504 
    3) TotalIntenCh1 > 39428
      5)*  weights = 9 
  2) TotalIntenCh2 > 42511
    6) AvgIntenCh1 <= 199.2768; criterion = 1, statistic = 28.037
      7) IntenCoocASMCh3 <= 0.5188792; criterion = 0.99, statistic = 14.022
        8)*  weights = 188 
      7) IntenCoocASMCh3 > 0.5188792
        9)*  weights = 7 
    6) AvgIntenCh1 > 199.2768
      10)*  weights = 36 
1) FiberWidthCh1 > 9.887543
  11) ShapeP2ACh1 <= 1.227156; criterion = 1, statistic = 48.226
    12)*  weights = 169 
  11) ShapeP2ACh1 > 1.227156
    13) IntenCoocContrastCh3 <= 12.32349; criterion = 1, statistic = 22.349
      14) SkewIntenCh4 <= 1.148388; criterion = 0.998, statistic = 16.78
        15)*  weights = 317 
      14) SkewIntenCh4 > 1.148388
        16)*  weights = 109 
    13) IntenCoocContrastCh3 > 12.32349
      17) AvgIntenCh2 <= 244.9512; criterion = 0.999, statistic = 19.382
        18)*  weights = 53 
      17) AvgIntenCh2 > 244.9512
        19)*  weights = 22

Next, we fit an rpart() model to the training data using the default parameter settings and calculate the AUC to be 0.8536 on the test data.

# Fit CART  Model
rpart.fit <- rpart(Class ~ ., data=trainData,cp=0)
rpart.fit		
plot(as.party(rpart.fit),main="rpart Model")
 
#Make predictions using the test data set
rpart.pred <- predict(rpart.fit,testData)
#Draw the ROC curve 
rpart.ROC <- roc(predictor=as.numeric(rpart.pred),
                 response=testData$Class)
rpart.ROC$auc
#Area under the curve: 0.8536
plot(rpart.ROC)

The resulting pruned tree does better than ctree(), but at the expense of building a slightly deeper tree.

1) root 1414 325.211500 0.64144270  
    2) TotalIntenCh2>=42606.5 792 191.635100 0.41035350  
      4) FiberWidthCh1>=11.19756 447  85.897090 0.25950780  
        8) ShapeP2ACh1< 1.225676 155  13.548390 0.09677419 *
        9) ShapeP2ACh1>=1.225676 292  66.065070 0.34589040  
         18) SkewIntenCh4< 1.41772 254  53.259840 0.29921260  
           36) TotalIntenCh4< 127285.5 214  40.373830 0.25233640  
             72) EqEllipseOblateVolCh1>=383.1453 142  19.943660 0.16901410 *
             73) EqEllipseOblateVolCh1< 383.1453 72  17.500000 0.41666670  
              146) AvgIntenCh1>=110.2253 40   6.400000 0.20000000 *
              147) AvgIntenCh1< 110.2253 32   6.875000 0.68750000 *
           37) TotalIntenCh4>=127285.5 40   9.900000 0.55000000 *
         19) SkewIntenCh4>=1.41772 38   8.552632 0.65789470 *
      5) FiberWidthCh1< 11.19756 345  82.388410 0.60579710  
       10) KurtIntenCh1< -0.3447192 121  28.000000 0.36363640  
         20) TotalIntenCh1>=13594 98  19.561220 0.27551020 *
         21) TotalIntenCh1< 13594 23   4.434783 0.73913040 *
       11) KurtIntenCh1>=-0.3447192 224  43.459820 0.73660710  
         22) AvgIntenCh1>=454.3329 7   0.000000 0.00000000 *
         23) AvgIntenCh1< 454.3329 217  39.539170 0.76036870  
           46) VarIntenCh4< 130.9745 141  31.333330 0.66666670  
             92) NeighborAvgDistCh1>=256.5239 30   6.300000 0.30000000 *
             93) NeighborAvgDistCh1< 256.5239 111  19.909910 0.76576580 *
           47) VarIntenCh4>=130.9745 76   4.671053 0.93421050 *
    3) TotalIntenCh2< 42606.5 622  37.427650 0.93569130  
      6) ShapeP2ACh1< 1.236261 11   2.545455 0.36363640 *
      7) ShapeP2ACh1>=1.236261 611  31.217680 0.94599020 *
>

Note, however, that complexity parameter for rpart(), cp, is set to zero rpart() builds a massive tree, a portion of which is shown below, and over fits the data yielding an AUC of 0.806

1) root 1414 325.2115000 0.64144270  
     2) TotalIntenCh2>=42606.5 792 191.6351000 0.41035350  
       4) FiberWidthCh1>=11.19756 447  85.8970900 0.25950780  
         8) ShapeP2ACh1< 1.225676 155  13.5483900 0.09677419  
          16) EntropyIntenCh1>=6.672119 133   7.5187970 0.06015038  
            32) AngleCh1< 108.6438 82   0.0000000 0.00000000 *
            33) AngleCh1>=108.6438 51   6.7450980 0.15686270  
              66) EqEllipseLWRCh1>=1.184478 26   0.9615385 0.03846154  
               132) DiffIntenDensityCh3>=26.47004 19   0.0000000 0.00000000 *
               133) DiffIntenDensityCh3< 26.47004 7   0.8571429 0.14285710 *
              67) EqEllipseLWRCh1< 1.184478 25   5.0400000 0.28000000  
               134) IntenCoocContrastCh3>=9.637027 9   0.0000000 0.00000000 *
               135) IntenCoocContrastCh3< 9.637027 16   3.9375000 0.43750000 *
          17) EntropyIntenCh1< 6.672119 22   4.7727270 0.31818180  
            34) ShapeBFRCh1>=0.6778205 13   0.0000000 0.00000000 *
            35) ShapeBFRCh1< 0.6778205 9   1.5555560 0.77777780 *
         9) ShapeP2ACh1>=1.225676 292  66.0650700 0.34589040  
          18) SkewIntenCh4< 1.41772 254  53.2598400 0.29921260  
            36) TotalIntenCh4< 127285.5 214  40.3738300 0.25233640  
              72) EqEllipseOblateVolCh1>=383.1453 142  19.9436600 0.16901410  
               144) IntenCoocEntropyCh3< 7.059374 133  16.2857100 0.14285710  
                 288) NeighborMinDistCh1>=21.91001 116  11.5431000 0.11206900  
                   576) NeighborAvgDistCh1>=170.2248 108   8.2500000 0.08333333  
                    1152) FiberAlign2Ch4< 1.481728 68   0.9852941 0.01470588  
                      2304) XCentroid>=100.5 61   0.0000000 0.00000000 *
                      2305) XCentroid< 100.5 7   0.8571429 0.14285710 *
                    1153) FiberAlign2Ch4>=1.481728 40   6.4000000 0.20000000  
                      2306) SkewIntenCh1< 0.9963465 27   1.8518520 0.07407407

In practice, rpart()'s complexity parameter (default value cp = .01) is effective in controlling tree growth and overfitting. It does, however, have an "ad hoc" feel to it. In contrast, the ctree() algorithm implements tests for statistical significance within the process of growing a decision tree. It automatically curtails excessive growth, inherently addresses both overfitting and bias and offers the promise of achieving good models with less computation.

Finally, note that rpart() and ctree() construct different trees that offer about the same performance. Some practitioners who value decision trees for their interpretability find this disconcerting. End users of machine learning models often want at story that tells them something true about their customer's behavior or buying preferences etc. But, the likelihood of there being multiple satisfactory answers to a complex problem is inherent to the process of inverse deduction. As Hothorn et al. comment:

Since a key reason for the popularity of tree based methods stems from their ability to represent the estimated regression relationship in an intuitive way, interpretations drawn from regression trees must be taken with a grain of salt.

1 Hothorn et al. (2006): Unbiased Recursive Partitioning: A Conditional Inference Framework J COMPUT GRAPH STAT Vol(15) No(3) Sept 2006

2. Mingers 1987: Expert Systems-Rule Induction with Statistical Data

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Party with the First Tribe

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)