[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers].

Last week I dabbled in building classification trees with the party and rpart packages.  Now, I want to put together a series where I can apply those basic trees along with advanced techniques like bagging, boosting and random forest.  Additionally, I’ve taken the leap and have started using the RStudio interface.  I must say I’m quite pleased with the features, and once you get comfortable it can be quite a time-saver.  Have a look at http://www.rstudio.com/.

Having worked in oncology (sales rep and market research) in a previous lifetime, I wanted to apply the techniques to the Wisconsin Breast Cancer data set.  It is available at the UC-Irvine Machine Learning Repository.  Here is the direct link… http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Prognostic%29 .

All I intend to do in this entry is get the data loaded, create test and training data sets, build a classification tree model using the “party” package and, finally, write the code to produce ROC and lift charts.  In further iterations, I will create additional tree models on this data set and examine which method is best at predicting whether a tumor is benign or malignant.

This code drops the annoying ID variable, which is of no use, and then eliminates the few missing observations, which are of no consequence.

> #dropping ID variable
> breastcancer$ID = NULL
> str(breastcancer)
'data.frame': 699 obs. of 10 variables:
 $ thickness: int  8 6 1 1 1 5 3 3 3 8 ...
 $ Usize    : int  4 6 1 1 1 1 1 1 1 8 ...
 $ Ushape   : int  5 6 1 3 2 1 4 1 3 8 ...
 $ adhesion : int  1 9 1 1 1 1 1 1 1 1 ...
 $ SECS     : int  2 6 1 2 3 2 2 2 2 2 ...
 $ nuclei   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ chromatin: int  7 7 2 2 1 3 3 3 2 6 ...
 $ nucleoli : int  3 8 1 1 1 1 1 1 1 10 ...
 $ mitoses  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 2 1 1 1 1 1 1 1 1 2 ...

> bcancer = na.omit(breastcancer)  #delete missing observations
> str(bcancer)
'data.frame': 683 obs. of 10 variables:
 $ thickness: int  5 8 1 10 7 5 6 5 5 9 ...
 $ Usize    : int  4 10 1 7 3 4 10 2 3 4 ...
 $ Ushape   : int  4 10 1 7 2 4 10 3 4 5 ...
 $ adhesion : int  5 8 1 6 10 9 2 1 1 10 ...
 $ SECS     : int  7 7 2 4 5 2 8 6 8 6 ...
 $ nuclei   : int  10 10 10 10 10 10 10 10 10 10 ...
 $ chromatin: int  3 9 3 4 5 5 7 5 4 4 ...
 $ nucleoli : int  2 7 1 1 4 6 3 1 9 8 ...
 $ mitoses  : int  1 1 1 2 4 1 3 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 1 2 1 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...

Now we create the train and test sets…

> set.seed(1234)
> ind = sample(2, nrow(bcancer), replace=TRUE, prob=c(0.7, 0.3))
> train = bcancer[ind==1,]
> test = bcancer[ind==2,]

Then, we roll up our sleeves, pour a glass of Bourbon and get to work…

> #conditional inference tree with the "party" package
> library(party)
> train_tree = ctree(class ~ ., data=train)
> table(predict(train_tree), train$class)
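As a quick sanity check on the partition (a small aside, reusing the same seed and the 683-row count from above), sample() should put roughly 70% of the rows in partition 1 and 30% in partition 2:

```r
set.seed(1234)
ind = sample(2, 683, replace=TRUE, prob=c(0.7, 0.3))

# proportion of rows landing in each partition; should be near 0.70 / 0.30
round(prop.table(table(ind)), 2)
```

The exact counts will wobble around the 70/30 target because each row is assigned independently, not by a fixed quota.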

            benign malignant
  benign       289         6
  malignant     12       165

So, the model says that 177 patients had malignant tumors whereas in reality only 171 did (12 false positives against 6 missed malignancies).
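From that table we can also back out the usual summary rates.  Here is a quick sketch, re-entering the confusion matrix by hand so the arithmetic is explicit:

```r
# training confusion matrix from above: rows = predicted, columns = actual
cm = matrix(c(289, 12, 6, 165), nrow = 2,
            dimnames = list(predicted = c("benign", "malignant"),
                            actual    = c("benign", "malignant")))

accuracy    = sum(diag(cm)) / sum(cm)                               # (289 + 165) / 472
sensitivity = cm["malignant", "malignant"] / sum(cm[, "malignant"]) # 165 of 171 malignant caught
specificity = cm["benign", "benign"] / sum(cm[, "benign"])          # 289 of 301 benign correct
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

That works out to roughly 96% on all three measures, though in a cancer setting the 6 missed malignancies (sensitivity) are the ones to worry about most.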

Here is a plot of the tree:

> plot(train_tree, type="simple")

OK, now it is time to see how we did on the test data…

> testpred = predict(train_tree, newdata=test)
> table(testpred, test$class)
testpred    benign malignant
  benign       138         3
  malignant      5        65

Quite consistent with what we saw on the train data.  Finally, I will start the code that lets me compare multiple methods.  We first have to gin up the predicted probabilities for the test set using our model with this annoying bit of code (this will be easier with other methods).

> test$probabilities = 1 - unlist(treeresponse(train_tree, newdata=test), use.names=F)[seq(1, nrow(test)*2, 2)]

> library(ROCR)  #loading the package to help us calculate and visualize the classifiers

> pred = prediction(test$probabilities, test$class) #create an object of the predicted probabilities

#create and plot the ROC chart with true positive rate and false positive rate
> perf = performance(pred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)

#create and plot the lift chart with lift and rate of positive predictions

> perf_lift = performance(pred, "lift", "rpp")
> plot(perf_lift, main="lift chart", colorize=T)

Oh, and for good measure I decided to include the area under the curve…

> auc.curve = performance(pred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9854484
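As a sanity check on what ROCR is computing: AUC is equivalent to the probability that a randomly chosen malignant case gets a higher predicted probability than a randomly chosen benign one.  A tiny hand-rolled version (toy scores, not our model's output):

```r
# toy predicted probabilities for benign (0) and malignant (1) cases
scores = c(0.10, 0.20, 0.35, 0.80, 0.90)
labels = c(0,    0,    1,    0,    1)

# AUC = P(score of a positive > score of a negative); ties count one half
pos = scores[labels == 1]
neg = scores[labels == 0]
pairs = outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc = mean(pairs)
auc   # 5 of the 6 positive/negative pairs are ranked correctly
```

Back in the real session, unlist(auc.curve@y.values) collapses the printed slot down to the bare 0.9854484, which is handy when comparing several models in a loop.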

Next time I will introduce random forest and bagging and/or boosting, based on my further research into the subject.

Cheers,

Meister