Ensemble Methods, part 1

Last week I dabbled in building classification trees with the party and rpart packages. Now I want to put together a series where I can apply those basic trees along with advanced techniques like bagging, boosting and random forests. Additionally, I've taken the leap and started using the RStudio interface. I must say I'm quite pleased with the features, and once you get comfortable it can be quite a time-saver. Have a look at http://www.rstudio.com/.

Having worked in oncology (sales rep and market research) in a previous lifetime, I wanted to apply these techniques to the Wisconsin Breast Cancer data set, which is available at the UC Irvine Machine Learning Repository. Here is the direct link: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Prognostic%29

All I intend to do in this entry is get the data loaded, create training and test sets, build a classification tree model using the "party" package and, finally, write the code to produce ROC and lift charts. In further iterations, I will create additional tree models on this data set and examine which method is best at predicting whether a tumor is benign or malignant.
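
The post assumes the data are already sitting in a data frame called breastcancer. For completeness, here is one way the loading step could look; the raw-file URL, the na.strings value and the column names are my assumptions, chosen to match the variable names in the str() output below (I'm assuming the 699-record "Original" Wisconsin file, since that is what the dimensions below match).

> #one possible loading step (not shown in the original post); class codes 2/4 map to benign/malignant per the UCI documentation
> url = "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> breastcancer = read.csv(url, header=FALSE, na.strings="?",
+      col.names=c("ID","thickness","Usize","Ushape","adhesion","SECS","nuclei","chromatin","nucleoli","mitoses","class"))
> breastcancer$class = factor(breastcancer$class, levels=c(2,4), labels=c("benign","malignant"))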

This code drops the annoying ID variable, which is of no use, and then eliminates the few missing observations, which are of no consequence.

> #dropping ID variable
> breastcancer$ID = NULL
> str(breastcancer)
'data.frame':   699 obs. of  10 variables:
 $ thickness: int  8 6 1 1 1 5 3 3 3 8 ...
 $ Usize    : int  4 6 1 1 1 1 1 1 1 8 ...
 $ Ushape   : int  5 6 1 3 2 1 4 1 3 8 ...
 $ adhesion : int  1 9 1 1 1 1 1 1 1 1 ...
 $ SECS     : int  2 6 1 2 3 2 2 2 2 2 ...
 $ nuclei   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ chromatin: int  7 7 2 2 1 3 3 3 2 6 ...
 $ nucleoli : int  3 8 1 1 1 1 1 1 1 10 ...
 $ mitoses  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 2 1 1 1 1 1 1 1 1 2 ...

> bcancer = na.omit(breastcancer)  # delete missing observations
> str(bcancer)
'data.frame':   683 obs. of  10 variables:
 $ thickness: int  5 8 1 10 7 5 6 5 5 9 ...
 $ Usize    : int  4 10 1 7 3 4 10 2 3 4 ...
 $ Ushape   : int  4 10 1 7 2 4 10 3 4 5 ...
 $ adhesion : int  5 8 1 6 10 9 2 1 1 10 ...
 $ SECS     : int  7 7 2 4 5 2 8 6 8 6 ...
 $ nuclei   : int  10 10 10 10 10 10 10 10 10 10 ...
 $ chromatin: int  3 9 3 4 5 5 7 5 4 4 ...
 $ nucleoli : int  2 7 1 1 4 6 3 1 9 8 ...
 $ mitoses  : int  1 1 1 2 4 1 3 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 1 2 1 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...

Now we create the train and test sets…

> set.seed(1234)
> ind = sample(2,nrow(bcancer), replace=TRUE, prob=c(0.7, 0.3))
> train = bcancer[ind==1,]
> test = bcancer[ind==2,]
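
Since the assignment is random, it is worth a quick peek at the class mix in each piece (my addition, not in the original post):

> #sanity check on the roughly 70/30 random split
> prop.table(table(train$class))
> prop.table(table(test$class))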

Then, we roll up our sleeves, pour a glass of Bourbon and get to work…

> #conditional inference tree with the "party" package
> library(party)  #ctree() comes from here
> train_tree = ctree(class ~ ., data=train)
> table(predict(train_tree), train$class)
             benign malignant
  benign        289         6
  malignant      12       165

So the model predicts that 177 patients had malignant tumors, whereas in reality only 171 did.
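
If you want a single summary number for that table, overall accuracy falls straight out of it (my addition; the value is just arithmetic on the counts above):

> #overall accuracy on the training data
> train_table = table(predict(train_tree), train$class)
> sum(diag(train_table)) / sum(train_table)  #(289 + 165) / 472, roughly 0.96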

Here is a plot of the tree:

> plot(train_tree, type="simple")

[Figure: the fitted conditional inference tree, plotted with type="simple"]

OK, now it is time to see how we did on the test data…

> testpred = predict(train_tree, newdata=test)
> table(testpred, test$class)
testpred     benign malignant
  benign        138         3
  malignant       5        65

Quite consistent with what we saw on the training data.
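
To put a number on "consistent" (again my addition), the same accuracy calculation on the test set also lands around 0.96:

> test_table = table(testpred, test$class)
> sum(diag(test_table)) / sum(test_table)  #(138 + 65) / 211, roughly 0.96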

Finally, I will start the code where I can compare multiple methods. We first have to gin up the probabilities for the test set using our model with this annoying bit of code (this will be easier with other methods).

> test$probabilities = 1 - unlist(treeresponse(train_tree, newdata=test), use.names=F)[seq(1, nrow(test)*2, 2)]
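
If that indexing trick looks too clever, here is an equivalent and, to my eye, more readable way to grab the malignant probabilities. It assumes treeresponse() returns one probability vector per observation, ordered benign then malignant, which is exactly what the subscripting above relies on:

> #same result: take the second element (malignant) of each probability vector
> test$probabilities = sapply(treeresponse(train_tree, newdata=test), "[", 2)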

> library(ROCR)  #loading the package to help us calculate and visualize the classifiers

> pred = prediction(test$probabilities, test$class) #create an object of the predicted probabilities

#create and plot the ROC chart with true positive rate and false positive rate
> perf = performance(pred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)

[Figure: ROC curve for the test-set predictions, colorized by cutoff]
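
One small addition that makes the ROC plot easier to read (not in the original post) is the 45-degree chance line:

> abline(a=0, b=1, lty=2)  #reference line: expected performance of random guessing
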
#create and plot the lift chart with lift and rate of positive predictions

> perf_lift = performance(pred, "lift", "rpp")
> plot(perf_lift, main="lift chart", colorize=T)

[Figure: lift chart for the test-set predictions, colorized by cutoff]

Oh, and for good measure I decided to include the area under the curve…

> auc.curve = performance(pred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9854484
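
That slot dump is a bit verbose; the number itself lives in the y.values slot of the performance object, so it can be pulled out directly (my addition, standard ROCR usage):

> unlist(auc.curve@y.values)  #area under the ROC curve as a plain number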

Next time I will introduce random forest and bagging and/or boosting, based on my further research into the subject.

Cheers,

Meister
