Ensemble Methods, part 1

November 17, 2013

(This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers)

Last week I dabbled in building classification trees with the party and rpart packages.  Now I want to put together a series where I can apply those basic trees along with advanced techniques like bagging, boosting and random forests.  Additionally, I've taken the leap and started using the RStudio interface.  I must say I'm quite pleased with the features, and once you get comfortable it can be quite a time-saver.  Have a look at http://www.rstudio.com/.

Having worked in oncology (sales rep and market research) in a previous lifetime, I wanted to apply the techniques to the Wisconsin Breast Cancer data set, available at the UC Irvine Machine Learning Repository.  Here is the direct link: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Prognostic%29

All I intend to do in this entry is get the data loaded, create test and training data sets, build a classification tree model using the "party" package and, finally, write the code to produce ROC and lift charts.  In further iterations, I will create additional tree models on this data set and examine which method is best at predicting whether a tumor is benign or malignant.
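The rest of the post assumes the data are already sitting in a data frame called breastcancer.  If you are following along at home, something like the sketch below should get you there.  The file URL, the column order and the 2/4 class coding are my reading of the UCI repository page, so double-check them against the data description before trusting this:

```r
# Minimal sketch of pulling the data straight from UCI.
# Assumptions: this file path, this column order, "?" marks missing values,
# and class is coded 2 = benign, 4 = malignant -- verify on the UCI page.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
breastcancer = read.csv(url, header = FALSE, na.strings = "?",
                        col.names = c("ID", "thickness", "Usize", "Ushape",
                                      "adhesion", "SECS", "nuclei", "chromatin",
                                      "nucleoli", "mitoses", "class"))
# Recode the numeric class into the factor the rest of the post uses
breastcancer$class = factor(breastcancer$class, levels = c(2, 4),
                            labels = c("benign", "malignant"))
```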

This code drops the annoying ID variable, which is of no use, and then eliminates the few missing observations, which are of no consequence.

> #dropping ID variable
> breastcancer$ID = NULL
> str(breastcancer)
'data.frame': 699 obs. of  10 variables:
 $ thickness: int  8 6 1 1 1 5 3 3 3 8 ...
 $ Usize    : int  4 6 1 1 1 1 1 1 1 8 ...
 $ Ushape   : int  5 6 1 3 2 1 4 1 3 8 ...
 $ adhesion : int  1 9 1 1 1 1 1 1 1 1 ...
 $ SECS     : int  2 6 1 2 3 2 2 2 2 2 ...
 $ nuclei   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ chromatin: int  7 7 2 2 1 3 3 3 2 6 ...
 $ nucleoli : int  3 8 1 1 1 1 1 1 1 10 ...
 $ mitoses  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 2 1 1 1 1 1 1 1 1 2 ...

> bcancer = na.omit(breastcancer)  # delete missing observations
> str(bcancer)
'data.frame': 683 obs. of  10 variables:
 $ thickness: int  5 8 1 10 7 5 6 5 5 9 ...
 $ Usize    : int  4 10 1 7 3 4 10 2 3 4 ...
 $ Ushape   : int  4 10 1 7 2 4 10 3 4 5 ...
 $ adhesion : int  5 8 1 6 10 9 2 1 1 10 ...
 $ SECS     : int  7 7 2 4 5 2 8 6 8 6 ...
 $ nuclei   : int  10 10 10 10 10 10 10 10 10 10 ...
 $ chromatin: int  3 9 3 4 5 5 7 5 4 4 ...
 $ nucleoli : int  2 7 1 1 4 6 3 1 9 8 ...
 $ mitoses  : int  1 1 1 2 4 1 3 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 1 2 1 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...

Now we create the train and test sets…

> set.seed(1234)
> ind = sample(2,nrow(bcancer), replace=TRUE, prob=c(0.7, 0.3))
> train = bcancer[ind==1,]
> test = bcancer[ind==2,]

Then, we roll up our sleeves, pour a glass of Bourbon and get to work…

> #conditional inference tree with the "party" package
> library(party)
> train_tree = ctree(class ~ ., data=train)
> table(predict(train_tree), train$class)
           
            benign malignant
  benign       289         6
  malignant     12       165

So, the model predicts that 177 patients (12 + 165) have malignant tumors, whereas in reality it is only 171 (6 + 165); overall it misclassifies 18 of the 472 training observations.
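As a quick sanity check, the overall training accuracy falls straight out of that table.  Rebuilding the counts by hand just to show the arithmetic:

```r
# Confusion-matrix counts copied from the training table above
train_table = matrix(c(289, 12, 6, 165), nrow = 2,
                     dimnames = list(predicted = c("benign", "malignant"),
                                     actual    = c("benign", "malignant")))
# Correct predictions sit on the diagonal
sum(diag(train_table)) / sum(train_table)  # 454/472, about 0.96
```

With the fitted model in hand you could get the same number directly with mean(predict(train_tree) == train$class).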

Here is a plot of the tree:

> plot(train_tree, type="simple")

OK, now it is time to see how we did on the test data…

> testpred = predict(train_tree, newdata=test)
> table(testpred, test$class)
           
testpred    benign malignant
  benign       138         3
  malignant      5        65

Quite consistent with what we saw on the train data.
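To make "consistent" concrete, here are the headline numbers from the test table, treating malignant as the positive class (counts copied from above):

```r
# Test-set confusion-matrix counts from the table above,
# with "malignant" as the positive class
tp = 65; fn = 3; fp = 5; tn = 138
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 203/211, about 0.962
sensitivity = tp / (tp + fn)                    # about 0.956
specificity = tn / (tn + fp)                    # about 0.965
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

So the test accuracy of roughly 96% matches the training figure almost exactly, which is what we want to see.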

Finally, I will start the code where I can compare multiple methods.  We first have to gin up the probabilities for the test set using our model, with this annoying bit of code (this will be easier with other methods).

> test$probabilities = 1 - unlist(treeresponse(train_tree, newdata=test), use.names=F)[seq(1, nrow(test)*2, 2)]

> library(ROCR)  #loading the package to help us calculate and visualize the classifiers

> pred = prediction(test$probabilities, test$class) #create an object of the predicted probabilities

#create and plot the ROC chart with true positive rate and false positive rate
> perf = performance(pred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)

#create and plot the lift chart with lift and rate of positive predictions

> perf_lift = performance(pred, "lift", "rpp")
> plot(perf_lift, main="lift chart", colorize=T)

Oh, and for good measure I decided to include the area under the curve…

> auc.curve = performance(pred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9854484
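That printed object buries the number in an S4 slot; unlist(auc.curve@y.values) pulls it out as a plain numeric.  Here is a self-contained toy (the scores and labels are made up purely for illustration) showing the same slot-access pattern:

```r
library(ROCR)
# Made-up scores and labels, just to demonstrate pulling AUC out of
# the performance object's y.values slot
toy_pred = prediction(c(0.9, 0.8, 0.3, 0.1), c(1, 1, 0, 0))
toy_auc  = performance(toy_pred, "auc")
unlist(toy_auc@y.values)  # 1, since these toy scores separate the classes perfectly
```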

Next time I will introduce random forests, along with bagging and/or boosting, based on my further research into the subject.

Cheers,

Meister
