Decision making trees and machine learning resources for R

April 30, 2014

(This article was first published on me nugget, and kindly contributed to R-bloggers)

I have recently come across Ricky Ho's blog "Pragmatic Programming Techniques", which seems to be excellent resource for all sorts of aspects regarding data exploration and predictive modelling. The post "Six steps in data science" provides a nice overview to some of the topics covered in the blog. For some reason, this blog does not seem to be listed on R-Bloggers (attn: Tal Galili!).

I was drawn to the page from my interest in understanding Classification and Regression Tree (CART) models, and quickly became amazed by the blog's nice review of some available methods. I was specifically looking for a example that uses Edgar Anderson's iris data set as I find it to be a very understandable example for this type of problem – i.e. Can we develop a model to predict the iris species based on it's morphological characteristics?

Below is a sample of just two methods that were presented using the rpart and randomForest packages. rpart is a referred to as a "Decision Tree" method, while randomForest is an example of a "Tree Ensemble" method. The blog explains many of the pros and cons for each method, and a further post show even further examples of predictive analytics, including Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor approaches (all using the iris data set). I would love to know a bit more about the comparative predictive powers of each of these methods. For the meantime, the example below shows a cross-validation comparison of prediction accuracy for the rpart and randomForest methods using 100 permutations. Half the data set is used as the training set and the other half is used as the validation set.

The results show a slight improvement in accuracy for the randomForest method, especially for the species versicolor and virginica – which are more similar in morphology. This can be see in the degree of overlap in the plot of the first 2 principle components (explaining ~98% of the variance):

Example code:
Read more »

To leave a comment for the author, please follow the link and comment on their blog: me nugget. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)