R User Group Recap: Heatmaps and Using the caret Package

April 10, 2015
By

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.
library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
annRow = iris["Species"], annColors = "Set2",
distfun = "pearson", treeheight=c(200, 50),
fontsize=13, cexCol=.7,
filename="heatmap.png", width=8, height=16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.
First, get a look at the data with the featurePlot function in the caret package:
library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "pairs",
auto.key = list(columns = 3))
Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:
setosa versicolor virginica
Sensitivity 1.00 1.00 0.00
Specificity 1.00 0.50 1.00
Pos Pred Value 1.00 0.50 NaN
Neg Pred Value 1.00 1.00 0.67
Prevalence 0.33 0.33 0.33
Detection Rate 0.33 0.33 0.00
Detection Prevalence 0.33 0.67 0.00
Balanced Accuracy 1.00 0.75 0.50
A big thanks to Mark and Steve at Hemoshear for putting this together!

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)