k-means clustering + heatmap

October 10, 2011
(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)


If you want more info about clustering, I have another post about "Clustering analysis and its implementation in R". Here is the link:  
http://onetipperday.blogspot.com/2012/04/clustering-analysis-2.html
------------

Several R functions are useful for this topic:

1. dist(X)  -- calculates the distances between the rows of data matrix X. The default method is "euclidean"; other options include "maximum", "manhattan", "canberra", "binary" and "minkowski".

> a=matrix(sample(9),nrow=3)
> a
     [,1] [,2] [,3]
[1,]    5    2    9
[2,]    8    7    1
[3,]    6    4    3
> dist(a, diag=T, method='max')
  1 2 3
1 0
2 8 0
3 6 3 0

> dist(a, diag=T, method='euc')
         1        2        3
1 0.000000
2 9.899495 0.000000
3 6.403124 4.123106 0.000000
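As a quick check of what dist() computes, here is a minimal sketch that enters the matrix printed above by hand (sample(9) gives a different matrix on every run, so the values are copied from the output) and recomputes the distance between rows 1 and 2 directly from the definitions. The variable names are only for illustration.

```r
# Matrix from the example above, entered row by row so the result is reproducible
a <- matrix(c(5, 2, 9,
              8, 7, 1,
              6, 4, 3), nrow = 3, byrow = TRUE)

d_euc <- dist(a)                      # Euclidean is the default method
d_max <- dist(a, method = "maximum")  # largest coordinate-wise difference

# Distance between rows 1 and 2, computed directly from the definitions
euc_12 <- sqrt(sum((a[1, ] - a[2, ])^2))
max_12 <- max(abs(a[1, ] - a[2, ]))

# as.matrix() turns the "dist" object into a full symmetric matrix
m_euc <- as.matrix(d_euc)
m_max <- as.matrix(d_max)

print(c(by_hand = euc_12, from_dist = m_euc[1, 2]))
print(c(by_hand = max_12, from_dist = m_max[1, 2]))
```

Converting with as.matrix() is handy whenever you need to look up a single pairwise distance by row index.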
2. hclust(D)  -- hierarchical clustering of a distance/dissimilarity matrix (e.g. the output of dist): it joins the two most similar clusters (as measured by the chosen linkage method) at each step until a single cluster remains.

The result of hclust(D) can be displayed as a tree (dendrogram) using plot(hclust(D)). plclust(hclust(D)) did the same in older versions of R, but plclust has since been removed.
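To see the "join the closest pair, then repeat" behaviour, the sketch below runs hclust on the same 3 x 3 matrix used in the dist example (values entered by hand) and inspects the merge and height components of the result. It assumes the default "complete" linkage.

```r
a <- matrix(c(5, 2, 9,
              8, 7, 1,
              6, 4, 3), nrow = 3, byrow = TRUE)

hc <- hclust(dist(a))   # default linkage method is "complete"

# Each row of merge is one joining step; negative numbers are original rows,
# positive numbers refer to clusters formed at an earlier step.
print(hc$merge)

# Distance at which each join happened. Rows 2 and 3 are joined first because
# they have the smallest pairwise distance.
print(hc$height)

plot(hc)   # draw the dendrogram
```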

3. heatmap(X, distfun = dist, hclustfun = hclust, ...) -- displays matrix X as a color image and reorders its rows and columns by hierarchical clustering, using the given distance and clustering functions.

One enhanced version is heatmap.2 (in the gplots package), which has more options. For example, you can use
  • key, symkey etc. for the legend
  • "col=heat.colors(16)" or "col='greenred', breaks=16" to specify the colors of the image
  • cellnote (a text matrix with the same dimensions as X), notecex, notecol for text in the cells
  • colsep/rowsep to separate blocks, e.g. colsep=c(1,3,6,8) will display a white separator at columns 1, 3, 6 and 8
Both heatmap and heatmap.2 accept ColSideColors/RowSideColors, a vector of colors with one entry per column/row. Here is an example: http://chromium.liacs.nl/R_users/20060207/Renee_graphs_and_others.pdf
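The options above can be tried with base heatmap alone. This is a minimal sketch on made-up data (the matrix, the two column groups and the color names are all assumptions for illustration), passing a custom distfun and hclustfun and marking the column groups with ColSideColors.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical data: 10 rows by 6 columns, with the last 3 columns shifted upwards
m <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("g", 1:10), paste0("s", 1:6)))
m[, 4:6] <- m[, 4:6] + 2

# One color per column, marking the two made-up groups of columns
group_cols <- rep(c("steelblue", "orange"), each = 3)

hm <- heatmap(m,
              distfun = function(x) dist(x, method = "manhattan"),
              hclustfun = function(d) hclust(d, method = "average"),
              ColSideColors = group_cols,
              col = heat.colors(16))

# heatmap() invisibly returns the row and column order used in the plot
print(hm$rowInd)
print(hm$colInd)
```

The same distfun, hclustfun and ColSideColors arguments can be passed to heatmap.2 if gplots is installed.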

Another enhanced version is pheatmap (in the pheatmap package), which produces pretty heatmaps with additional options:
  • cellwidth/cellheight to set the size of cell
  • treeheight_row/treeheight_col: height of tree
  • annotation: a data.frame in which each column annotates the columns of X, so nrow(annotation)==ncol(X) (newer pheatmap versions call this annotation_col)
  • legend/annotation_legend: whether to show legend
  • filename: save to file
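Here is a hedged sketch of how these pheatmap options fit together. The data and the "group" annotation are invented for illustration, and the call is wrapped in requireNamespace() so the snippet still runs when pheatmap is not installed. The row names of the annotation data.frame are set to the column names of the matrix, which is how pheatmap matches the two.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical data: 10 rows by 6 columns
m <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("g", 1:10), paste0("s", 1:6)))

# One annotation row per column of m (made-up groups A and B)
ann <- data.frame(group = factor(rep(c("A", "B"), each = 3)),
                  row.names = colnames(m))
stopifnot(nrow(ann) == ncol(m))

if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(m,
                     annotation = ann,        # column annotation, as described above
                     cellwidth = 20,          # cell size in points
                     cellheight = 12,
                     treeheight_row = 30,     # dendrogram heights in points
                     treeheight_col = 30,
                     legend = TRUE,
                     annotation_legend = TRUE)
} else {
  message("pheatmap is not installed; skipping the plot")
}
```

Adding a filename argument (for example a .pdf or .png path of your choice) would write the figure to a file instead of the current device.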
4. kmeans(X, centers=k) -- partitions the points (the rows of matrix X) into k clusters. For example:

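# a 2-dimensional example: two Gaussian clouds of 50 points each
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))                             # outer parentheses print the result
plot(x, col = cl$cluster)                        # color points by assigned cluster
points(cl$centers, col = 1:2, pch = 8, cex = 2)  # mark the cluster centers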
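Because kmeans starts from randomly chosen centers, two runs can give different results. The sketch below regenerates the same kind of two-cloud data and shows one way to make a run reproducible and more stable, using set.seed and the nstart argument (the seed value and nstart = 10 are arbitrary choices), and then looks at a few components of the returned object.

```r
set.seed(42)  # arbitrary seed so the random data and the random starts are reproducible

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# nstart = 10 tries 10 random sets of starting centers and keeps the best fit
cl <- kmeans(x, centers = 2, nstart = 10)

print(cl$size)          # number of points in each cluster
print(cl$centers)       # one row of coordinates per cluster center
print(cl$tot.withinss)  # total within-cluster sum of squares

# The total sum of squares splits into a between-cluster and a within-cluster part
print(all.equal(cl$betweenss + cl$tot.withinss, cl$totss))
```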
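The number of clusters can be chosen from a plot of the within-group sum of squares against the number of clusters, looking for the "elbow" where the curve flattens, e.g.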

# Determine number of clusters
wss <- (nrow(x)-1)*sum(apply(x,2,var))                        # k = 1: total sum of squares
for (i in 2:20) wss[i] <- sum(kmeans(x,centers=i)$withinss)   # within-group sum of squares for k = i
plot(1:20, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
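A variant of the same idea, as a sketch: compute every point of the curve with kmeans itself, using several random starts per k so the curve is less noisy. The data are regenerated here so the snippet is self-contained, and k_max = 10 and nstart = 10 are arbitrary choices. It also confirms that the k = 1 value equals the total sum of squares formula used above.

```r
set.seed(42)  # arbitrary seed for reproducibility

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

k_max <- 10
# Total within-cluster sum of squares for k = 1, ..., k_max
wss <- sapply(1:k_max, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# With a single cluster, the within-cluster sum of squares is the total sum of squares
total_ss <- (nrow(x) - 1) * sum(apply(x, 2, var))
print(all.equal(wss[1], total_ss))

plot(1:k_max, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

For data made of two well-separated clouds like this, the curve drops sharply from k = 1 to k = 2 and flattens afterwards.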
hclust and cutree can also be used to obtain a given number of clusters:

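hc <- hclust(dist(x), "ward.D") # "ward" was renamed "ward.D" in recent versions of R; "ward.D2" is also available
plot(hc) # the plot can also help to decide the number of clusters
memb <- cutree(hc, k = 2) # cluster membership (1 or 2) for each row of x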
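To compare the two approaches on the same data, this sketch cross-tabulates the membership from cutree against the clusters from kmeans. The data are regenerated so the snippet stands alone. It uses the method name "ward.D", which is what recent versions of R call the old "ward" option, and the seed and nstart values are arbitrary.

```r
set.seed(42)  # arbitrary seed for reproducibility

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Hierarchical clustering, then cut the tree into 2 groups
hc <- hclust(dist(x), method = "ward.D")
memb <- cutree(hc, k = 2)

# Partitioning with k-means on the same data
cl <- kmeans(x, centers = 2, nstart = 10)

# Rows: hclust groups, columns: kmeans clusters.
# Cluster labels are arbitrary, so agreement shows up as one large count
# per row, not necessarily on the diagonal.
tab <- table(hclust = memb, kmeans = cl$cluster)
print(tab)
```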
Note: kmeans clusters by partitioning, while hclust uses hierarchical clustering. Here is a series of nice lectures for this. More details on clustering can be found here: CRAN Task View: Cluster Analysis
