k-means clustering + heatmap
[This article was first published on One Tip Per Day, and kindly contributed to R-bloggers].
For more background on clustering, see my other post, “Clustering analysis and its implementation in R”:
http://onetipperday.blogspot.com/2012/04/clustering-analysis-2.html
Several R functions in this topic:
1. dist(X): calculate the distances between the rows of data matrix X. The default method is "euclidean"; other options include "maximum", "manhattan", "canberra", "binary" and "minkowski".
> a=matrix(sample(9),nrow=3)
> a
     [,1] [,2] [,3]
[1,]    5    2    9
[2,]    8    7    1
[3,]    6    4    3
> dist(a, diag=T, method='max')
  1 2 3
1 0
2 8 0
3 6 3 0
> dist(a, diag=T, method='euc')
         1        2        3
1 0.000000
2 9.899495 0.000000
3 6.403124 4.123106 0.000000
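As a quick check of what dist() computes, the sketch below rebuilds the distances between rows 1 and 2 of the matrix above by hand. The matrix values are copied from the console output (sample() would give different numbers on each run), and as.matrix() is used only to make the result easy to index.

```r
# Matrix copied from the console output above (rows are the observations)
a <- matrix(c(5, 2, 9,
              8, 7, 1,
              6, 4, 3), nrow = 3, byrow = TRUE)

d_euc <- as.matrix(dist(a, method = "euclidean"))
d_max <- as.matrix(dist(a, method = "maximum"))

# Distance between rows 1 and 2, computed by hand
euc_12 <- sqrt(sum((a[1, ] - a[2, ])^2))   # sqrt(9 + 25 + 64)
max_12 <- max(abs(a[1, ] - a[2, ]))        # largest coordinate difference

all.equal(d_euc[2, 1], euc_12)  # TRUE
all.equal(d_max[2, 1], max_12)  # TRUE
```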
2. hclust(D): hierarchical clustering of a distance/dissimilarity matrix (e.g. the output of dist). At each step it joins the two most similar clusters, as judged by the chosen linkage method, until a single cluster remains.
The result of hclust(D) can be displayed as a tree (dendrogram) with plot(hclust(D)). Older versions of R also offered plclust(hclust(D)), which has since been removed.
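Here is a minimal sketch of that workflow on made-up data. The seed, the toy points and the choice of average linkage are all assumptions for illustration, not part of the original post.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# Toy data: 5 points near 0 and 5 points near 5, well separated
X <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2))

D  <- dist(X)                        # pairwise Euclidean distances between rows
hc <- hclust(D, method = "average")  # the default linkage would be "complete"

plot(hc)          # draw the dendrogram
nrow(hc$merge)    # n - 1 merges are needed to reach a single cluster
```

Each row of hc$merge records one join, which is why 10 observations give 9 merges.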
3. heatmap(X, distfun = dist, hclustfun = hclust, …): display matrix X as a color image, with rows and columns reordered by dendrograms computed with the given distance and clustering functions.
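The distfun and hclustfun arguments take functions, so swapping in another distance or linkage might look like the sketch below. The toy matrix, its row/column names and the particular choices (Manhattan distance, average linkage) are assumptions for illustration.

```r
set.seed(2)  # arbitrary seed
# Toy matrix: 10 rows x 6 columns with made-up row and column names
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# Manhattan distance and average linkage instead of the defaults
hm <- heatmap(X,
              distfun   = function(m) dist(m, method = "manhattan"),
              hclustfun = function(d) hclust(d, method = "average"),
              scale     = "row")   # "row" is also heatmap's default scaling

hm$rowInd  # row order after clustering, as indices into X
```

heatmap() returns the row and column orderings invisibly, which is handy if you want to reuse the clustered order elsewhere.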
One enhanced version is heatmap.2 (in the gplots package), which has more options. For example, you can use
- key, symkey etc. for the legend (color key),
- col=heat.colors(16), or col="greenred" with breaks=16, to specify the colors of the image,
- cellnote (a text matrix with the same dimensions as X), notecex and notecol for text in the cells,
- colsep/rowsep to define blocks of separation, e.g. colsep=c(1,3,6,8) draws a separator (white by default) after columns 1, 3, 6 and 8.
Both heatmap and heatmap.2 accept ColSideColors/RowSideColors, a color vector with one entry per column/row of X. Here is an example: http://chromium.liacs.nl/R_users/20060207/Renee_graphs_and_others.pdf
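A sketch of how the heatmap.2 options listed above might be combined is given below. It assumes the gplots package is installed (the call is skipped otherwise); the toy matrix, the two sample groups and the colors are made up for illustration.

```r
set.seed(3)  # arbitrary seed
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# One color per column of X (here: two hypothetical groups of three samples)
col_groups <- rep(c("steelblue", "orange"), each = 3)

if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(X,
                    col           = heat.colors(16),
                    key           = TRUE,         # draw the color key
                    trace         = "none",       # no trace lines over the cells
                    cellnote      = round(X, 1),  # text shown in each cell
                    notecex       = 0.6,
                    notecol       = "black",
                    colsep        = 3,            # separator after the 3rd displayed column
                    ColSideColors = col_groups)
}
```

Note that colsep counts columns in the order they are drawn, that is, after the column dendrogram has reordered them.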
Another enhanced version is pheatmap (from the pheatmap package), which produces pretty heatmaps with additional options:
- cellwidth/cellheight to set the size of cell
- treeheight_row/treeheight_col: height of tree
- annotation: a data.frame in which each column annotates the columns of X, so nrow(annotation) == ncol(X); recent versions of pheatmap call this argument annotation_col
- legend/annotation_legend: whether to show legend
- filename: save to file
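The options above could be used together roughly as follows. This is a sketch that assumes the pheatmap package is installed (the call is skipped otherwise) and uses annotation_col, the name recent versions give to the column annotation; the data and the ctrl/treated grouping are invented for illustration.

```r
set.seed(4)  # arbitrary seed
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# One row of annotation per column of X; row names must match colnames(X)
annotation <- data.frame(group = rep(c("ctrl", "treated"), each = 3),
                         row.names = colnames(X))

if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(X,
                     cellwidth         = 20,
                     cellheight        = 12,
                     treeheight_row    = 30,
                     treeheight_col    = 30,
                     annotation_col    = annotation,  # 'annotation' in older versions
                     legend            = TRUE,
                     annotation_legend = TRUE)
  # add filename = "heatmap.pdf" to save to a file instead of the current device
}
```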
4. kmeans(X, centers=k): partition the points (that is, the rows of matrix X) into k clusters. For example:
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
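It can help to look inside the object kmeans() returns. The sketch below repeats the same kind of toy data with a fixed seed and nstart = 10; both are assumptions added here, since kmeans starts from random centers and a single start can land in a poor solution.

```r
set.seed(42)  # arbitrary seed; kmeans starts from random centers
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

cl <- kmeans(x, centers = 2, nstart = 10)  # keep the best of 10 random starts

table(cl$cluster)   # cluster sizes
cl$centers          # one row per cluster center
cl$tot.withinss     # total within-cluster sum of squares

# The total sum of squares splits into within- and between-cluster parts
all.equal(cl$totss, cl$tot.withinss + cl$betweenss)  # TRUE
```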
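The number of clusters can be chosen from a plot of the within-cluster sum of squares against the number of clusters: look for the bend ("elbow") beyond which adding clusters no longer reduces it much, e.g.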
# Determine number of clusters
wss <- (nrow(x)-1)*sum(apply(x,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(x,centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
hclust combined with cutree can also be used to fix the number of clusters:
hc <- hclust(dist(x), "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0
plot(hc)  # the plot can also help to decide the # of clusters
memb <- cutree(hc, k = 2)
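To see how far the two approaches agree on the same data, one option is to cross-tabulate the cutree memberships against the kmeans clusters. This is only a sketch: the seed and nstart = 10 are assumptions added for reproducibility.

```r
set.seed(42)  # arbitrary seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

hc   <- hclust(dist(x), method = "ward.D")
memb <- cutree(hc, k = 2)             # cluster label (1 or 2) for each row of x
km   <- kmeans(x, centers = 2, nstart = 10)

# Cluster numbers are arbitrary in both methods, so agreement shows up as
# one large count per row of the table, not necessarily on the diagonal
table(hclust = memb, kmeans = km$cluster)
```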
Note: kmeans is a partitioning method, while hclust is a hierarchical clustering method. Here is a series of nice lectures on this. More details on clustering can be found in the CRAN Task View: Cluster Analysis.
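To tie the two halves of this post together, one possible way to show k-means clusters on a heatmap is to sort the rows by cluster and switch off the row dendrogram. The sketch below uses base heatmap() only; the toy data, names and colors are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed
# Toy data: two groups of 10 rows with different means, 4 columns
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 3), ncol = 4))
rownames(X) <- paste0("r", seq_len(nrow(X)))
colnames(X) <- paste0("s", 1:4)

cl  <- kmeans(X, centers = 2, nstart = 10)
ord <- order(cl$cluster)        # row indices grouped by k-means cluster

heatmap(X[ord, ],
        Rowv  = NA,             # keep the k-means ordering, no row dendrogram
        scale = "none",
        RowSideColors = c("steelblue", "orange")[cl$cluster[ord]])
```

The side color bar then marks which block of rows belongs to which cluster, while the columns are still clustered hierarchically.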