Clustering using dynamic tree cut

[This article was first published on Pirate Science » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Summary:

Two methods for hierarchical clustering are introduced: (i) dynamic tree cut; and (ii) dynamic hybrid cut.

Dynamic tree cut is a top-down algorithm that relies solely on the dendrogram. The algorithm implements an adaptive, iterative process of cluster decomposition and combination and stops when the number of clusters becomes stable.

Dynamic hybrid cut is a bottom-up algorithm that improves the detection of outlying members of each cluster. This variant can be considered a hybrid of hierarchical clustering and modified partitioning around medoids (PAM; Kaufman and Rousseeuw, 1990).

The protein-protein interaction network in Drosophila (BioGRID) is analyzed with reference to known gene ontologies. Analysis of simulated gene expression data with known cluster membership demonstrates the superiority of the new methods over both static tree cut (fixed height cutoff) and PAM (objects assigned to closest medoid).

Figure 1.  (A) Average linkage hierarchical clustering using the Topological Overlap Matrix (Yip and Horvath, 2007) and the Dynamic Tree cut applied to the protein–protein interaction network of Drosophila (PPI data from BioGRID, www.thebiogrid.org). Module assignment is depicted by the row of color immediately below the dendrogram, with gray representing unassigned proteins. A functional enrichment analysis has shown that the clusters are significantly enriched with known gene ontologies (Dong and Horvath, 2007). Note that a fixed height cutoff would not be able to identify many of the shown clusters. (B) Hierarchical cluster tree and various cluster detection methods applied to a simulated gene expression data set. The color bands below the dendrogram show the cluster membership according to different clustering methods. The color gray is reserved for genes outside any proper cluster, i.e., the tree cut methods allow for unassigned objects. The first color band ‘Simulated’ shows the simulated true cluster membership; color bands ‘Dynamic Hybrid’ and ‘Dynamic Tree’ show the results of the proposed tree cutting methods; the color band ‘Static @ 0.92’ shows the results of the standard, constant height cut-off method at height 0.92. The height refers to the y axis of the dendrogram. The color band ‘PAM 11’ shows the results of k = 11 medoid clustering.

Figure 1. (A) Average linkage hierarchical clustering using the Topological Overlap Matrix (Yip and Horvath, 2007) and the Dynamic Tree cut applied to the protein–protein interaction network of Drosophila (PPI data from BioGRID, www.thebiogrid.org). Module assignment is depicted by the row of color immediately below the dendrogram, with gray representing unassigned proteins. A functional enrichment analysis has shown that the clusters are significantly enriched with known gene ontologies (Dong and Horvath, 2007). Note that a fixed height cutoff would not be able to identify many of the shown clusters. (B) Hierarchical cluster tree and various cluster detection methods applied to a simulated gene expression data set. The color bands below the dendrogram show the cluster membership according to different clustering methods. The color gray is reserved for genes outside any proper cluster, i.e., the tree cut methods allow for unassigned objects. The first color band ‘Simulated’ shows the simulated true cluster membership; color bands ‘Dynamic Hybrid’ and ‘Dynamic Tree’ show the results of the proposed tree cutting methods; the color band ‘Static @ 0.92’ shows the results of the standard, constant height cut-off method at height 0.92. The height refers to the y axis of the dendrogram. The color band ‘PAM 11’ shows the results of k = 11 medoid clustering.

Abstract:

Summary:  Hierarchical clustering is a widely used method for detecting clusters in genomic data. Clusters are defined by cutting branches off the dendrogram. A common but inflexible method uses a constant height cutoff value; this method exhibits suboptimal performance on complicated dendrograms. We present the Dynamic Tree Cut R package that implements novel dynamic branch cutting methods for detecting clusters in a dendrogram depending on their shape. Compared to the constant height cutoff method, our techniques offer the following advantages: (1) they are capable of identifying nested clusters; (2) they are flexible—cluster shape parameters can be tuned to suit the application at hand; (3) they are suitable for automation; and (4) they can optionally combine the advantages of hierarchical clustering and partitioning around medoids, giving better detection of outliers. We illustrate the use of these methods by applying them to protein–protein interaction network data and to a simulated gene expression data set.

Availability:  The Dynamic Tree Cut method is implemented in an R package available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/BranchCutting

Contact:  [email protected]

Supplementary information:  Supplementary data are available at Bioinformatics online.

Source:

Peter Langfelder, Bin Zhang, Steve Horvath. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.  Bioinformatics (2008) 24(5): 719-720 first published online November 16, 2007 doi:10.1093/bioinformatics/btm563

To leave a comment for the author, please follow the link and comment on their blog: Pirate Science » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)