# finding meaningful clusters in phylogenetic trees or other hierarchical clusterings

**Pirate Science » R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Phylogenetic trees are a specialization of hierarchical clustering which elegantly capture relatedness between observations, grouping like with like. Yet hierarchical clusterings have one common complaint, as compared to density/distribution based clustering, the ability to classify the data into different types. Finding a meaningful cut of the tree into subtrees remains an open question in the field

The prolific Mattias Prosperi (9 publications in the first 9 months of 2012) has proposed a method for automatically partitioning phylogenetic trees of pathogens into transmission clusters. The intuition is that a group of pathogens represent a transmission cluster if their sequences are monophyletic and more closely related than those from two randomly selected individuals.

To capture this intuition, he proposes a depth first search of the phylogeny. A node is classified as a cluster if the median pairwise patristic distance (MPPD) between the nodes of its subtree is below a threshold. Since phylogenetic trees often come with some measurable uncertainty, he further suggests that the node should have high reliability (bootstrap support, SH test, …).

So let’s do it. Fast and dirty. In R. *[The code is available on Figshare, along with sample data.]*

We’ll need the ape and geiger packages to work with the phylogenies, and igraph makes it easy to get the nodes in depth first search order.

1 2 3 | require(ape, quietly=TRUE) require(geiger, quietly=TRUE) require(igraph, quietly=TRUE) |

Notice that the algorithm thresholds based on the MPPD, and that identifying a good threshold may require some tuning. We want the MPPD for the subtree of each internal node. Sapply over the vector of nodes? Hardly efficient, completely ignores the tree structure, but fast to code correctly. I did mention fast and dirty, didn’t I? Which also excuses dumping the results into the global environment rather than using encapsulation. If the tree has more than ~ 150 nodes, it is well worth precomputing this vector.

Prosperi suggests the MPPD of ALL nodes in the subtree, but one could alternately consider only leaf nodes.

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | ## Given ## a node, tree, and distance matrix ## Return ## median pairwise patristic distance (MPPD) of its leaves get.node.leaf.MPPD <- function(node,tree,distmat){ nlist <- node.leaves(tree,node) foo <- distmat[nlist,nlist] return(median(foo[upper.tri(foo,diag=FALSE)])) } ## Given ## a node, tree, and distance matrix ## Return ## median pairwise patristic distance (MPPD) of all of its decendants get.node.full.MPPD <- function(node,tree,distmat){ nlist <- node.leaves(tree,node) elist <- tree$edge[which.edge(tree,nlist),2] foo <- distmat[elist,elist] return(median(foo[upper.tri(foo,diag=FALSE)])) } |

and a helper function to call them:

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | ## Given ## a tree and (optionally) a distance matrix ## Return ## a vector giving the median pairwise ## patristic distance of the subtree under ## each internal node ## SLOW!! May be a good idea to save/cache results pdist.clusttree <- function(tree,distmat=NULL,mode=c('leaf','all')){ mode <- match.arg(mode) if(is.null(distmat)){ if(mode=='leaf'){ distmat <- cophenetic.phylo(tree) } else{ distmat <- dist.nodes(tree) } } ntips<- Ntip(tree) nint <- tree$Nnode ## number of internal nodes if(mode=='leaf'){ return(sapply((ntips+1):(ntips+nint),get.node.leaf.MPPD,tree,distmat)) } else{ return(sapply((ntips+1):(ntips+nint),get.node.full.MPPD,tree,distmat)) } } |

The actual clustering is done as follows:

46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | ## Given ## a tree and a threshold, and (optionally) ## a vector of the MPPD of the subtree at each internal node, ## the reliability threshold and vector, ## and specify if you want the membership for all nodes or ## only the tips ## Return ## a vector indicating, for each internal node, which ## cluster it belongs to prosperi.cluster <- function(tree,thresh,distvec=NULL, rthresh=NULL,reliabilityvec=NULL, retval=c("tips","all")){ ## take care of optional arguments retval <- match.arg(retval) if(is.null(distvec)){ cat("Calculating MPPD for each node ...\n") distvec <- pdist.clusttree(tree,mode='all') } if(is.null(rthresh) || is.null(reliabilityvec)){ reliabilityvec=NULL cat("No reliability thresholding.\n") } ## set up clustering ntips<-Ntip(tree) cnum <- 0 ## cluster number assign <- rep(0,ntips+tree$Nnode) ## cluster assignment igraph.tree <- graph.edgelist(tree$edge) ## tree in igraph form dfs <- graph.dfs(igraph.tree,root=ntips+1,neimode='out', order=TRUE,dist=TRUE) ## travese the tree in depth first order for(i in 1:length(dfs$order)){ node <- dfs$order[i] ## skip leaves if(node < ntips+1){ next } ## skip unreliable nodes (if reliability measure is available) if(! is.null(reliabilityvec) && reliabilityvec[node-ntips] >= rthresh){ next } ## If the node's subtree is below the threshold, mark it and ## its subtree as members of a new cluster if(distvec[node-ntips]<=thresh && assign[node]<=0){ cnum <- cnum+1 subtree <- graph.dfs(igraph.tree,node, neimode='out',unreachable=FALSE)$order subtree <- subtree[! is.nan(subtree)] assign[subtree] <- cnum }} ans <- list(membership=assign,allcsize=table(assign), leafclustsize=table(assign[1:ntips]), ntips=ntips,threshold=thresh) if(retval=="tips"){ans$membership <- ans$membership[1:ntips]} class(ans) <- c(class(ans),'p.cluster') return(ans) } |

And what is the point of it all without a colorful graphic?

99 100 101 102 103 104 105 106 107 | testplot <- function(testtree,thresh,plotfile=NULL){ graphics.off() clustering <- prosperi.cluster(testtree,thresh)$membership if(class(plotfile)=="character"){ png(plotfile) } plot(testtree,show.tip.label=FALSE) tiplabels(rep(" ",Ntip(testtree)), bg=clustering) # better colors if you use RColorBrewer if(class(plotfile)=="character"){ dev.off() } } |

To test it out, we need some data. You can grab a nice tree from the figshare site if you don't have your own handy.

108 | samp <- read.tree('sampbalanced.tre') |

where to threshold?

109 110 111 | quantile(dist.nodes(samp),seq(0.1,0.5,by=0.05)) ## 10% 15% 20% 25% 30% 35% 40% 45% ##0.4643290 0.5712468 0.6980250 0.8462455 1.0000370 1.1277150 1.2418286 1.3255760 |

suggests that 0.46 would restrict clusters to tips which are more closely related than those from 90% of randomly selected individuals. If one wants to be generous, a threshold of 1 clusters individuals more similar than 70% of the population. Here is what it looks like:

112 113 | testplot(samp,0.46) testplot(samp,1.0) |

In the example so far, the tree is small enough that we did not need to precompute the MPPD vector. For a larger tree, however, you will want to do so. Assume (for kicks) that the tree also has the bootstrap support value of each internal node stored as a node label for the tree.

114 115 116 117 118 119 | bigtree <- read.tree("aBigTree.tre") bigtree.mppd <- pdist.clusttree(bigtree,mode="all") bigtree.support <- as.numeric(bigtree$node.label) bigtree.support[is.na(bigtree.support)] <- 0 clustering <- prosperi.cluster(bigtree,0.5,distvec=bigtree.mppd, rthresh=0.90,reliabilityvec=bigtree.support) |

Analyze the clustering as you like (perhaps by making a table of the membership values, perhaps by plotting) to see if the threshold is reasonable.

Source paper for the algorithm:

Prosperi MC, Ciccozzi M, Fanti I, Saladini F, Pecorari M, Borghi V, Di Giambenedetto S, Bruzzone B, Capetti A, Vivarelli A, Rusconi S, Re MC, Gismondo MR, Sighinolfi L, Gray RR, Salemi M, Zazzi M, De Luca A, & ARCA collaborative group (2011). A novel methodology for large-scale phylogeny partition. Nature communications, 2 PMID: 21610724

**Abstract**

Understanding the determinants of virus transmission is a fundamental step for effective design of screening and intervention strategies to control viral epidemics. Phylogenetic analysis can be a valid approach for the identification of transmission chains, and very-large data sets can be analysed through parallel computation. Here we propose and validate a new methodology for the partition of large-scale phylogenies and the inference of transmission clusters. This approach, on the basis of a depth-first search algorithm, conjugates the evaluation of node reliability, tree topology and patristic distance analysis. The method has been applied to identify transmission clusters of a phylogeny of 11,541 human immunodeficiency virus-1 subtype B pol gene sequences from a large Italian cohort. Molecular transmission chains were characterized by means of different clinical/demographic factors, such as the interaction between male homosexuals and male heterosexuals. Our method takes an advantage of a flexible notion of transmission cluster and can become a general framework to analyse other epidemics.

**leave a comment**for the author, please follow the link and comment on their blog:

**Pirate Science » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.