Exploring phylogenetic tree balance metrics

[This article was first published on Recology - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I need to simulate balanced and unbalanced phylogenetic trees for some research I am doing. In order to do this, I do rejection sampling: simulate a tree -> measure tree shape -> reject if not balanced or unbalanced enough. But what is enough? We need to define some cutoff value to determine what will be our set of balanced and unbalanced trees.

A function to calculate shape metrics, and a custom theme for plottingn phylogenies.

foo <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> metric <span class="o">=</span> <span class="s">"colless"</span><span class="p">)</span> <span class="p">{</span>
    <span class="kr">if</span> <span class="p">(</span>metric <span class="o">==</span> <span class="s">"colless"</span><span class="p">)</span> <span class="p">{</span>
        xx <span class="o"><-</span> as.treeshape<span class="p">(</span>x<span class="p">)</span>  <span class="c1"># convert to apTreeshape format</span>
        colless<span class="p">(</span>xx<span class="p">,</span> <span class="s">"yule"</span><span class="p">)</span>  <span class="c1"># calculate colless' metric</span>
    <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span> <span class="p">(</span>metric <span class="o">==</span> <span class="s">"gamma"</span><span class="p">)</span> <span class="p">{</span>
        gammaStat<span class="p">(</span>x<span class="p">)</span>
    <span class="p">}</span> <span class="kr">else</span> stop<span class="p">(</span><span class="s">"metric should be one of colless or gamma"</span><span class="p">)</span>
<span class="p">}</span>

theme_myblank <span class="o"><-</span> <span class="kr">function</span><span class="p">()</span> <span class="p">{</span>
    stopifnot<span class="p">(</span>require<span class="p">(</span>ggplot2<span class="p">))</span>
    theme_blank <span class="o"><-</span> ggplot2<span class="p">::</span>theme_blank
    ggplot2<span class="p">::</span>theme<span class="p">(</span>panel.grid.major <span class="o">=</span> element_blank<span class="p">(),</span> panel.grid.minor <span class="o">=</span> element_blank<span class="p">(),</span> 
        panel.background <span class="o">=</span> element_blank<span class="p">(),</span> plot.background <span class="o">=</span> element_blank<span class="p">(),</span> 
        axis.title.x <span class="o">=</span> element_text<span class="p">(</span>colour <span class="o">=</span> <span class="kc">NA</span><span class="p">),</span> axis.title.y <span class="o">=</span> element_blank<span class="p">(),</span> 
        axis.text.x <span class="o">=</span> element_blank<span class="p">(),</span> axis.text.y <span class="o">=</span> element_blank<span class="p">(),</span> axis.line <span class="o">=</span> element_blank<span class="p">(),</span> 
        axis.ticks <span class="o">=</span> element_blank<span class="p">())</span>
<span class="p">}</span>

Simulate some trees

library<span class="p">(</span>ape<span class="p">)</span>
library<span class="p">(</span>phytools<span class="p">)</span>

numtrees <span class="o"><-</span> <span class="m">1000</span>  <span class="c1"># lets simulate 1000 trees</span>
trees <span class="o"><-</span> pbtree<span class="p">(</span>n <span class="o">=</span> <span class="m">50</span><span class="p">,</span> nsim <span class="o">=</span> numtrees<span class="p">,</span> ape <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span>  <span class="c1"># simulate 500 pure-birth trees with 100 spp each, ape = F makes it run faster</span>

Calculate Colless’ shape metric on each tree

library<span class="p">(</span>plyr<span class="p">)</span>
library<span class="p">(</span>apTreeshape<span class="p">)</span>

colless_df <span class="o"><-</span> ldply<span class="p">(</span>trees<span class="p">,</span> foo<span class="p">,</span> metric <span class="o">=</span> <span class="s">"colless"</span><span class="p">)</span>  <span class="c1"># calculate metric for each tree</span>
head<span class="p">(</span>colless_df<span class="p">)</span>
       V1
1 -0.1761
2  0.2839
3  0.4639
4  0.9439
5 -0.6961
6 -0.1161
<span class="c1"># Calculate the percent of trees that will fall into the cutoff for balanced and unbalanced trees</span>
col_percent_low <span class="o"><-</span> round<span class="p">(</span>length<span class="p">(</span>colless_df<span class="p">[</span>colless_df<span class="p">$</span>V1 <span class="o"><</span> <span class="m">-0.7</span><span class="p">,</span> <span class="s">"V1"</span><span class="p">])</span><span class="o">/</span>numtrees<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span>
col_percent_high <span class="o"><-</span> round<span class="p">(</span>length<span class="p">(</span>colless_df<span class="p">[</span>colless_df<span class="p">$</span>V1 <span class="o">></span> <span class="m">0.7</span><span class="p">,</span> <span class="s">"V1"</span><span class="p">])</span><span class="o">/</span>numtrees<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span>

Create a distribution of the metric values

library<span class="p">(</span>ggplot2<span class="p">)</span>

a <span class="o"><-</span> ggplot<span class="p">(</span>colless_df<span class="p">,</span> aes<span class="p">(</span>V1<span class="p">))</span> <span class="o">+</span>  <span class="c1"># plot histogram of distribution of values</span>
    geom_histogram<span class="p">()</span> <span class="o">+</span> 
    theme_bw<span class="p">(</span>base_size<span class="o">=</span><span class="m">18</span><span class="p">)</span> <span class="o">+</span> 
    scale_x_continuous<span class="p">(</span>limits<span class="o">=</span>c<span class="p">(</span><span class="m">-3</span><span class="p">,</span><span class="m">3</span><span class="p">),</span> breaks<span class="o">=</span>c<span class="p">(</span><span class="m">-3</span><span class="p">,</span><span class="m">-2</span><span class="p">,</span><span class="m">-1</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">))</span> <span class="o">+</span> 
    geom_vline<span class="p">(</span>xintercept <span class="o">=</span> <span class="m">-0.7</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"red"</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">"longdash"</span><span class="p">)</span> <span class="o">+</span>
    geom_vline<span class="p">(</span>xintercept <span class="o">=</span> <span class="m">0.7</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"red"</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">"longdash"</span><span class="p">)</span> <span class="o">+</span>
    ggtitle<span class="p">(</span>paste0<span class="p">(</span><span class="s">"Distribution of Colless' metric for 1000 trees, cutoffs at -0.7 and 0.7 results in\n "</span><span class="p">,</span> col_percent_low<span class="p">,</span> <span class="s">"% ("</span><span class="p">,</span> numtrees<span class="o">*</span><span class="p">(</span>col_percent_low<span class="o">/</span><span class="m">100</span><span class="p">),</span> <span class="s">") 'balanced' trees (left) and "</span><span class="p">,</span> col_percent_low<span class="p">,</span> <span class="s">"% ("</span><span class="p">,</span> numtrees<span class="o">*</span><span class="p">(</span>col_percent_low<span class="o">/</span><span class="m">100</span><span class="p">),</span> <span class="s">") 'unbalanced' trees (right)"</span><span class="p">))</span> <span class="o">+</span>  
    labs<span class="p">(</span>x <span class="o">=</span> <span class="s">"Colless' Metric Value"</span><span class="p">,</span> y <span class="o">=</span> <span class="s">"Number of phylogenetic trees"</span><span class="p">)</span> <span class="o">+</span>
    theme<span class="p">(</span>plot.title  <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">16</span><span class="p">))</span>

a

center

Create phylogenies representing balanced and unbalanced trees (using the custom theme)

library<span class="p">(</span>ggphylo<span class="p">)</span>

b <span class="o"><-</span> ggphylo<span class="p">(</span>trees<span class="p">[</span>which.min<span class="p">(</span>colless_df<span class="p">$</span>V1<span class="p">)],</span> do.plot <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span> <span class="o">+</span> theme_myblank<span class="p">()</span>
c <span class="o"><-</span> ggphylo<span class="p">(</span>trees<span class="p">[</span>which.max<span class="p">(</span>colless_df<span class="p">$</span>V1<span class="p">)],</span> do.plot <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span> <span class="o">+</span> theme_myblank<span class="p">()</span>

b

center

Now, put it all together in one plot using some gridExtra magic.

library<span class="p">(</span>gridExtra<span class="p">)</span>

grid.newpage<span class="p">()</span>
pushViewport<span class="p">(</span>viewport<span class="p">(</span>layout <span class="o">=</span> grid.layout<span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">)))</span>
vpa_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">1</span><span class="p">,</span> height <span class="o">=</span> <span class="m">1</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.5</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.49</span><span class="p">)</span>
vpb_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> height <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.23</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.7</span><span class="p">)</span>
vpc_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> height <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.82</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.7</span><span class="p">)</span>
print<span class="p">(</span>a<span class="p">,</span> vp <span class="o">=</span> vpa_<span class="p">)</span>
print<span class="p">(</span>b<span class="p">,</span> vp <span class="o">=</span> vpb_<span class="p">)</span>
print<span class="p">(</span>c<span class="p">,</span> vp <span class="o">=</span> vpc_<span class="p">)</span>

center

And the same for Gamma stat, which measures the distribution of nodes in time.

gamma_df <span class="o"><-</span> ldply<span class="p">(</span>trees<span class="p">,</span> foo<span class="p">,</span> metric<span class="o">=</span><span class="s">"gamma"</span><span class="p">)</span> <span class="c1"># calculate metric for each tree</span>
gam_percent_low <span class="o"><-</span> round<span class="p">(</span>length<span class="p">(</span>gamma_df<span class="p">[</span>gamma_df<span class="p">$</span>V1 <span class="o"><</span> <span class="m">-1</span><span class="p">,</span> <span class="s">"V1"</span><span class="p">])</span><span class="o">/</span>numtrees<span class="p">,</span> <span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="m">100</span>
gam_percent_high <span class="o"><-</span> round<span class="p">(</span>length<span class="p">(</span>gamma_df<span class="p">[</span>gamma_df<span class="p">$</span>V1 <span class="o">></span> <span class="m">1</span><span class="p">,</span> <span class="s">"V1"</span><span class="p">])</span><span class="o">/</span>numtrees<span class="p">,</span> <span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="m">100</span>
a <span class="o"><-</span> ggplot<span class="p">(</span>gamma_df<span class="p">,</span> aes<span class="p">(</span>V1<span class="p">))</span> <span class="o">+</span>  <span class="c1"># plot histogram of distribution of values</span>
    geom_histogram<span class="p">()</span> <span class="o">+</span> 
    theme_bw<span class="p">(</span>base_size<span class="o">=</span><span class="m">18</span><span class="p">)</span> <span class="o">+</span> 
    scale_x_continuous<span class="p">(</span>breaks<span class="o">=</span>c<span class="p">(</span><span class="m">-3</span><span class="p">,</span><span class="m">-2</span><span class="p">,</span><span class="m">-1</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">))</span> <span class="o">+</span> 
    geom_vline<span class="p">(</span>xintercept <span class="o">=</span> <span class="m">-1</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"red"</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">"longdash"</span><span class="p">)</span> <span class="o">+</span>
    geom_vline<span class="p">(</span>xintercept <span class="o">=</span> <span class="m">1</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"red"</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">"longdash"</span><span class="p">)</span> <span class="o">+</span>
    ggtitle<span class="p">(</span>paste0<span class="p">(</span><span class="s">"Distribution of Gamma metric for 1000 trees, cutoffs at -1 and 1 results in\n "</span><span class="p">,</span> gam_percent_low<span class="p">,</span> <span class="s">"% ("</span><span class="p">,</span> numtrees<span class="o">*</span><span class="p">(</span>gam_percent_low<span class="o">/</span><span class="m">100</span><span class="p">),</span> <span class="s">") trees with deeper nodes (left) and "</span><span class="p">,</span> gam_percent_high<span class="p">,</span> <span class="s">"% ("</span><span class="p">,</span> numtrees<span class="o">*</span><span class="p">(</span>gam_percent_high<span class="o">/</span><span class="m">100</span><span class="p">),</span> <span class="s">") trees with shallower nodes (right)"</span><span class="p">))</span> <span class="o">+</span>  
    labs<span class="p">(</span>x <span class="o">=</span> <span class="s">"Gamma Metric Value"</span><span class="p">,</span> y <span class="o">=</span> <span class="s">"Number of phylogenetic trees"</span><span class="p">)</span> <span class="o">+</span>
    theme<span class="p">(</span>plot.title  <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">16</span><span class="p">))</span>
b <span class="o"><-</span> ggphylo<span class="p">(</span>trees<span class="p">[</span>which.min<span class="p">(</span>gamma_df<span class="p">$</span>V1<span class="p">)],</span> do.plot<span class="o">=</span><span class="k-Variable">F</span><span class="p">)</span> <span class="o">+</span> theme_myblank<span class="p">()</span>
c <span class="o"><-</span> ggphylo<span class="p">(</span>trees<span class="p">[</span>which.max<span class="p">(</span>gamma_df<span class="p">$</span>V1<span class="p">)],</span> do.plot<span class="o">=</span><span class="k-Variable">F</span><span class="p">)</span> <span class="o">+</span> theme_myblank<span class="p">()</span>

grid.newpage<span class="p">()</span>
pushViewport<span class="p">(</span>viewport<span class="p">(</span>layout <span class="o">=</span> grid.layout<span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">)))</span>
vpa_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">1</span><span class="p">,</span> height <span class="o">=</span> <span class="m">1</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.5</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.49</span><span class="p">)</span>
vpb_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> height <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.23</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.7</span><span class="p">)</span>
vpc_ <span class="o"><-</span> viewport<span class="p">(</span>width <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> height <span class="o">=</span> <span class="m">0.35</span><span class="p">,</span> x <span class="o">=</span> <span class="m">0.82</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.7</span><span class="p">)</span>
print<span class="p">(</span>a<span class="p">,</span> vp <span class="o">=</span> vpa_<span class="p">)</span>
print<span class="p">(</span>b<span class="p">,</span> vp <span class="o">=</span> vpb_<span class="p">)</span>
print<span class="p">(</span>c<span class="p">,</span> vp <span class="o">=</span> vpc_<span class="p">)</span>

center


Get the .Rmd file used to create this post at my github account – or .md file.

Written in Markdown, with help from knitr.

To leave a comment for the author, please follow the link and comment on their blog: Recology - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)