Plotting a ggtree and ggplots side by side

[This article was first published on Thomas Hackl | Coding Biologist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Plotting phylogenies and associated data side by side is a good way to explore evolutionary patterns in your data. In this post I will describe my approach for creating such plots in R using ggplot, ggtree, and patchwork.

ggtree itself comes with a few methods (gheatmap, msaplot, facet_plot) to display additional data next to the trees. Those methods, however, always combine the data and the tree within the same plot area. This considerably limits customizability. I, therefore, decided to go down a different route, and combine multiple individual plots in order to retain a high degree of customizability.

Let’s start with a simple example. We need a dummy tree, and two dummy data sets that we want to plot. Both, tree and data use the same labels to refer to the different organisms.

library(tidyverse)
library(ggtree)
library(patchwork)

# a tree
set.seed(1338)
tr <- rtree(10)
# and some dummy data
df1 <- tibble(
  # only some labels match
  label = c(tr$tip.label[sample(6, 6)], "u9", "v9"),
  value = label %>% str_sub(2) %>% as.numeric)
df2 <- tibble(
  label = rep(tr$tip.label, 4),
  category = rep(1:4, each=10),
  value = rnorm(40, 0, 3))

no_legend <- function() theme(legend.position="none")

# plot the tree,
gg_tr <- ggtree(tr) + geom_tiplab(align=TRUE) +
  scale_x_continuous(expand=expand_scale(0.2)) # make more room for the labels
# the data points, the histogram and the heatmap
gg_hist <- ggplot(df1, aes(label, value)) +
  geom_col(aes(fill=substr(label, 1, 1))) + no_legend()
gg_heat <- ggplot(df2, aes(category, label)) + geom_tile(aes(fill=value)) +
  scale_fill_gradient2() + no_legend()

gg_tr + gg_hist + gg_heat + plot_annotation(tag_levels="A")

ggsave("img/ggtree-composite-1.png", type='cairo', width=8, height=4)

ggtree-composite-1.png

So far so good. Now we want to reorder the data in the plots so it aligns with the leaves in our tree.

Under the hood, ggtrees are laid out on a numeric coordinate system. By default, the leaves match whole numbers, from 1 to the number of leaves. The easiest way to align data in other plots to the tree is to match the y-coordinates using common labels. So let's start with that.

tree_y() is a little helper function that takes a ggtree and a data frame with a label column. The function matches the ggtree and the data frame by the label column and returns the new y-coordinates for the data. We will use this function to transform the labels in our data on-the-fly in the ggplot aesthetics argument.

tree_y <-  function(ggtree, data){
  if(!inherits(ggtree, "ggtree"))
    stop("not a ggtree object")
  left_join(select(data, label), select(ggtree$data, label, y)) %>%
    pull(y)
}

# replot histogram and heatmap, match the y-coords to the tree
gg_hist <- ggplot(df1, aes(tree_y(gg_tr, df1), value)) +
  geom_col(aes(fill=substr(label, 1, 1))) + no_legend() +
  coord_flip() # flip this plot
gg_heat <- ggplot(df2, aes(category, y=tree_y(gg_tr, df2))) +
  geom_tile(aes(fill=value)) +
  scale_fill_gradient2() + no_legend()

gg_tr + gg_hist + gg_heat + plot_annotation(tag_levels="A")

ggsave("img/ggtree-composite-2.png", type='cairo', width=8, height=4)

ggtree-composite-2.png

OK, this doesn't look bad at all. The data in all plots is now reordered and matches the order of the tree leaves. However, it doesn't align properly yet. That has two reasons:

First, the plotted data have different y-limits: our first data set is missing data for the top tree leaf (t8), and in the heat map the tiles are centered around the leaf y-coordinates, and actually extend outwards by 0.5 units.

Second, the different plots also have different amounts of expansion space around their outer data points. This is a ggplot feature, and explained under the expand argument of the continuous scales: "The defaults are to expand the scale by 5% on each side for continuous variables, and by 0.6 units on each side for discrete variables".

To address those two related issues, we need a way to control the y-limits and the expansion space around it for the tree and the plots. For the tree, it's quite simple. All we need to do is settle on a fixed expansion space that gives enough room for leaf-centered objects such as tiles and bars in aligned plots. We don't have to do anything about the y-limits of the tree, because those are our reference.

scale_y_tree() is a simple wrapper around scale_y_continuous(), that when applied to the ggtree plot, resets the expansion space around the tree limits to 0.6 units on each side of the leaves.

For the plots, it's a bit more involved, because we want to derive the y-limits from the tree, rather than data. The way I made it work is by writing the wrapper function ggtreeplot(). The function basically behaves like ggplot(), but take a ggtree as an additional argument. It uses the ggtree to a) match the y-coordinates of the data by common labels (just as we did on-the-fly for the previous plots), and b) it computes new y-limits from the tree and adds those and our predefined expansion space of 0.6 to the plot.

The flip parameter is necessary when x and y in the plot will be flipped. Because in that case, we need to set our limits internally on the x-axis, which will later become the y-axis.

# overwrite the default expand for continuous scales
scale_y_tree <- function(expand=expand_scale(0, 0.6), ...){
    scale_y_continuous(expand=expand, ...)
}

# get the range of the ggtree y-axis data
tree_ylim <- function(ggtree){
  if(!inherits(ggtree, "ggtree"))
    stop("not a ggtree object")
  range(ggtree$data$y)
}

# plot data next to a ggtree aligned by shared labels
ggtreeplot <- function(ggtree, data = NULL, mapping = aes(), flip=FALSE,
     expand_limits=expand_scale(0,.6), ...){
  
  if(!inherits(ggtree, "ggtree"))
    stop("not a ggtree object")

  # match the tree limits
  limits <- tree_ylim(ggtree)
  limits[1] <- limits[1] + (limits[1] * expand_limits[1]) - expand_limits[2]
  limits[2] <- limits[2] + (limits[2] * expand_limits[3]) + expand_limits[4]
  
  if(flip){
    mapping <- modifyList(aes_(x=~x), mapping)
    data <- mutate(data, x=tree_y(ggtree, data))
    gg <- ggplot(data=data, mapping = mapping, ...) +
      scale_x_continuous(limits=limits, expand=c(0,0))
  }else{
    mapping <- modifyList(aes_(y=~y), mapping)
    data <- mutate(data, y=tree_y(ggtree, data))
    gg <- ggplot(data=data, mapping = mapping, ...) +
      scale_y_continuous(limits=limits, expand=c(0,0))
  }
  gg
}

# get rid of superfluous axis - this works after coord_flip, so it also works
# for the rotated histogram
no_y_axis <- function () 
  theme(axis.line.y = element_blank(), 
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Now we can plot our tree and the data side by side using our new functions. We plot the tree as before but add the fixed y scale. For the data, we replace the ggplot() call with the ggtreeplot() call. Setting the y-aesthetic (or x in case of flipped plots) is no longer necessary. ggplotree() assumes that it should point to the coordinates we got from matching the labels from the data to the tree.

gg_tr <- ggtree(tr) + geom_tiplab(align=TRUE) +
  scale_x_continuous(expand=expand_scale(0.2)) + # make more room for the labels
  scale_y_tree()
gg_hist <- ggtreeplot(gg_tr, df1, aes(y=value), flip=TRUE) +
  geom_col(aes(fill=substr(label, 1, 1))) + no_legend() +
  coord_flip() + no_y_axis()
gg_heat <- ggtreeplot(gg_tr, df2, aes(x=category)) + geom_tile(aes(fill=value)) +
  scale_fill_gradient2() + no_legend() + no_y_axis() 

gg_tr + gg_hist + gg_heat + plot_annotation(tag_levels="A")

ggsave("img/ggtree-composite-3.png", type='cairo', width=8, height=4)

ggtree-composite-3.png

Et voilà! A tree and two beautiful plots, side by side, and perfectly aligned. The grammar for this final plot is pretty much identical to what we used for the initial unaligned ggtree/ggplots. This makes it very easy to go from a set of generic plots to this composite plot. At the same time, we retain the level modularity and customizability that we had with the original ggplots. This, in my opinion, is what makes this approach powerful.

To leave a comment for the author, please follow the link and comment on their blog: Thomas Hackl | Coding Biologist.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)