Using taxa and metacoder to explore taxonomy of Vancouver’s trees

[This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

From the last post, Vancouver has several common genus in its collection, such as Prunus and Acer. So rather than analyzing the diversity of Vancouver tree species on a species level, we could, with the help of some R packages, visualize the variety of taxonomic groups.

First, we format the Vancouver tree dataset similar to before to get the proper scientific name. This is the same formatting done in the previous post.

library(dplyr)
library(purrr)
library(tidyr)
# read in  trees shp
# read in  trees df
trees_df<-read.csv("StreetTrees_CityWide.csv", stringsAsFactors = F)
trees_df<-trees_df[trees_df$LATITUDE > 40 & trees_df$LONGITUDE < -100,]
trees_df$SCIENTIFIC_NAME <- paste(trees_df$GENUS_NAME, trees_df$SPECIES_NM)
trees_df$GENUS_NAME <- as.character(trees_df$GENUS_NAME)
trees_df$SPECIES_NAME <- as.character(trees_df$SPECIES_NAME)

species_correction <- function(genus, species){
  if(species == "SPECIES"){
    return(genus)
  } else if (length(grep("\\sX$", species)) > 0){
    return(paste(genus," x ", unlist(strsplit(species, " "))[[1]]))
  } else {
    return(paste(genus, species))
  }
}
trees_df$SCIENTIFIC_NAME <- map2_chr(trees_df$GENUS_NAME, trees_df$SPECIES_NAME, species_correction)
head(trees_df)
##   TREE_ID CIVIC_NUMBER       STD_STREET NEIGHBOURHOOD_NAME
## 1     589         4694         W 8TH AV    WEST POINT GREY
## 2     592         1799 W KING EDWARD AV        SHAUGHNESSY
## 3     594         2401         W 1ST AV          KITSILANO
## 5     596         2428         W 1ST AV          KITSILANO
## 6     598         2428         W 1ST AV          KITSILANO
## 7     599         1725        BALSAM ST          KITSILANO
##          ON_STREET ON_STREET_BLOCK STREET_SIDE_NAME ASSIGNED
## 1        BLANCA ST            2400             EVEN        N
## 2 W KING EDWARD AV            1700              MED        N
## 3         W 1ST AV            2400              ODD        N
## 5         W 1ST AV            2400             EVEN        N
## 6         W 1ST AV            2400             EVEN        N
## 7         W 1ST AV            2400             EVEN        N
##   HEIGHT_RANGE_ID DIAMETER DATE_PLANTED PLANT_AREA ROOT_BARRIER CURB
## 1               5     32.0     20180128          B            N    Y
## 2               1      3.0     20100104         30            N    Y
## 3               4     14.0     20180128          B            N    Y
## 5               2     11.0     20180128          B            N    Y
## 6               2      3.0     20180128          B            N    Y
## 7               2     13.5     20180128          B            N    Y
##   CULTIVAR_NAME GENUS_NAME SPECIES_NAME              COMMON_NAME LATITUDE
## 1                    ALNUS       RUGOSA           SPECKLED ALDER 49.26532
## 2                    ABIES      GRANDIS                GRAND FIR 49.24952
## 3                    PINUS        NIGRA            AUSTRIAN PINE 49.27096
## 5                   PRUNUS    SARGENTII SARGENT FLOWERING CHERRY 49.27083
## 6                   PRUNUS    SARGENTII SARGENT FLOWERING CHERRY 49.27084
## 7                    PICEA        ABIES            NORWAY SPRUCE 49.27083
##   LONGITUDE  SCIENTIFIC_NAME
## 1 -123.2150     ALNUS RUGOSA
## 2 -123.1457    ABIES GRANDIS
## 3 -123.1599      PINUS NIGRA
## 5 -123.1602 PRUNUS SARGENTII
## 6 -123.1605 PRUNUS SARGENTII
## 7 -123.1600      PICEA ABIES

To get taxonomic information, we can take advantage of the rOpenSci package taxa. This package provides a function lookup_tax_data which will query NCBI taxonomy. We can then use the package metacoder to visualize a “heat tree” for datasets of taxon. This package was originally developed for next generation sequencing and metabarcoding in identifying species from DNA. So the “heat tree” has been diverted from its original purpose but still works very well.

# install.packages("taxa", "metacoder")
library(taxa)
## Warning: package 'taxa' was built under R version 3.5.1
library(metacoder)
## Warning: package 'metacoder' was built under R version 3.5.1
## This is metacoder verison 0.3.0 (stable). If you use metacoder for published research, please cite our paper:
## 
## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404
## 
## Enter `citation("metacoder")` for a BibTeX entry for this citation.

To save on space, we will make a table containing each species in our Vancouver dataset and the number of times it occurs.

species_df <- as.data.frame(table(trees_df$SCIENTIFIC_NAME))
names(species_df)<- c("SCIENTIFIC_NAME", "count")
head(species_df)
##      SCIENTIFIC_NAME count
## 1              ABIES    33
## 2     ABIES BALSAMEA     7
## 3     ABIES CONCOLOR     7
## 4      ABIES FRASERI     1
## 5      ABIES GRANDIS    65
## 6 ABIES NORDMANNIANA     1

Querying the NCBI servers for information takes a long time depending on the number of request. If you sign up for a NCBI API key (instructions here), and add it to your .Renviron file which is usually located in your home directory.

We are going to use the taxize function classification to obtain the taxonomy of each species. Then, we are inserting the taxonomy for each species back into the trees_df so each tree has an associated taxonomy in column.

library(taxize)

species_classifications <- classification(species_df$SCIENTIFIC_NAME, db = "ncbi", rows = 1)

This creates a list of data frames containing the taxonomy. We can then extract the taxonomy information by iterating over the classification list and drop all species which do not have taxonomy.

class_to_df <- function(x, ranking){
  if(!is.na(x)){ 
  return(x$name[x$rank == ranking])
    }
}

# species_df <- species_df[species_df$SCIENTIFIC_NAME != "PINUS THUNBERGII", ]
species_df$superkingdom <- as.character(sapply(species_classifications, class_to_df, ranking="superkingdom"))
species_df$kingdom <- as.character(sapply(species_classifications, class_to_df, ranking="kingdom"))
species_df$phylum <- as.character(sapply(species_classifications, class_to_df, ranking="phylum"))
species_df$subphylum <- as.character(sapply(species_classifications, class_to_df, ranking="subphylum"))
species_df$subclass <- as.character(sapply(species_classifications, class_to_df, ranking="subclass"))
species_df$order <- as.character(sapply(species_classifications, class_to_df, ranking="order"))
species_df$family <- as.character(sapply(species_classifications, class_to_df, ranking="family"))
species_df$genus <- as.character(sapply(species_classifications, class_to_df, ranking="genus"))
species_df$species <- as.character(sapply(species_classifications, class_to_df, ranking="species"))

species_df <- species_df[species_df$superkingdom != "NULL", ]

We can then join the taxonomy back to the original data set so each tree has its own taxonomy. We also drop the other columns not related to the taxonomy.

trees_df_class <- left_join(trees_df, species_df)
kept_columns <- c('SCIENTIFIC_NAME', 'phylum', 'subphylum', 'order', 'family', 'genus', 'species')
trees_df_class <- trees_df_class[, kept_columns]
## write.csv(trees_df_class, "trees_taxa.csv", row.names=F)

Next, we use the parse_tax_data function to … … parse the taxonomy columns in the data frame into a tax_map object.

By making a ranks vector, we tell parse_tax_data which columns contain the taxonomy and the ranking order.

ranks <- c('phylum', 'subphylum', 'order', 'family', 'genus', 'species')
obj <- parse_tax_data(trees_df_class, class_cols = ranks, named_by_rank = T)
## Warning: The following 7305 of 124349 input indexes have `NA` in their classifications:
##    29, 33, 37, 38, 64 ... 124328, 124329, 124342, 124347

The resulting object returned from parse_tax_data is actually a special R6 object rather than the typical data.frame or S4 object. But since metacoder is based on using taxa, it can directly interface with the R6 object produced and we can produce a “heat tree” without any further manipulation.

set.seed(123)

tax_graph <- obj %>% 
  filter_taxa(nchar(taxon_names)!=0) %>%
  filter_taxa(taxon_ranks == "genus", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_label_size = n_obs, 
            node_color = n_obs, 
            node_size_range = c(0.01, .05),  
            node_color_range = RColorBrewer::brewer.pal(5, "BuGn"), 
            node_label_size_range = c(0.02, 0.05),
            node_color_axis_label = "Number of Trees")

tax_graph

The tax_map is filtered to only the genus level in order to produce a legible visualization compared to a species level heat tree.

Node size is related to number of observations. The most common genus in Vancouver are Prunus and Acer.

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] taxize_0.9.3    metacoder_0.3.0 taxa_0.3.1      tidyr_0.8.1    
## [5] purrr_0.2.5     dplyr_0.7.5    
## 
## loaded via a namespace (and not attached):
##  [1] zoo_1.8-2          tidyselect_0.2.4   xfun_0.3          
##  [4] reshape2_1.4.3     lattice_0.20-35    colorspace_1.3-2  
##  [7] htmltools_0.3.6    yaml_2.1.19        rlang_0.2.2       
## [10] pillar_1.2.3       glue_1.2.0         RColorBrewer_1.1-2
## [13] bindrcpp_0.2.2     foreach_1.4.4      bindr_0.1.1       
## [16] plyr_1.8.4         stringr_1.3.1      ggfittext_0.6.0   
## [19] munsell_0.5.0      blogdown_0.8       gtable_0.2.0      
## [22] codetools_0.2-15   evaluate_0.10.1    labeling_0.3      
## [25] knitr_1.20         parallel_3.5.0     curl_3.2          
## [28] Rcpp_0.12.17       backports_1.1.2    scales_0.5.0      
## [31] jsonlite_1.5       ggplot2_3.0.0      digest_0.6.15     
## [34] stringi_1.1.7      bookdown_0.7.17    grid_3.5.0        
## [37] rprojroot_1.3-2    tools_3.5.0        magrittr_1.5      
## [40] lazyeval_0.2.1     tibble_1.4.2       bold_0.5.0        
## [43] crul_0.5.2         crayon_1.3.4       ape_5.1           
## [46] pkgconfig_2.0.1    data.table_1.11.4  xml2_1.2.0        
## [49] assertthat_0.2.0   rmarkdown_1.10     reshape_0.8.7     
## [52] httr_1.3.1         iterators_1.0.9    R6_2.2.2          
## [55] igraph_1.2.1       nlme_3.1-137       compiler_3.5.0

Notes:

The taxa package has a function lookup_tax_data which queries NCBI for and creates a tax_map object. However, I have problems using tree_df, possibly due to the large amount of associated data which has to go into the tax_map object. Additionally, the taxon_ranks are not created directly when using lookup_tax_data. The example code for using lookup_tax_data is below. Also weirdly, lookup_tax_data breaks when querying for Pinus thunbergii the last three times I tried.

trees_df <- trees_df[trees_df$SCIENTIFIC_NAME != "PINUS THUNBERGII", c("SCIENTIFIC_NAME", "GENUS_NAME")]
tax_env <- lookup_tax_data(trees_df, type = "taxon_name", column = "SCIENTIFIC_NAME", database = "ncbi")

To leave a comment for the author, please follow the link and comment on their blog: R on YIHAN WU.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)