One R package for all your taxonomic needs

December 6, 2012
By

(This article was first published on Recology - R, and kindly contributed to R-bloggers)

UPDATE: there were some errors in the tests for taxize, so the binaries aren't avaiable yet. You can install from source though, see below.

Getting taxonomic information for the set of species you are studying can be a pain in the ass. You have to manually type, or paste in, your species one-by-one. Or, if you are lucky, there is a web service in which you can upload a list of species. Encyclopedia of Life (EOL) has a service where you can do this here. But is this reproducible? No.

Getting your taxonomic information for your species can now be done programatically in R. Do you want to get taxonomic information from ITIS. We got that. Tropicos? We got that too. uBio? No worries, we got that. What about theplantlist.org? Yep, got that. Encyclopedia of Life? Indeed. What about getting sequence data for a taxon? Oh hell yeah, you can get sequences available for a taxon across all genes, or get all records for a taxon for a specific gene.

Of course this is all possible because these data providers have open APIs so that we can facilitate your computer talking to their database. Fun!

Why get your taxonomic data programatically? Because it's 1) faster than by hand in web sites/looking up in books, 2) reproducible, especially if you share your code (damnit!), and 3) you can easily mash up your new taxonomic data to get sequences to build a phylogeny, etc.

I'll give a few examples of using taxize based around use cases, that is, stuff someone might actually do instead of what particular functions do.


Install packages. You can get from CRAN or GitHub.

# install.packages("ritis") # uncomment if not already installed
# install_github('taxize_', 'ropensci') # uncomment if not already installed
# install.packages("taxize", type="source") # uncomment if not already installed
library(ritis)
library(taxize)

Attach family names to a list of species.

I often have a list of species that I studied and simply want to get their family names to, for example, make a table for the paper I'm writing.

# For one species
itis_name(query = "Poa annua", get = "family")
Retrieving data for species ' Poa annua '
[1] "Poaceae"
# For many species
species <- c("Poa annua", "Abies procera", "Helianthus annuus", "Coffea arabica")
famnames <- sapply(species, itis_name, get = "family", USE.NAMES = F)
Retrieving data for species ' Poa annua '
Retrieving data for species ' Abies procera '
Retrieving data for species ' Helianthus annuus '
Retrieving data for species ' Coffea arabica '
data.frame(species = species, family = famnames)
            species     family
1         Poa annua    Poaceae
2     Abies procera   Pinaceae
3 Helianthus annuus Asteraceae
4    Coffea arabica  Rubiaceae

Resolve taxonomic names.

This is a common use case for ecologists/evolutionary biologists, or at least should be. That is, species names you have for your own data, or when using other's data, could be old names - and if you need the newest names for your species list, how can you make this as painless as possible? You can query taxonomic data from many different sources with taxize.

# The iPlantCollaborative provides access via API to their taxonomic name
# resolution service (TNRS)
mynames <- c("shorea robusta", "pandanus patina", "oryza sativa", "durio zibethinus", 
    "rubus ulmifolius", "asclepias curassavica", "pistacia lentiscus")
iplant_tnrsmatch(retrieve = "all", taxnames = c("helianthus annuus", "acacia", 
    "gossypium"), output = "names")
       AcceptedName   MatchFam MatchGenus MatchScore    Accept?
1 Helianthus annuus Asteraceae Helianthus          1 No opinion
2            Acacia   Fabaceae     Acacia          1 No opinion
3                                  Acacia          1 No opinion
4         Gossypium  Malvaceae  Gossypium          1 No opinion
     SubmittedNames
1 helianthus annuus
2            acacia
3            acacia
4         gossypium
# The global names resolver is another attempt at this, hitting many
# different data sources
gnr_resolve(names = c("Helianthus annuus", "Homo sapiens"), returndf = TRUE)
   data_source_id    submitted_name       name_string score
1               4 Helianthus annuus Helianthus annuus 0.988
3              10 Helianthus annuus Helianthus annuus 0.988
5              12 Helianthus annuus Helianthus annuus 0.988
8             110 Helianthus annuus Helianthus annuus 0.988
11            159 Helianthus annuus Helianthus annuus 0.988
13            166 Helianthus annuus Helianthus annuus 0.988
15            169 Helianthus annuus Helianthus annuus 0.988
2               4      Homo sapiens      Homo sapiens 0.988
4              10      Homo sapiens      Homo sapiens 0.988
6              12      Homo sapiens      Homo sapiens 0.988
7             107      Homo sapiens      Homo sapiens 0.988
9             122      Homo sapiens      Homo sapiens 0.988
10            123      Homo sapiens      Homo sapiens 0.988
12            159      Homo sapiens      Homo sapiens 0.988
14            168      Homo sapiens      Homo sapiens 0.988
16            169      Homo sapiens      Homo sapiens 0.988
                     title
1                     NCBI
3                 Freebase
5                      EOL
8     Illinois Wildflowers
11                 CU*STAR
13                   nlbif
15           uBio NameBank
2                     NCBI
4                 Freebase
6                      EOL
7                AskNature
9                 BioPedia
10                   AnAge
12                 CU*STAR
14 Index to Organism Names
16           uBio NameBank
# We can hit the Plantminer API too
plants <- c("Myrcia lingua", "Myrcia bella", "Ocotea pulchella", "Miconia", 
    "Coffea arabica var. amarella", "Bleh")
plantminer(plants)
Myrcia lingua 
Myrcia bella 
Ocotea pulchella 
Miconia 
Coffea arabica var. amarella 
Bleh 
              fam   genus                    sp             author
1       Myrtaceae  Myrcia                lingua   (O. Berg) Mattos
2       Myrtaceae  Myrcia                 bella           Cambess.
3       Lauraceae  Ocotea             pulchella (Nees & Mart.) Mez
4 Melastomataceae Miconia                    NA        Ruiz & Pav.
5       Rubiaceae  Coffea arabica var. amarella        A. Froehner
6              NA    Bleh                    NA                 NA
            source source.id   status confidence suggestion       database
1              TRO 100227036       NA         NA         NA       Tropicos
2             WCSP    131057 Accepted          H         NA The Plant List
3 WCSP (in review)    989758 Accepted          M         NA The Plant List
4              TRO  40018467       NA         NA         NA       Tropicos
5              TRO 100170231       NA         NA         NA       Tropicos
6               NA        NA       NA         NA       Baea             NA
# We made a light wrapper around the Taxonstand package to search
# Theplantlist.org too
splist <- c("Heliathus annuus", "Abies procera", "Poa annua", "Platanus occidentalis", 
    "Carex abrupta", "Arctostaphylos canescens", "Ocimum basilicum", "Vicia faba", 
    "Quercus kelloggii", "Lactuca serriola")
tpl_search(taxon = splist)
            Genus      Species Infraspecific Plant.Name.Index
1       Heliathus       annuus                          FALSE
2           Abies      procera                           TRUE
3             Poa        annua                           TRUE
4        Platanus occidentalis                           TRUE
5           Carex      abrupta                           TRUE
6  Arctostaphylos    canescens                           TRUE
7          Ocimum    basilicum                           TRUE
8           Vicia         faba                           TRUE
9         Quercus    kelloggii                           TRUE
10        Lactuca     serriola                           TRUE
   Taxonomic.status      Family      New.Genus  New.Species
1                                    Heliathus       annuus
2          Accepted    Pinaceae          Abies      procera
3          Accepted     Poaceae            Poa        annua
4          Accepted Platanaceae       Platanus occidentalis
5          Accepted  Cyperaceae          Carex      abrupta
6          Accepted   Ericaceae Arctostaphylos    canescens
7          Accepted   Lamiaceae         Ocimum    basilicum
8          Accepted Leguminosae          Vicia         faba
9          Accepted    Fagaceae        Quercus    kelloggii
10         Accepted  Compositae        Lactuca     serriola
   New.Infraspecific Authority  Typo WFormat
1                              FALSE   FALSE
2                       Rehder FALSE   FALSE
3                           L. FALSE   FALSE
4                           L. FALSE   FALSE
5               <NA>     Mack. FALSE   FALSE
6                       Eastw. FALSE   FALSE
7                           L. FALSE   FALSE
8                           L. FALSE   FALSE
9               <NA>     Newb. FALSE   FALSE
10                          L. FALSE   FALSE

Taxonomic hierarchies

I often want the full taxonomic hierarchy for a set of species. That is, give me the family, order, class, etc. for my list of species. There are two different easy ways to do this with taxize. The first example uses EOL.


Using EOL.

pageid <- eol_search("Quercus douglasii")$id[1]  # first need to search for the taxon's page on EOL
out <- eol_pages(taxonconceptID = pageid)  # then we nee to get the taxon ID used by EOL

# Notice that there are multiple different sources you can pull the
# hierarchy from. Note even that you can get the hierarchy from the ITIS
# service via this EOL API.
out
  identifier                 scientificName
1   46203061 Quercus douglasii Hook. & Arn.
2   48373995 Quercus douglasii Hook. & Arn.
                                  nameAccordingTo sourceIdentfier
1  Integrated Taxonomic Information System (ITIS)           19322
2 Species 2000 & ITIS Catalogue of Life: May 2012         9723391
  taxonRank
1   Species
2   Species
# Then the hierarchy!
eol_hierarchy(out[out$nameAccordingTo == "Species 2000 & ITIS Catalogue of Life: May 2012", 
    "identifier"])
  sourceIdentifier  taxonID parentNameUsageID taxonConceptID
1         11017504 48276627                 0            281
2         11017505 48276628          48276627            282
3         11017506 48276629          48276628            283
4         11022500 48373354          48276629           4184
5         11025284 48373677          48373354           4197
  scientificName taxonRank
1        Plantae   kingdom
2  Magnoliophyta    phylum
3  Magnoliopsida     class
4        Fagales     order
5       Fagaceae    family
eol_hierarchy(out[out$nameAccordingTo == "Integrated Taxonomic Information System (ITIS)", 
    "identifier"])  # and from ITIS, slightly different than ITIS output below, which includes taxa all the way down.
   sourceIdentifier  taxonID parentNameUsageID taxonConceptID
1            202422 46150613                 0            281
2            846492 46159776          46150613        8654492
3            846494 46161961          46159776       28818077
4            846496 46167532          46161961           4494
5            846504 46169010          46167532       28825126
6            846505 46169011          46169010            282
7             18063 46169012          46169011            283
8            846548 46202954          46169012       28859070
9             19273 46202955          46202954           4184
10            19275 46203022          46202955           4197
    scientificName     taxonRank
1          Plantae       kingdom
2   Viridaeplantae    subkingdom
3     Streptophyta  infrakingdom
4     Tracheophyta      division
5  Spermatophytina   subdivision
6     Angiospermae infradivision
7    Magnoliopsida         class
8          Rosanae    superorder
9          Fagales         order
10        Fagaceae        family

And getting a taxonomic hierarchy using ITIS.

# First, get the taxonomic serial number (TSN) that ITIS uses
mytsn <- get_tsn("Quercus douglasii", "sciname")
Retrieving data for species ' Quercus douglasii '
# Get the full taxonomic hierarchy for a taxon from the TSN
itis(mytsn, "getfullhierarchyfromtsn")
$`1`
        parentName parentTsn      rankName         taxonName    tsn
1                                  Kingdom           Plantae 202422
2          Plantae    202422    Subkingdom    Viridaeplantae 846492
3   Viridaeplantae    846492  Infrakingdom      Streptophyta 846494
4     Streptophyta    846494      Division      Tracheophyta 846496
5     Tracheophyta    846496   Subdivision   Spermatophytina 846504
6  Spermatophytina    846504 Infradivision      Angiospermae 846505
7     Angiospermae    846505         Class     Magnoliopsida  18063
8    Magnoliopsida     18063    Superorder           Rosanae 846548
9          Rosanae    846548         Order           Fagales  19273
10         Fagales     19273        Family          Fagaceae  19275
11        Fagaceae     19275         Genus           Quercus  19276
12         Quercus     19276       Species Quercus douglasii  19322
# But this can be even easier!
classification(get_tsn("Quercus douglasii"))  # Boom!
Retrieving data for species ' Quercus douglasii '
$`1`
        parentName parentTsn      rankName         taxonName    tsn
1                                  Kingdom           Plantae 202422
2          Plantae    202422    Subkingdom    Viridaeplantae 846492
3   Viridaeplantae    846492  Infrakingdom      Streptophyta 846494
4     Streptophyta    846494      Division      Tracheophyta 846496
5     Tracheophyta    846496   Subdivision   Spermatophytina 846504
6  Spermatophytina    846504 Infradivision      Angiospermae 846505
7     Angiospermae    846505         Class     Magnoliopsida  18063
8    Magnoliopsida     18063    Superorder           Rosanae 846548
9          Rosanae    846548         Order           Fagales  19273
10         Fagales     19273        Family          Fagaceae  19275
11        Fagaceae     19275         Genus           Quercus  19276
12         Quercus     19276       Species Quercus douglasii  19322
# You can also do this easy-peasy route to a taxonomic hierarchy using
# uBio
classification(get_uid("Ornithorhynchus anatinus"))
$`1`
       ScientificName         Rank    UID
1  cellular organisms      no rank 131567
2           Eukaryota superkingdom   2759
3        Opisthokonta      no rank  33154
4             Metazoa      kingdom  33208
5           Eumetazoa      no rank   6072
6           Bilateria      no rank  33213
7           Coelomata      no rank  33316
8       Deuterostomia      no rank  33511
9            Chordata       phylum   7711
10           Craniata    subphylum  89593
11         Vertebrata      no rank   7742
12      Gnathostomata   superclass   7776
13         Teleostomi      no rank 117570
14       Euteleostomi      no rank 117571
15      Sarcopterygii      no rank   8287
16          Tetrapoda      no rank  32523
17            Amniota      no rank  32524
18           Mammalia        class  40674
19        Prototheria      no rank   9254
20        Monotremata        order   9255
21  Ornithorhynchidae       family   9256
22    Ornithorhynchus        genus   9257

Sequences?

While you are at doing taxonomic stuff, you often wonder "hmmm, I wonder if there are any sequence data available for my species?" So, you can use get_seqs to search for specific genes for a species, and get_genes_avail to find out what genes are available for a certain species. These functions search for data on NCBI.

# Get sequences (sequence is provied in output, but hiding here for
# brevity). What's nice about this is that it gets the longest sequence
# avaialable for the gene you searched for, and if there isn't anything
# available, it lets you get a sequence from a congener if you set
# getrelated=TRUE. The last column in the output data.frame also tells you
# what species the sequence is from.
out <- get_seqs(taxon_name = "Acipenser brevirostrum", gene = c("5S rRNA"), 
    seqrange = "1:3000", getrelated = T, writetodf = F)
out[, !names(out) %in% "sequence"]
                   taxon                                         gene_desc
1 Acipenser brevirostrum Acipenser brevirostrum 5S rRNA gene, clone BRE92A
     gi_no     acc_no length                 spused
1 60417159 AJ745069.1    121 Acipenser brevirostrum
# Search for available sequences
out <- get_genes_avail(taxon_name = "Umbra limi", seqrange = "1:2000", getrelated = F)
out[grep("RAG1", out$genesavail, ignore.case = T), ]  # does the string 'RAG1' exist in any of the gene names
        spused length
414 Umbra limi    732
427 Umbra limi    959
434 Umbra limi   1631
                                                                            genesavail
414 isolate UlimA recombinase activating protein 1 (rag1) gene, exon 3 and partial cds
427           recombination-activating protein 1 (RAG1) gene, intron 2 and partial cds
434                        recombination-activating protein 1 (RAG1) gene, partial cds
    predicted
414  JX190826
427  AY459526
434  AY380548

Get the .Rmd file used to create this post at my github account - or .md file.

Written in Markdown, with help from knitr.

To leave a comment for the author, please follow the link and comment on his blog: Recology - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.