One R package for all your taxonomic needs
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
UPDATE: there were some errors in the tests for taxize
, so the binaries aren’t avaiable yet. You can install from source though, see below.
Getting taxonomic information for the set of species you are studying can be a pain in the ass. You have to manually type, or paste in, your species one-by-one. Or, if you are lucky, there is a web service in which you can upload a list of species. Encyclopedia of Life (EOL) has a service where you can do this here. But is this reproducible? No.
Getting your taxonomic information for your species can now be done programatically in R. Do you want to get taxonomic information from ITIS. We got that. Tropicos? We got that too. uBio? No worries, we got that. What about theplantlist.org? Yep, got that. Encyclopedia of Life? Indeed. What about getting sequence data for a taxon? Oh hell yeah, you can get sequences available for a taxon across all genes, or get all records for a taxon for a specific gene.
Of course this is all possible because these data providers have open APIs so that we can facilitate your computer talking to their database. Fun!
Why get your taxonomic data programatically? Because it’s 1) faster than by hand in web sites/looking up in books, 2) reproducible, especially if you share your code (damnit!), and 3) you can easily mash up your new taxonomic data to get sequences to build a phylogeny, etc.
I’ll give a few examples of using taxize
based around use cases, that is, stuff someone might actually do instead of what particular functions do.
Install packages. You can get from CRAN or GitHub.
<span class="c1"># install.packages("ritis") # uncomment if not already installed</span>
<span class="c1"># install_github('taxize_', 'ropensci') # uncomment if not already installed</span>
<span class="c1"># install.packages("taxize", type="source") # uncomment if not already installed</span>
library<span class="p">(</span>ritis<span class="p">)</span>
library<span class="p">(</span>taxize<span class="p">)</span>
Attach family names to a list of species.
I often have a list of species that I studied and simply want to get their family names to, for example, make a table for the paper I’m writing.
<span class="c1"># For one species</span>
itis_name<span class="p">(</span>query <span class="o">=</span> <span class="s">"Poa annua"</span><span class="p">,</span> get <span class="o">=</span> <span class="s">"family"</span><span class="p">)</span>
Retrieving data for species ' Poa annua '
[1] "Poaceae"
<span class="c1"># For many species</span>
species <span class="o"><-</span> c<span class="p">(</span><span class="s">"Poa annua"</span><span class="p">,</span> <span class="s">"Abies procera"</span><span class="p">,</span> <span class="s">"Helianthus annuus"</span><span class="p">,</span> <span class="s">"Coffea arabica"</span><span class="p">)</span>
famnames <span class="o"><-</span> sapply<span class="p">(</span>species<span class="p">,</span> itis_name<span class="p">,</span> get <span class="o">=</span> <span class="s">"family"</span><span class="p">,</span> USE.NAMES <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span>
Retrieving data for species ' Poa annua '
Retrieving data for species ' Abies procera '
Retrieving data for species ' Helianthus annuus '
Retrieving data for species ' Coffea arabica '
data.frame<span class="p">(</span>species <span class="o">=</span> species<span class="p">,</span> family <span class="o">=</span> famnames<span class="p">)</span>
species family
1 Poa annua Poaceae
2 Abies procera Pinaceae
3 Helianthus annuus Asteraceae
4 Coffea arabica Rubiaceae
Resolve taxonomic names.
This is a common use case for ecologists/evolutionary biologists, or at least should be. That is, species names you have for your own data, or when using other’s data, could be old names – and if you need the newest names for your species list, how can you make this as painless as possible? You can query taxonomic data from many different sources with taxize
.
<span class="c1"># The iPlantCollaborative provides access via API to their taxonomic name</span>
<span class="c1"># resolution service (TNRS)</span>
mynames <span class="o"><-</span> c<span class="p">(</span><span class="s">"shorea robusta"</span><span class="p">,</span> <span class="s">"pandanus patina"</span><span class="p">,</span> <span class="s">"oryza sativa"</span><span class="p">,</span> <span class="s">"durio zibethinus"</span><span class="p">,</span>
<span class="s">"rubus ulmifolius"</span><span class="p">,</span> <span class="s">"asclepias curassavica"</span><span class="p">,</span> <span class="s">"pistacia lentiscus"</span><span class="p">)</span>
iplant_tnrsmatch<span class="p">(</span>retrieve <span class="o">=</span> <span class="s">"all"</span><span class="p">,</span> taxnames <span class="o">=</span> c<span class="p">(</span><span class="s">"helianthus annuus"</span><span class="p">,</span> <span class="s">"acacia"</span><span class="p">,</span>
<span class="s">"gossypium"</span><span class="p">),</span> output <span class="o">=</span> <span class="s">"names"</span><span class="p">)</span>
AcceptedName MatchFam MatchGenus MatchScore Accept?
1 Helianthus annuus Asteraceae Helianthus 1 No opinion
2 Acacia Fabaceae Acacia 1 No opinion
3 Acacia 1 No opinion
4 Gossypium Malvaceae Gossypium 1 No opinion
SubmittedNames
1 helianthus annuus
2 acacia
3 acacia
4 gossypium
<span class="c1"># The global names resolver is another attempt at this, hitting many</span>
<span class="c1"># different data sources</span>
gnr_resolve<span class="p">(</span>names <span class="o">=</span> c<span class="p">(</span><span class="s">"Helianthus annuus"</span><span class="p">,</span> <span class="s">"Homo sapiens"</span><span class="p">),</span> returndf <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
data_source_id submitted_name name_string score
1 4 Helianthus annuus Helianthus annuus 0.988
3 10 Helianthus annuus Helianthus annuus 0.988
5 12 Helianthus annuus Helianthus annuus 0.988
8 110 Helianthus annuus Helianthus annuus 0.988
11 159 Helianthus annuus Helianthus annuus 0.988
13 166 Helianthus annuus Helianthus annuus 0.988
15 169 Helianthus annuus Helianthus annuus 0.988
2 4 Homo sapiens Homo sapiens 0.988
4 10 Homo sapiens Homo sapiens 0.988
6 12 Homo sapiens Homo sapiens 0.988
7 107 Homo sapiens Homo sapiens 0.988
9 122 Homo sapiens Homo sapiens 0.988
10 123 Homo sapiens Homo sapiens 0.988
12 159 Homo sapiens Homo sapiens 0.988
14 168 Homo sapiens Homo sapiens 0.988
16 169 Homo sapiens Homo sapiens 0.988
title
1 NCBI
3 Freebase
5 EOL
8 Illinois Wildflowers
11 CU*STAR
13 nlbif
15 uBio NameBank
2 NCBI
4 Freebase
6 EOL
7 AskNature
9 BioPedia
10 AnAge
12 CU*STAR
14 Index to Organism Names
16 uBio NameBank
<span class="c1"># We can hit the Plantminer API too</span>
plants <span class="o"><-</span> c<span class="p">(</span><span class="s">"Myrcia lingua"</span><span class="p">,</span> <span class="s">"Myrcia bella"</span><span class="p">,</span> <span class="s">"Ocotea pulchella"</span><span class="p">,</span> <span class="s">"Miconia"</span><span class="p">,</span>
<span class="s">"Coffea arabica var. amarella"</span><span class="p">,</span> <span class="s">"Bleh"</span><span class="p">)</span>
plantminer<span class="p">(</span>plants<span class="p">)</span>
Myrcia lingua
Myrcia bella
Ocotea pulchella
Miconia
Coffea arabica var. amarella
Bleh
fam genus sp author
1 Myrtaceae Myrcia lingua (O. Berg) Mattos
2 Myrtaceae Myrcia bella Cambess.
3 Lauraceae Ocotea pulchella (Nees & Mart.) Mez
4 Melastomataceae Miconia NA Ruiz & Pav.
5 Rubiaceae Coffea arabica var. amarella A. Froehner
6 NA Bleh NA NA
source source.id status confidence suggestion database
1 TRO 100227036 NA NA NA Tropicos
2 WCSP 131057 Accepted H NA The Plant List
3 WCSP (in review) 989758 Accepted M NA The Plant List
4 TRO 40018467 NA NA NA Tropicos
5 TRO 100170231 NA NA NA Tropicos
6 NA NA NA NA Baea NA
<span class="c1"># We made a light wrapper around the Taxonstand package to search</span>
<span class="c1"># Theplantlist.org too</span>
splist <span class="o"><-</span> c<span class="p">(</span><span class="s">"Heliathus annuus"</span><span class="p">,</span> <span class="s">"Abies procera"</span><span class="p">,</span> <span class="s">"Poa annua"</span><span class="p">,</span> <span class="s">"Platanus occidentalis"</span><span class="p">,</span>
<span class="s">"Carex abrupta"</span><span class="p">,</span> <span class="s">"Arctostaphylos canescens"</span><span class="p">,</span> <span class="s">"Ocimum basilicum"</span><span class="p">,</span> <span class="s">"Vicia faba"</span><span class="p">,</span>
<span class="s">"Quercus kelloggii"</span><span class="p">,</span> <span class="s">"Lactuca serriola"</span><span class="p">)</span>
tpl_search<span class="p">(</span>taxon <span class="o">=</span> splist<span class="p">)</span>
Genus Species Infraspecific Plant.Name.Index
1 Heliathus annuus FALSE
2 Abies procera TRUE
3 Poa annua TRUE
4 Platanus occidentalis TRUE
5 Carex abrupta TRUE
6 Arctostaphylos canescens TRUE
7 Ocimum basilicum TRUE
8 Vicia faba TRUE
9 Quercus kelloggii TRUE
10 Lactuca serriola TRUE
Taxonomic.status Family New.Genus New.Species
1 Heliathus annuus
2 Accepted Pinaceae Abies procera
3 Accepted Poaceae Poa annua
4 Accepted Platanaceae Platanus occidentalis
5 Accepted Cyperaceae Carex abrupta
6 Accepted Ericaceae Arctostaphylos canescens
7 Accepted Lamiaceae Ocimum basilicum
8 Accepted Leguminosae Vicia faba
9 Accepted Fagaceae Quercus kelloggii
10 Accepted Compositae Lactuca serriola
New.Infraspecific Authority Typo WFormat
1 FALSE FALSE
2 Rehder FALSE FALSE
3 L. FALSE FALSE
4 L. FALSE FALSE
5 <NA> Mack. FALSE FALSE
6 Eastw. FALSE FALSE
7 L. FALSE FALSE
8 L. FALSE FALSE
9 <NA> Newb. FALSE FALSE
10 L. FALSE FALSE
Taxonomic hierarchies
I often want the full taxonomic hierarchy for a set of species. That is, give me the family, order, class, etc. for my list of species. There are two different easy ways to do this with taxize
. The first example uses EOL.
Using EOL.
pageid <span class="o"><-</span> eol_search<span class="p">(</span><span class="s">"Quercus douglasii"</span><span class="p">)</span><span class="o">$</span>id<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="c1"># first need to search for the taxon's page on EOL</span>
out <span class="o"><-</span> eol_pages<span class="p">(</span>taxonconceptID <span class="o">=</span> pageid<span class="p">)</span> <span class="c1"># then we nee to get the taxon ID used by EOL</span>
<span class="c1"># Notice that there are multiple different sources you can pull the</span>
<span class="c1"># hierarchy from. Note even that you can get the hierarchy from the ITIS</span>
<span class="c1"># service via this EOL API.</span>
out
identifier scientificName
1 46203061 Quercus douglasii Hook. & Arn.
2 48373995 Quercus douglasii Hook. & Arn.
nameAccordingTo sourceIdentfier
1 Integrated Taxonomic Information System (ITIS) 19322
2 Species 2000 & ITIS Catalogue of Life: May 2012 9723391
taxonRank
1 Species
2 Species
<span class="c1"># Then the hierarchy!</span>
eol_hierarchy<span class="p">(</span>out<span class="p">[</span>out<span class="o">$</span>nameAccordingTo <span class="o">==</span> <span class="s">"Species 2000 & ITIS Catalogue of Life: May 2012"</span><span class="p">,</span>
<span class="s">"identifier"</span><span class="p">])</span>
sourceIdentifier taxonID parentNameUsageID taxonConceptID
1 11017504 48276627 0 281
2 11017505 48276628 48276627 282
3 11017506 48276629 48276628 283
4 11022500 48373354 48276629 4184
5 11025284 48373677 48373354 4197
scientificName taxonRank
1 Plantae kingdom
2 Magnoliophyta phylum
3 Magnoliopsida class
4 Fagales order
5 Fagaceae family
eol_hierarchy<span class="p">(</span>out<span class="p">[</span>out<span class="o">$</span>nameAccordingTo <span class="o">==</span> <span class="s">"Integrated Taxonomic Information System (ITIS)"</span><span class="p">,</span>
<span class="s">"identifier"</span><span class="p">])</span> <span class="c1"># and from ITIS, slightly different than ITIS output below, which includes taxa all the way down.</span>
sourceIdentifier taxonID parentNameUsageID taxonConceptID
1 202422 46150613 0 281
2 846492 46159776 46150613 8654492
3 846494 46161961 46159776 28818077
4 846496 46167532 46161961 4494
5 846504 46169010 46167532 28825126
6 846505 46169011 46169010 282
7 18063 46169012 46169011 283
8 846548 46202954 46169012 28859070
9 19273 46202955 46202954 4184
10 19275 46203022 46202955 4197
scientificName taxonRank
1 Plantae kingdom
2 Viridaeplantae subkingdom
3 Streptophyta infrakingdom
4 Tracheophyta division
5 Spermatophytina subdivision
6 Angiospermae infradivision
7 Magnoliopsida class
8 Rosanae superorder
9 Fagales order
10 Fagaceae family
And getting a taxonomic hierarchy using ITIS.
<span class="c1"># First, get the taxonomic serial number (TSN) that ITIS uses</span>
mytsn <span class="o"><-</span> get_tsn<span class="p">(</span><span class="s">"Quercus douglasii"</span><span class="p">,</span> <span class="s">"sciname"</span><span class="p">)</span>
Retrieving data for species ' Quercus douglasii '
<span class="c1"># Get the full taxonomic hierarchy for a taxon from the TSN</span>
itis<span class="p">(</span>mytsn<span class="p">,</span> <span class="s">"getfullhierarchyfromtsn"</span><span class="p">)</span>
$`1`
parentName parentTsn rankName taxonName tsn
1 Kingdom Plantae 202422
2 Plantae 202422 Subkingdom Viridaeplantae 846492
3 Viridaeplantae 846492 Infrakingdom Streptophyta 846494
4 Streptophyta 846494 Division Tracheophyta 846496
5 Tracheophyta 846496 Subdivision Spermatophytina 846504
6 Spermatophytina 846504 Infradivision Angiospermae 846505
7 Angiospermae 846505 Class Magnoliopsida 18063
8 Magnoliopsida 18063 Superorder Rosanae 846548
9 Rosanae 846548 Order Fagales 19273
10 Fagales 19273 Family Fagaceae 19275
11 Fagaceae 19275 Genus Quercus 19276
12 Quercus 19276 Species Quercus douglasii 19322
<span class="c1"># But this can be even easier!</span>
classification<span class="p">(</span>get_tsn<span class="p">(</span><span class="s">"Quercus douglasii"</span><span class="p">))</span> <span class="c1"># Boom!</span>
Retrieving data for species ' Quercus douglasii '
$`1`
parentName parentTsn rankName taxonName tsn
1 Kingdom Plantae 202422
2 Plantae 202422 Subkingdom Viridaeplantae 846492
3 Viridaeplantae 846492 Infrakingdom Streptophyta 846494
4 Streptophyta 846494 Division Tracheophyta 846496
5 Tracheophyta 846496 Subdivision Spermatophytina 846504
6 Spermatophytina 846504 Infradivision Angiospermae 846505
7 Angiospermae 846505 Class Magnoliopsida 18063
8 Magnoliopsida 18063 Superorder Rosanae 846548
9 Rosanae 846548 Order Fagales 19273
10 Fagales 19273 Family Fagaceae 19275
11 Fagaceae 19275 Genus Quercus 19276
12 Quercus 19276 Species Quercus douglasii 19322
<span class="c1"># You can also do this easy-peasy route to a taxonomic hierarchy using</span>
<span class="c1"># uBio</span>
classification<span class="p">(</span>get_uid<span class="p">(</span><span class="s">"Ornithorhynchus anatinus"</span><span class="p">))</span>
$`1`
ScientificName Rank UID
1 cellular organisms no rank 131567
2 Eukaryota superkingdom 2759
3 Opisthokonta no rank 33154
4 Metazoa kingdom 33208
5 Eumetazoa no rank 6072
6 Bilateria no rank 33213
7 Coelomata no rank 33316
8 Deuterostomia no rank 33511
9 Chordata phylum 7711
10 Craniata subphylum 89593
11 Vertebrata no rank 7742
12 Gnathostomata superclass 7776
13 Teleostomi no rank 117570
14 Euteleostomi no rank 117571
15 Sarcopterygii no rank 8287
16 Tetrapoda no rank 32523
17 Amniota no rank 32524
18 Mammalia class 40674
19 Prototheria no rank 9254
20 Monotremata order 9255
21 Ornithorhynchidae family 9256
22 Ornithorhynchus genus 9257
Sequences?
While you are at doing taxonomic stuff, you often wonder “hmmm, I wonder if there are any sequence data available for my species?” So, you can use get_seqs
to search for specific genes for a species, and get_genes_avail
to find out what genes are available for a certain species. These functions search for data on NCBI.
<span class="c1"># Get sequences (sequence is provied in output, but hiding here for</span>
<span class="c1"># brevity). What's nice about this is that it gets the longest sequence</span>
<span class="c1"># avaialable for the gene you searched for, and if there isn't anything</span>
<span class="c1"># available, it lets you get a sequence from a congener if you set</span>
<span class="c1"># getrelated=TRUE. The last column in the output data.frame also tells you</span>
<span class="c1"># what species the sequence is from.</span>
out <span class="o"><-</span> get_seqs<span class="p">(</span>taxon_name <span class="o">=</span> <span class="s">"Acipenser brevirostrum"</span><span class="p">,</span> gene <span class="o">=</span> c<span class="p">(</span><span class="s">"5S rRNA"</span><span class="p">),</span>
seqrange <span class="o">=</span> <span class="s">"1:3000"</span><span class="p">,</span> getrelated <span class="o">=</span> <span class="k-Variable">T</span><span class="p">,</span> writetodf <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span>
out<span class="p">[,</span> <span class="o">!</span>names<span class="p">(</span>out<span class="p">)</span> <span class="o">%in%</span> <span class="s">"sequence"</span><span class="p">]</span>
taxon gene_desc
1 Acipenser brevirostrum Acipenser brevirostrum 5S rRNA gene, clone BRE92A
gi_no acc_no length spused
1 60417159 AJ745069.1 121 Acipenser brevirostrum
<span class="c1"># Search for available sequences</span>
out <span class="o"><-</span> get_genes_avail<span class="p">(</span>taxon_name <span class="o">=</span> <span class="s">"Umbra limi"</span><span class="p">,</span> seqrange <span class="o">=</span> <span class="s">"1:2000"</span><span class="p">,</span> getrelated <span class="o">=</span> <span class="k-Variable">F</span><span class="p">)</span>
out<span class="p">[</span>grep<span class="p">(</span><span class="s">"RAG1"</span><span class="p">,</span> out<span class="o">$</span>genesavail<span class="p">,</span> ignore.case <span class="o">=</span> <span class="k-Variable">T</span><span class="p">),</span> <span class="p">]</span> <span class="c1"># does the string 'RAG1' exist in any of the gene names</span>
spused length
414 Umbra limi 732
427 Umbra limi 959
434 Umbra limi 1631
genesavail
414 isolate UlimA recombinase activating protein 1 (rag1) gene, exon 3 and partial cds
427 recombination-activating protein 1 (RAG1) gene, intron 2 and partial cds
434 recombination-activating protein 1 (RAG1) gene, partial cds
predicted
414 JX190826
427 AY459526
434 AY380548
Get the .Rmd file used to create this post at my github account – or .md file.
Written in Markdown, with help from knitr.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.