Resolving species names when you have a lot of them

January 25, 2013
By

(This article was first published on Recology - R, and kindly contributed to R-bloggers)

taxize use case: Resolving species names when you have a lot of them

Species names can be a pain in the ass, especially if you are an ecologist. We ecologists aren't trained in taxonomy, yet we often end up with huge species lists. Of course we want to correct any spelling errors in the names, and get the newest names for our species, resolve any synonyms, etc.

We are building tools into our R package taxize, that will let you check your species names to make sure they are correct.

An important use case is when you have a lot of species. Someone wrote to us recently, saying that they had thousands of species, and they wanted to know how to check their species names efficiently in R.

Below is an example of how to do this.


Install taxize

# install_github('taxize_', 'ropensci') # install the GitHub version, not
# the CRAN version, uncomment if you don't have it installed
library(taxize)

Get some species, in this case all species in the Scrophulariaceae family from theplantlist.org

tpl_get(dir_ = "~/foo2", family = "Scrophulariaceae")
## Reading and writing csv files to ~/foo2...
dat <- read.csv("~/foo2/Scrophulariaceae.csv")

Lets grab the species and concatenate to genus_species

species <- as.character(ddply(dat[, c("Genus", "Species")], .(), transform, 
    gen_sp = as.factor(paste(Genus, Species, sep = " ")))[, 4])

It's better to do many smaller calls to a web API instead of few big ones to be nice to the database maintainers.

## Define function to split up your species list into useable chuncks
slice <- function(input, by = 2) {
    starts <- seq(1, length(input), by)
    tt <- lapply(starts, function(y) input[y:(y + (by - 1))])
    llply(tt, function(x) x[!is.na(x)])
}
species_split <- slice(species, by = 100)

Query for your large species list with pauses between calls, with 3 seconds in between calls to not hit the web service too hard. Using POST method here instead of GET - required when you have a lot of species.

tnrs_safe <- failwith(NULL, tnrs)  # in case some calls fail, will continue
out <- llply(species_split, function(x) tnrs_safe(x, getpost = "POST", sleep = 3))
Calling http://taxosaurus.org/retrieve/90fcd9ae425ad7c6103b06dd9fd78ae2
Calling http://taxosaurus.org/retrieve/223f73b83fcddcb8b6187966963660a8
Calling http://taxosaurus.org/retrieve/72bacdbb8938316e321d4c709c8cdd09
Calling http://taxosaurus.org/retrieve/979ce9cc4dec376710f61de162e1294e
Calling http://taxosaurus.org/retrieve/03a39a124561fec2fdfc0f483d9fb607
Calling http://taxosaurus.org/retrieve/d4bf4e5a1403f45a1be1ca3dd87785d7
Calling http://taxosaurus.org/retrieve/a9a9bdde6fda7e325d80120e27ccb480
Calling http://taxosaurus.org/retrieve/215ccdcf2b00362278bf19d1942e1395
Calling http://taxosaurus.org/retrieve/9d43c0b99b4dfb5ea1b435adab17b980
Calling http://taxosaurus.org/retrieve/42e166f8e43f1fb349e36459cd5938b3
Calling http://taxosaurus.org/retrieve/2c42e4b5227c5464f9bfeeafcdf0651d
# Looks like we got some data back for each element of our species list
lapply(out, head)[1:2]  # just look at the first two
[[1]]
                 submittedName                 acceptedName    sourceId
1        Aptosimum welwitschii                              iPlant_TNRS
2        Anticharis ebracteata        Anticharis ebracteata iPlant_TNRS
3            Aptosimum lineare            Aptosimum lineare iPlant_TNRS
4     Antherothamnus pearsonii     Antherothamnus pearsonii iPlant_TNRS
5 Barthlottia madagascariensis Barthlottia madagascariensis iPlant_TNRS
6         Agathelpis mucronata                              iPlant_TNRS
  score                  matchedName     annotations
1     1        Aptosimum welwitschii                
2     1        Anticharis ebracteata          Schinz
3     1            Aptosimum lineare Marloth & Engl.
4     1     Antherothamnus pearsonii        N.E. Br.
5     1 Barthlottia madagascariensis      Eb. Fisch.
6     1         Agathelpis mucronata                
                                    uri
1                                      
2 http://www.tropicos.org/Name/29202501
3 http://www.tropicos.org/Name/29202525
4 http://www.tropicos.org/Name/29202728
5 http://www.tropicos.org/Name/50089700
6                                      

[[2]]
                     submittedName           acceptedName    sourceId
1 Buddleja pichinchensis x bullata Buddleja pichinchensis iPlant_TNRS
2                 Buddleja soratae       Buddleja soratae iPlant_TNRS
3              Buddleja euryphylla    Buddleja euryphylla iPlant_TNRS
4                  Buddleja incana        Buddleja incana iPlant_TNRS
5                  Buddleja incana                 Incana        NCBI
6                    Buddleja nana Buddleja brachystachya iPlant_TNRS
  score            matchedName        annotations
1   0.9 Buddleja pichinchensis              Kunth
2   1.0       Buddleja soratae           Kraenzl.
3   1.0    Buddleja euryphylla Standl. & Steyerm.
4   1.0        Buddleja incana        Ruiz & Pav.
5   1.0        Buddleja incana               none
6   1.0          Buddleja nana              Diels
                                          uri
1       http://www.tropicos.org/Name/19000333
2       http://www.tropicos.org/Name/19001018
3       http://www.tropicos.org/Name/19000790
4       http://www.tropicos.org/Name/19000596
5 http://www.ncbi.nlm.nih.gov/taxonomy/405077
6       http://www.tropicos.org/Name/19001133
# Now we can put them back together as so into one data.frame if you like
outdf <- ldply(out)
head(outdf)
                 submittedName                 acceptedName    sourceId
1        Aptosimum welwitschii                              iPlant_TNRS
2        Anticharis ebracteata        Anticharis ebracteata iPlant_TNRS
3            Aptosimum lineare            Aptosimum lineare iPlant_TNRS
4     Antherothamnus pearsonii     Antherothamnus pearsonii iPlant_TNRS
5 Barthlottia madagascariensis Barthlottia madagascariensis iPlant_TNRS
6         Agathelpis mucronata                              iPlant_TNRS
  score                  matchedName     annotations
1     1        Aptosimum welwitschii                
2     1        Anticharis ebracteata          Schinz
3     1            Aptosimum lineare Marloth & Engl.
4     1     Antherothamnus pearsonii        N.E. Br.
5     1 Barthlottia madagascariensis      Eb. Fisch.
6     1         Agathelpis mucronata                
                                    uri
1                                      
2 http://www.tropicos.org/Name/29202501
3 http://www.tropicos.org/Name/29202525
4 http://www.tropicos.org/Name/29202728
5 http://www.tropicos.org/Name/50089700
6

Note that there are multiple names for some species because data sources have different names for the same species (resulting in more than one row in the data.frame 'outdf' for a species). We are leaving this up to the user to decide which to use. For example, for the species Buddleja montana there are two names for in the output

data <- ddply(outdf, .(submittedName), summarize, length(submittedName))
outdf[outdf$submittedName %in% as.character(data[data$..1 > 1, ][6, "submittedName"]), 
    ]
       submittedName     acceptedName    sourceId score      matchedName
123 Buddleja montana Buddleja montana iPlant_TNRS     1 Buddleja montana
124 Buddleja montana          Montana        NCBI     1 Buddleja montana
         annotations                                         uri
123 Britton ex Rusby       http://www.tropicos.org/Name/19000601
124             none http://www.ncbi.nlm.nih.gov/taxonomy/441235

The source iPlant matched the name, but NCBI actually gave back a genus of cricket (follow the link under the column uri for Montana). If you look at the page for Buddleja on NCBI here there is no Buddleja montana at all.

Another thing we could do is look at the score that is returned. Let's look at those that are less than 1 (i.e., )

outdf[outdf$score < 1, ]
                        submittedName           acceptedName    sourceId
94   Buddleja pichinchensis x bullata Buddleja pichinchensis iPlant_TNRS
340                Diascia ellaphieae                        iPlant_TNRS
495              Eremophila decipiens                        iPlant_TNRS
500            Eremophila grandiflora             Eremophila iPlant_TNRS
808           Jamesbrittneia hilliard                        iPlant_TNRS
1051                 Verbascum patris              Verbascum iPlant_TNRS
1081             Verbascum barnadesii              Verbascum iPlant_TNRS
1097              Verbascum calycosum              Verbascum iPlant_TNRS
     score            matchedName annotations
94    0.90 Buddleja pichinchensis       Kunth
340   0.98      Diascia ellaphiae            
495   0.98  Eremophila decipiense            
500   0.50             Eremophila      R. Br.
808   0.50         Jamesbrittenia            
1051  0.50              Verbascum          L.
1081  0.50              Verbascum          L.
1097  0.50              Verbascum          L.
                                       uri
94   http://www.tropicos.org/Name/19000333
340                                       
495                                       
500  http://www.tropicos.org/Name/40004761
808                                       
1051 http://www.tropicos.org/Name/40023766
1081 http://www.tropicos.org/Name/40023766
1097 http://www.tropicos.org/Name/40023766

As we got this speies list from theplantlist.org, there aren't that many mistakes, but if it was my species list you know there would be many :)


That's it. Try it out and let us know if you have any questions at [email protected], or ask questions/report problems at GitHub.


Get the .Rmd file used to create this post at my github account - or .md file.

Written in Markdown, with help from knitr.

To leave a comment for the author, please follow the link and comment on his blog: Recology - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.