Use case: combining taxize and rgbif

[This article was first published on rOpenSci » R, and kindly contributed to R-bloggers.]

Sure thing… this is just the sort of thing for which rOpenSci is being built.

A colleague of mine recently saw our packages in development and thought, “Hey, that could totally make my life easier.” What was made easier, you ask? This was his situation:

He had a list of ca. 1200 species of birds and wanted to first obtain the most current species names before seeking location data for occurrences of all the species.

So what tools do we need for this?  We need the packages taxize and rgbif:

  • taxize: The taxize package allows you to search taxonomic information across the Universal Biological Indexer and Organizer (uBio), the Integrated Taxonomic Information System (ITIS), the Encyclopedia of Life (EOL), the Taxonomic Name Resolution Service (TNRS), and Phylomatic.
  • rgbif: The rgbif package allows you to search for and retrieve data from the Global Biodiversity Information Facility.

If you want to run this code, the entire workflow is here, as a GitHub Gist.

First step: check names

Note that for brevity we use only a subset of my friend’s actual dataset here: 10 of the roughly 1200 species.

Let’s wrap up all the dirty work in one function called checkname. This function uses a few taxize functions, including get_tsn and getacceptname.

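checkname <- function(name) {
  # name: scientific name
  # Get the taxonomic serial number (TSN); fall back to "no_results" if the lookup errors
  tsn <- try(get_tsn(name, "sciname", by_ = "name"), silent = TRUE)
  if (inherits(tsn, "try-error")) tsn <- "no_results"
  # Check the accepted name for that TSN
  out <- getacceptname(tsn)
  if (out[[1]] == "no_results") {
    # No match in ITIS: flag the name for a spelling check
    list("check_spelling", name, "check_spelling", out)
  } else if (length(out) == 2) {
    # Two values returned: the accepted name and its TSN
    list("new_name", name, as.character(out)[[1]], as.character(out)[[2]])
  } else if (is.numeric(as.numeric(out))) {
    # A single TSN returned: the submitted name is already the accepted one
    list("good_name", name, name, out)
  }
}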
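The three-way decision inside checkname can be tried without querying ITIS. The sketch below is only an illustration: classify_name is a hypothetical helper (not a taxize function), and the lookup results are typed in by hand from the output table in this post, on the assumption that getacceptname returns "no_results", an accepted name plus TSN, or a single TSN.

```r
# Hypothetical helper: applies the same three-way rule as checkname(),
# but takes the lookup result as an argument instead of calling ITIS.
classify_name <- function(name, out) {
  if (identical(out[[1]], "no_results")) {
    list(status = "check_spelling", name_old = name,
         name_new = "check_spelling", TSN = "no_results")
  } else if (length(out) == 2) {
    list(status = "new_name", name_old = name,
         name_new = as.character(out[[1]]), TSN = as.character(out[[2]]))
  } else {
    list(status = "good_name", name_old = name,
         name_new = name, TSN = as.character(out[[1]]))
  }
}

# Hand-typed stand-ins for lookup results (assumed shapes, see lead-in).
lookups <- list(
  Agapornis_roseicapillis = "no_results",
  Catharacta_skua = list("Stercorarius skua", "660059"),
  Cathartes_aura = "175265"
)

results <- Map(classify_name, names(lookups), lookups)
status_table <- do.call(
  rbind, lapply(results, as.data.frame, stringsAsFactors = FALSE)
)
rownames(status_table) <- NULL
print(status_table)
```

Keeping the rule separate from the web query like this makes it easy to check each branch before running the full species list.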

Now let’s run our species list through checkname using the llply function from the plyr package.

library(taxize)
library(plyr)

ournames <- read.csv("birdlist_ten.csv")

itisout <- llply(ournames[, 1], checkname, .progress = "text") # query ITIS
  |======================================================================================| 100%

dfnames <- ldply(itisout, function(x) { # make a data frame of results
    out_ <- as.data.frame(x)
    names(out_) <- c("status", "name_old", "name_new", "TSN")
    out_})

dfnames
           status                name_old                 name_new        TSN
1  check_spelling Agapornis_roseicapillis           check_spelling no_results
2        new_name  Catharacta_maccormicki Stercorarius maccormicki     660062
3        new_name         Catharacta_skua        Stercorarius skua     660059
4       good_name          Cathartes_aura           Cathartes_aura     175265
5       good_name      Catharus_bicknelli       Catharus_bicknelli     554148
6       good_name     Catharus_fuscescens      Catharus_fuscescens     179796
7       good_name       Catharus_guttatus        Catharus_guttatus     179779
8       good_name        Catharus_minimus         Catharus_minimus     179793
9       good_name      Catharus_ustulatus       Catharus_ustulatus     179788
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796

It looks like we have one name spelled wrong (“check_spelling”) and three name replacements (“new_name”); the remainder checked out just fine with ITIS.
Now we need to remove the one species with the spelling problem (although you would of course fix it if this were your project). Then we feed the new species list to GBIF queries.
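With 1200 species you would not want to hunt for the row numbers of the flagged names. As a rough sketch, the flagged rows can be dropped by their status value instead, and the underscores swapped for spaces in the same pass. The small data frame here is typed in from three rows of the table above, and usable is just an illustrative name.

```r
# Three rows copied from the dfnames output above.
dfnames <- data.frame(
  status   = c("check_spelling", "new_name", "good_name"),
  name_old = c("Agapornis_roseicapillis", "Catharacta_skua", "Cathartes_aura"),
  name_new = c("check_spelling", "Stercorarius skua", "Cathartes_aura"),
  TSN      = c("no_results", "660059", "175265"),
  stringsAsFactors = FALSE
)

# Keep every row that ITIS could resolve, wherever it sits in the table.
usable <- dfnames[dfnames$status != "check_spelling", ]

# Names that passed unchanged still carry underscores; use spaces for the GBIF search.
usable$gbifname <- gsub("_", " ", usable$name_new)
print(usable)
```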

p.s. The output from above spits out TSNs too, which you can use to query for more taxonomic information for species through the taxize package.

Second step: get lat/long data

dfnames$gbifname <- gsub("_", " ", dfnames[,3]) # create new name column

dfnames # we now have a column of names without the underscore for GBIF search
           status                name_old                 name_new        TSN                 gbifname
1  check_spelling Agapornis_roseicapillis           check_spelling no_results           check spelling
2        new_name  Catharacta_maccormicki Stercorarius maccormicki     660062 Stercorarius maccormicki
3        new_name         Catharacta_skua        Stercorarius skua     660059        Stercorarius skua
4       good_name          Cathartes_aura           Cathartes_aura     175265           Cathartes aura
5       good_name      Catharus_bicknelli       Catharus_bicknelli     554148       Catharus bicknelli
6       good_name     Catharus_fuscescens      Catharus_fuscescens     179796      Catharus fuscescens
7       good_name       Catharus_guttatus        Catharus_guttatus     179779        Catharus guttatus
8       good_name        Catharus_minimus         Catharus_minimus     179793         Catharus minimus
9       good_name      Catharus_ustulatus       Catharus_ustulatus     179788       Catharus ustulatus
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796        Bycanistes brevis

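dfnames <- dfnames[-1,] # remove row 1 (the misspelled name)

library(rgbif)

# For each name, request up to 10 georeferenced records and return them as a lat/long data frame
gbiftestout <- llply(as.list(dfnames[,5]), function(x) occurrencelist(x, coordinatestatus = TRUE, maxresults = 10, latlongdf = TRUE))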

gbiftestout[[1]] # here's the data frame of results from one species
                    sciname latitude longitude
1  Stercorarius maccormicki 36.65685 -121.9187
2  Stercorarius maccormicki 36.85800 -122.0910
3  Stercorarius maccormicki 46.89017 -125.0051
4  Stercorarius maccormicki 36.85800 -122.0910
5  Stercorarius maccormicki 36.65685 -121.9187
6  Stercorarius maccormicki 40.76234 -124.2363
7  Stercorarius maccormicki 36.85800 -122.0910
8  Stercorarius maccormicki 36.85800 -122.0910
9  Stercorarius maccormicki 36.85800 -122.0910
10 Stercorarius maccormicki 40.76234 -124.2363

gbiftestout_df <- ldply(gbiftestout, identity) # make a data frame of all results

rbind(head(gbiftestout_df), tail(gbiftestout_df)) # look at first and last 6 rows
                    sciname latitude longitude
1  Stercorarius maccormicki 36.65685 -121.9187
2  Stercorarius maccormicki 36.85800 -122.0910
3  Stercorarius maccormicki 46.89017 -125.0051
4  Stercorarius maccormicki 36.85800 -122.0910
5  Stercorarius maccormicki 36.65685 -121.9187
6  Stercorarius maccormicki 40.76234 -124.2363
85        Bycanistes brevis -0.16700   37.3170
86        Bycanistes brevis  0.31700   32.5830
87        Bycanistes brevis -0.16700   37.3170
88        Bycanistes brevis -0.16700   37.3170
89        Bycanistes brevis  0.05000   37.6500
90        Bycanistes brevis  0.05000   37.6500

Beauty!  That just saved a lot of time I reckon.

Of course there are many more options within the functions to grab data from GBIF – I only show retrieval of latitude and longitude data for species here.
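The results above contain the same coordinates several times over, so it can help to drop repeated records before mapping. Here is a minimal base R sketch of stacking the per-species data frames, removing exact duplicates, and counting what is left. The rows are a handful copied from the output above, not a fresh GBIF query, and occ_list, occ_all and occ_unique are illustrative names.

```r
# A few rows copied from the output above, one data frame per species.
occ_list <- list(
  data.frame(
    sciname   = "Stercorarius maccormicki",
    latitude  = c(36.65685, 36.85800, 46.89017, 36.85800),
    longitude = c(-121.9187, -122.0910, -125.0051, -122.0910),
    stringsAsFactors = FALSE
  ),
  data.frame(
    sciname   = "Bycanistes brevis",
    latitude  = c(-0.167, 0.317, -0.167),
    longitude = c(37.317, 32.583, 37.317),
    stringsAsFactors = FALSE
  )
)

# Stack the list into one data frame (base R alternative to ldply(..., identity)).
occ_all <- do.call(rbind, occ_list)

# Drop rows that repeat the same species and coordinates.
occ_unique <- unique(occ_all)

# Count the distinct records per species.
counts <- table(occ_unique$sciname)
print(counts)
```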

Third step: make some maps

install.packages("maps") # provides the map data used by map_data(); only needed once
library(ggplot2)
library(maps)

world <- map_data("world")
mexico <- subset(world, region=="Mexico")
# Make a plot for Stercorarius maccormicki
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[1]],
    aes(longitude, latitude), alpha=0.6, size = 4, color = "blue") +
  ggtitle("Stercorarius maccormicki") # ggtitle() replaces the removed opts(title = ...)

# Make a plot for Catharus guttatus, just in Mexico though
ggplot(mexico, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[6]],
    aes(longitude, latitude), alpha=0.6, size = 4, color = "blue") +
  ggtitle("Catharus guttatus")
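The two plots differ only in the base map, the occurrence data, and the title, so with many species it may be worth folding them into one function. The following is a sketch under assumptions: plot_occurrences is a hypothetical helper (not part of ggplot2 or rgbif), and the base map is a plain rectangle standing in for map_data("world") so that the example runs without the maps package. The three points are copied from the Stercorarius maccormicki output above.

```r
library(ggplot2)

# Hypothetical helper: draws occurrence points over any base map data frame
# that has long, lat and group columns (the layout map_data() returns).
plot_occurrences <- function(basemap, occ, title) {
  ggplot(basemap, aes(long, lat)) +
    geom_polygon(aes(group = group), fill = "white", colour = "gray40") +
    geom_jitter(data = occ, aes(longitude, latitude),
                alpha = 0.6, size = 4, colour = "blue") +
    ggtitle(title)
}

# Stand-in base map: one rectangle instead of real country outlines.
basemap <- data.frame(
  long  = c(-130, -115, -115, -130),
  lat   = c(30, 30, 50, 50),
  group = 1
)

# Three records copied from the output earlier in the post.
occ <- data.frame(
  longitude = c(-121.9187, -122.0910, -125.0051),
  latitude  = c(36.65685, 36.85800, 46.89017)
)

p <- plot_occurrences(basemap, occ, "Stercorarius maccormicki")
print(p)
```

In the real workflow you would pass world or mexico as the base map and an element of gbiftestout as the occurrence data.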

Here are the two maps, first for Stercorarius maccormicki and then for Catharus guttatus.

 

 

Fourth step: smile and get back to us

Wasn’t that easy?  So much better than checking names one by one manually, then retrieving data from GBIF manually, both through web interfaces.

Please tell us here, or on Twitter, what other use cases you can think of!

Again, if you want to run this code, the entire workflow is here, as a GitHub Gist.  And the species list is below.

The species list:


genus_species
Agapornis_roseicapillis
Catharacta_maccormicki
Catharacta_skua
Cathartes_aura
Catharus_bicknelli
Catharus_fuscescens
Catharus_guttatus
Catharus_minimus
Catharus_ustulatus
Ceratogymna_brevis
