So, What Are You? ..A Plant? ..An Animal? — Nope, I’m a Fungus!

November 28, 2012

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

Recently I had a list of about 1,000 species names and wanted to keep only the plants, since plants are my field. I knew that Scott Chamberlain had put together the ritis package, which can obviously do this kind of thing. However, I already knew of ITIS and was keen to try its API myself.

Here’s what I came up with, using the ITIS API. (Updated on 11 Dec 2012: the previous version had a flaw with ambiguous matches; it should be fine now.) Of course, some taxa are not covered by the database, e.g. Ixodes, as the output below shows:

library(XML)  # provides htmlParse(), xpathSApply() and xmlValue()

get_tsn <- function(sp_name) {
  units <- tolower(unlist(strsplit(sp_name, " ")))

  # valid string?
  if (length(units) > 2) {
    stop("...No valid search string submitted (two words separated by one space)!")
  }

  itis_xml <- htmlParse(paste("",
                              sp_name, sep=""))
  tsn <- xpathSApply(itis_xml, "//tsn", xmlValue)
  # lower-case the returned names and strip whitespace so they compare cleanly
  unitname1 <- tolower(gsub("\\s+", "", xpathSApply(itis_xml, "//unitname1", xmlValue)))
  unitname2 <- tolower(gsub("\\s+", "", xpathSApply(itis_xml, "//unitname2", xmlValue)))
  unitname3 <- tolower(gsub("\\s+", "", xpathSApply(itis_xml, "//unitname3", xmlValue)))

  # sp_name = genus only: get the tsn where unitname1 matches exactly
  # and unitname2 (lower-level taxon) is absent
  if (length(units) == 1) {
    return(tsn[unitname1 == units[1] & unitname2 == ""])
  }

  # sp_name = genus and epithet: get the tsn where both match exactly
  # and unitname3 (lower-level taxon) is absent
  if (length(units) == 2) {
    return(tsn[unitname1 == units[1] &
               unitname2 == units[2] &
               nchar(unitname3) == 0])
  }
}
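
The exact-match filter used in get_tsn can be tried offline. The following is a minimal sketch in base R, not the author's code: pick_tsn is a hypothetical helper, and the tsn and unitname vectors are made-up placeholders standing in for values that would normally be extracted from the ITIS response.

```r
# Hypothetical stand-ins for the vectors extracted from an ITIS response;
# the TSN values are placeholders, not real ITIS identifiers.
tsn       <- c("t1", "t2", "t3")
unitname1 <- c("canis", "canis", "canis")
unitname2 <- c("", "lupus", "lupus")
unitname3 <- c("", "", "arctos")

# Same matching rules as in get_tsn, minus the web request.
pick_tsn <- function(sp_name) {
  units <- tolower(unlist(strsplit(sp_name, " ")))
  if (length(units) == 1) {
    # genus only: no lower-level taxon allowed
    return(tsn[unitname1 == units[1] & unitname2 == ""])
  }
  # genus and epithet: exclude subspecies and other lower-level taxa
  tsn[unitname1 == units[1] & unitname2 == units[2] & nchar(unitname3) == 0]
}

pick_tsn("Canis")        # selects the genus-level record only
pick_tsn("Canis lupus")  # selects the species-level record only
```

Because all three name parts are compared, a genus query does not pick up its species, and a species query does not pick up its subspecies.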

get_kngdm <- function(tsn) {
  kngdm <- xpathSApply(htmlParse(paste("",
                                       tsn, sep="")),
                       "//kingdomname", xmlValue)
  return(kngdm)
}

get_tsn_kngdm <- function(x) {
  y <- get_tsn(x)
  z <- get_kngdm(y)
  return(list(Name = x, TSN = y, Kingdom = z))
}

# I had some API-related errors (I guess it was mysteriously not answering in
# some cases). I couldn't resolve this and thus implemented tryCatch
get_tsn_kngdm_try <- function(x) tryCatch(get_tsn_kngdm(x), error = function(e) NULL)
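
One consequence of returning NULL on error is that a failed name simply disappears when the results are row-bound. A possible variation, sketched here under assumptions, returns a placeholder row with NA values instead, so every input name stays visible. Both fake_lookup and safe_lookup are hypothetical names; fake_lookup only simulates a failing request so that the example runs without network access.

```r
# Hypothetical stand-in for get_tsn_kngdm(): it fails for one name to mimic
# an unanswered API request. The returned values are placeholders.
fake_lookup <- function(x) {
  if (x == "Ixodes") stop("no answer from the web service")
  list(Name = x, TSN = "placeholder", Kingdom = "placeholder")
}

# On error, return an NA row for the name instead of NULL.
safe_lookup <- function(x, lookup = fake_lookup) {
  tryCatch(
    lookup(x),
    error = function(e) list(Name = x, TSN = NA_character_, Kingdom = NA_character_)
  )
}

rows <- lapply(c("Physcia", "Ixodes"), safe_lookup)
result_demo <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
result_demo
```

With this variant the number of output rows always equals the number of names submitted, which makes it easier to spot which lookups failed.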

sp_names <- c("Clostridium", "Physcia", "Viola", "Ixodes", "LYNX", "Homo sapiens", "Canis lupus")

system.time(result <- data.frame(do.call(rbind, lapply(sp_names, FUN = get_tsn_kngdm_try))))
#    user  system elapsed
#    1.54    0.01   33.66
# result
#           Name    TSN  Kingdom
# 1  Clostridium 555645   Monera
# 2      Physcia  14024    Fungi
# 3        Viola  22030  Plantae
# 4       Ixodes
# 5         LYNX 180581 Animalia
# 6 Homo sapiens 180092 Animalia
# 7  Canis lupus 180596 Animalia
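
With the kingdoms looked up, the original goal of keeping only the plants comes down to a simple subset. The sketch below rebuilds the printed table by hand so that it runs without the web service; printed_result is a hypothetical name, and coding the empty Ixodes entries as NA is an assumption.

```r
# Values copied from the printed output above; the blank Ixodes cells are
# represented as NA (an assumption about how missing matches are stored).
printed_result <- data.frame(
  Name    = c("Clostridium", "Physcia", "Viola", "Ixodes", "LYNX",
              "Homo sapiens", "Canis lupus"),
  TSN     = c("555645", "14024", "22030", NA, "180581", "180092", "180596"),
  Kingdom = c("Monera", "Fungi", "Plantae", NA, "Animalia", "Animalia", "Animalia"),
  stringsAsFactors = FALSE
)

# which() drops the NA comparisons, so unmatched names are excluded
plants <- printed_result[which(printed_result$Kingdom == "Plantae"), ]
plants$Name
```

Using which() rather than a bare logical index avoids the all-NA rows that R would otherwise return for names without a kingdom.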
