Scraping Flora of North America

January 27, 2012

(This article was first published on Recology - R, and kindly contributed to R-bloggers)

Flora of North America is an awesome collection of taxonomic information for plants across the continent. However, the information within is not easily machine readable.

So, a little web scraping is called for.

rfna is an R package to collect information from the Flora of North America.

So far, you can:

1. Get taxonomic names from web pages that index the names.
2. Get daughter URLs for those taxa. Each daughter page has its own 2nd order daughter URLs, so you can scrape either the 1st order daughter page or the pages beneath it.
3. Query Asteraceae taxa for whether they have paleate or epaleate receptacles. This function is something I needed, but more functions like it will be made to get other specific traits.
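To make steps 2 and 3 a bit more concrete, here is a minimal base R sketch of the general idea. It is not the rfna implementation: the helper names extract_links and classify_receptacle, the example.org base URL, the HTML snippet, and the two descriptions are all made up for illustration, and no real Flora of North America page is fetched.

```r
# Hypothetical sketch only: these helpers are NOT part of rfna.

# A tiny stand-in for an index page (made-up markup and paths).
index_html <- paste0(
  '<ul>',
  '<li><a href="taxon/alpha.html">Genus alpha</a></li>',
  '<li><a href="taxon/beta.html">Genus beta</a></li>',
  '</ul>'
)

# Step 2: pull the daughter URLs out of the index page.
extract_links <- function(html, base_url) {
  # Find every href="..." attribute, then strip the attribute wrapper.
  hits <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  if (length(hits) == 0) {
    return(character(0))
  }
  paths <- sub('^href="', "", sub('"$', "", hits))
  paste0(base_url, paths)
}

# Step 3: decide whether a description mentions paleate or epaleate receptacles.
classify_receptacle <- function(description) {
  text <- tolower(description)
  # Word boundaries keep "paleate" from matching inside "epaleate".
  if (grepl("\\bepaleate\\b", text, perl = TRUE)) {
    "epaleate"
  } else if (grepl("\\bpaleate\\b", text, perl = TRUE)) {
    "paleate"
  } else {
    NA_character_
  }
}

urls <- extract_links(index_html, "http://example.org/")

# Made-up description text standing in for what each daughter page would contain.
descriptions <- c("Receptacles flat, epaleate.", "Receptacles convex, paleate.")

result <- data.frame(
  url = urls,
  receptacle = vapply(descriptions, classify_receptacle, character(1),
                      USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

A regular expression is enough for a toy snippet like this one. For real pages a proper HTML parser would be the safer choice.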

Further functions will add search and other features.

You can install by:

install.packages("devtools")
library(devtools)
# Newer versions of devtools expect the repository as a single "user/repo" string
install_github("rOpenSci/rfna")
library(rfna)

Here is an example where a set of URLs is acquired using the function getdaughterURLs, then the function receptacle is used to ask whether each of the taxa at those URLs has paleate or epaleate receptacles.
