The rOpenSci project aims to provide programmatic access to scientific data repositories on the web. A vast majority of the packages in our current suite retrieve some form of biodiversity or taxonomic data. Since several of these datasets have been georeferenced, they provide numerous opportunities for visualizing species distributions, building species distribution maps, and running analyses such as species distribution models. In an effort to streamline access to these data, we have developed a package called spocc, which provides a unified API to all the biodiversity sources that we provide. The obvious advantage is that a user can interact with a common API and not worry about the nuances in syntax that differ between packages. As more data sources come online, users can access even more data without significant changes to their code. However, it is important to note that spocc will never replicate the full functionality that exists within specific packages. Therefore, users with a strong interest in one of the specific data sources listed below would benefit from familiarising themselves with the inner workings of the appropriate packages.
spocc currently interfaces with six major biodiversity repositories. Many of these packages have been part of the rOpenSci suite:
Global Biodiversity Information Facility
GBIF is a government-funded open data repository with several partner organizations with the express goal of providing access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.
Berkeley Ecoengine
The ecoengine is an open API built by the Berkeley Initiative for Global Change Biology. The repository provides access to over 3 million specimens from various Berkeley natural history museums. These data span more than a century and provide access to georeferenced specimens, species checklists, photographs, vegetation surveys and resurveys, and a variety of measurements from environmental sensors located at reserves across the University of California's natural reserve system. (related blog post)
iNaturalist provides access to crowd-sourced citizen science data on species observations.
Like rgbif, ecoengine, and rbison (see below), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology). Note that we don't currently support VertNet data in this package, but we should soon.
Biodiversity Information Serving Our Nation
Built by the US Geological Survey's core science analytic team, BISON is a portal that provides access to species occurrence data from several participating institutions.
eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.
AntWeb is the world's largest online database of images, specimen records, and natural history information on ants. It is community driven and open to contribution from anyone with specimen records, natural history comments, or images. (related blog post)
Note: It's important to keep in mind that several data providers interface with many of the above-mentioned repositories. This means that occurrence data obtained from BISON may be duplicates of data that are also available through GBIF. We do not have a way to resolve these duplicates or overlaps at this time, but it is an issue we are hoping to address in future versions of the package.
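Until the package handles this, one rough way to spot likely duplicates after combining records from several sources is to compare species name and rounded coordinates. The sketch below is a simple heuristic and not something spocc does for you. It runs on a hand-made data frame whose rows are invented for illustration, and it assumes columns named name, longitude, latitude and prov.

```r
# Hypothetical combined records from two sources (values invented for illustration).
occ_records <- data.frame(
  name      = c("Accipiter striatus", "Accipiter striatus",
                "Accipiter striatus", "Bison bison"),
  longitude = c(-97.65, -97.65, -122.44, -110.50),
  latitude  = c(30.158, 30.158, 37.490, 44.60),
  prov      = c("gbif", "bison", "gbif", "bison"),
  stringsAsFactors = FALSE
)

# Treat records as likely duplicates when the name and the coordinates
# (rounded to 3 decimal places) match, regardless of the provider.
dedup_key <- paste(
  occ_records$name,
  round(occ_records$longitude, 3),
  round(occ_records$latitude, 3),
  sep = "|"
)

# Keep the first record seen for each key.
deduped <- occ_records[!duplicated(dedup_key), ]
deduped
```

Which copy survives depends on row order, and matching on coordinates alone can merge genuinely distinct observations made at the same place, so treat the result as a starting point rather than a definitive clean-up.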
Installing the package
    install.packages("spocc")
    # or install the most recent version
    devtools::install_github("ropensci/spocc")
    library(spocc)
Searching species occurrence data
The main workhorse function of the package is called occ. The function allows you to search for occurrence records on a single species or a list of species, from one source of interest or from several. The main input is a query, with sources specified under the argument from. So to look at a really simple query:
    results <- occ(query = 'Accipiter striatus', from = 'gbif')
    results
    #> Summary of results - occurrences found for:
    #> gbif : 25 records across 1 species
    #> bison : 0 records across 1 species
    #> inat : 0 records across 1 species
    #> ebird : 0 records across 1 species
    #> ecoengine : 0 records across 1 species
    #> antweb : 0 records across 1 species
This returns the results as an S3 class with a slot for each data source. Since we only requested data from gbif, the remaining slots are empty. To view the data:
    results$gbif
    #> $meta
    #> $meta$source
    #> [1] "gbif"
    #>
    #> $meta$time
    #> [1] "2014-03-16 17:39:31.716 PDT"
    #>
    #> $meta$query
    #> [1] "Accipiter striatus"
    #>
    #> $meta$type
    #> [1] "sci"
    #>
    #> $meta$opts
    #> list()
    #>
    #>
    #> $data
    #> $data$Accipiter_striatus
    #>                  name       key longitude latitude prov
    #> 1  Accipiter striatus 891040018    -97.65   30.158 gbif
    #> 2  Accipiter striatus 891040169   -122.44   37.490 gbif
    #> 3  Accipiter striatus 891035119    -71.73   18.270 gbif
    #> 4  Accipiter striatus 891035349    -72.53   43.132 gbif
    #> 5  Accipiter striatus 891038901    -97.20   32.860 gbif
    #> 6  Accipiter striatus 891048899    -73.07   43.632 gbif
    #> 7  Accipiter striatus 891049443    -99.10   26.491 gbif
    #> 8  Accipiter striatus 891050439    -97.88   26.102 gbif
    #> 9  Accipiter striatus 891043765    -76.64   41.856 gbif
    #> 10 Accipiter striatus 891056214   -117.15   32.704 gbif
    #> 11 Accipiter striatus 891054792    -73.24   44.315 gbif
    #> 12 Accipiter striatus 768992325    -76.10    4.724 gbif
    #> 13 Accipiter striatus 859267562   -108.34   36.732 gbif
    #> 14 Accipiter striatus 859267548   -108.34   36.732 gbif
    #> 15 Accipiter striatus 859267717   -108.34   36.732 gbif
    #> 16 Accipiter striatus 891043784    -73.05   43.605 gbif
    #> 17 Accipiter striatus 891118711   -122.18   37.786 gbif
    #> 18 Accipiter striatus 891116600    -97.32   32.821 gbif
    #> 19 Accipiter striatus 891124493   -117.11   32.632 gbif
    #> 20 Accipiter striatus 891125442   -122.88   38.612 gbif
    #> 21 Accipiter striatus 891127900   -122.36   37.778 gbif
    #> 22 Accipiter striatus 891128609    -97.98   32.761 gbif
    #> 23 Accipiter striatus 891121966    -76.55   38.672 gbif
    #> 24 Accipiter striatus 868487120    -83.83   42.333 gbif
    #> 25 Accipiter striatus 891131416    -72.59   43.853 gbif
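Because the result is a nested list, you can drill down to a single species' records with ordinary `$` indexing and then work with them like any other data.frame. The sketch below uses a small hand-built stand-in with the same nesting as the printed output above, so that it runs without a network connection. The object name mock_results is made up for this example, and only the first three rows of the output are reproduced.

```r
# Mock object with the same nesting as the printed output above.
# It is a stand-in so the example runs offline; real objects come from occ().
mock_results <- list(
  gbif = list(
    meta = list(source = "gbif", query = "Accipiter striatus", type = "sci"),
    data = list(
      Accipiter_striatus = data.frame(
        name      = rep("Accipiter striatus", 3),
        key       = c(891040018, 891040169, 891035119),
        longitude = c(-97.65, -122.44, -71.73),
        latitude  = c(30.158, 37.490, 18.270),
        prov      = "gbif",
        stringsAsFactors = FALSE
      )
    )
  )
)

# Drill down: source -> data -> species. In the output above, the space in
# the species name appears as an underscore in the element name.
hawk <- mock_results$gbif$data$Accipiter_striatus
nrow(hawk)
range(hawk$latitude)

# Ordinary data.frame subsetting works from here on,
# e.g. keep only records north of 30 degrees latitude.
northern <- hawk[hawk$latitude > 30, ]
northern
```

With a real result you would replace mock_results with the object returned by occ, for example results from the query above.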
If you prefer data from more than one source, simply pass a vector of source names for the from argument. Example:
    occ(query = 'Accipiter striatus', from = c('ecoengine', 'gbif'))
    #> Summary of results - occurrences found for:
    #> gbif : 25 records across 1 species
    #> bison : 0 records across 1 species
    #> inat : 0 records across 1 species
    #> ebird : 0 records across 1 species
    #> ecoengine : 25 records across 1 species
    #> antweb : 0 records across 1 species
We can also search for multiple species across multiple engines.
    species_list <- c("Accipiter gentilis", "Accipiter poliogaster", "Accipiter badius")
    res_set <- occ(species_list, from = c('gbif', 'ecoengine'))
Similarly, we can search for data on the Sharp-shinned Hawk from other data sources too.
    occ(query = 'Accipiter striatus', from = 'ecoengine')
    # or look for data on other species
    occ(query = 'Danaus plexippus', from = 'inat')
    occ(query = 'Bison bison', from = 'bison')
    occ(query = "acanthognathus brevicornis", from = "antweb")
occ is also extremely flexible and can take package-specific arguments for any source you might be querying. You can pass these as a list under the source-specific options argument (e.g. ecoengine_opts). See the help file for ?occ for more information.
Visualizing biodiversity data
We provide several methods to visualize the resulting data. Current options include Leaflet.js, ggmap, a Mapbox implementation in a GitHub gist, or a static map.
Mapping with Leaflet
    spp <- c("Danaus plexippus", "Accipiter striatus", "Pinus contorta")
    dat <- occ(query = spp, from = "gbif", gbifopts = list(georeferenced = TRUE))
    # occ2df, as the name suggests, converts data contained inside an occ class to an R data.frame
    data <- occ2df(dat)
    mapleaflet(data = data, dest = ".")
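To make the flattening step less abstract, here is a rough base-R sketch of stacking per-species data frames into a single table. It runs on a small hand-built list (the first species reuses two rows from the GBIF output earlier in this post, the second is invented) and only illustrates the general idea. It is not the actual implementation of occ2df, which may work differently and return different columns.

```r
# Hand-built stand-in for the per-species data frames stored under one source.
per_species <- list(
  Accipiter_striatus = data.frame(
    name = "Accipiter striatus",
    longitude = c(-97.65, -122.44),
    latitude = c(30.158, 37.490),
    prov = "gbif",
    stringsAsFactors = FALSE
  ),
  # Coordinates below are invented for illustration.
  Danaus_plexippus = data.frame(
    name = "Danaus plexippus",
    longitude = -99.10,
    latitude = 26.491,
    prov = "gbif",
    stringsAsFactors = FALSE
  )
)

# Stack the per-species tables into a single data.frame.
flat <- do.call(rbind, per_species)

# rbind builds row names from the list names; reset them to plain numbers.
rownames(flat) <- NULL
flat
```

A flat table like this, with one row per record and the coordinates in their own columns, is the shape that mapping functions generally expect.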
Render a geojson file automatically as a GitHub gist
To have a map automatically posted as a gist, you'll need to set up your GitHub credentials ahead of time. You can either pass these as variables (e.g. github.password), or store them in your options (taking the same precautions you would with any password, of course). If you don't have these stored, you'll be prompted to enter them before posting.
    spp <- c("Danaus plexippus", "Accipiter striatus", "Pinus contorta")
    dat <- occ(query = spp, from = "gbif", gbifopts = list(georeferenced = TRUE))
    dat <- fixnames(dat)
    dat <- occ2df(dat)
    mapgist(data = dat, color = c("#976AAE", "#6B944D", "#BD5945"))
If interactive maps aren't your cup of tea, or you prefer to have one that you can embed in a paper, try one of our static map options. You can go with the more elegant ggmap option or stick with something from base graphics.
    ecoengine_data <- occ(query = "Lynx rufus californicus", from = "ecoengine")
    mapggplot(ecoengine_data)
    spnames <- c("Accipiter striatus", "Setophaga caerulescens", "Spinus tristis")
    base_data <- occ(query = spnames, from = "gbif", gbifopts = list(georeferenced = TRUE))
    plot(base_data, cex = 1, pch = 10)
- As soon as we have an updated rvertnet package, we'll add the ability to query VertNet data from spocc.
- We will add rCharts as an official import once the package is on CRAN (ETA end of March).
- We'll add a function to make interactive maps using RStudio's Shiny in a future version.