The COVID-19 pandemic has dramatically impacted all of our lives in a very short period of time. Spring and summer are usually very busy as students prepare to go to the field to engage in various data collection efforts. The pandemic has disrupted these carefully planned activities as travel is suspended and local and remote field stations have closed indefinitely. A lost field season can be a major setback for a dissertation timeline and students will have to improvise. One promising opportunity to continue research efforts during these unprecedented times is taking advantage of the massive amounts of open scientific data that are freely available. Open data can form the basis of a review, synthesis, or new research.
Inspired by tweets from Ethan White about “PhD research from a distance”, the rOpenSci team did an in-depth exploration of how we provide access to open data. Our goal is to inspire students to find research opportunities with open data and highlight some of the rOpenSci packages that already make programmatic access possible. We also highlight some examples of how specific collections of packages are being used right now in fields as varied as archaeology and climate science.
Exploring open data
Data are fundamental to scientific discovery and leveraging new discoveries would not be possible without access to data 1.
Although people rarely develop new research entirely on open data, these datasets can be used to reproduce and validate existing results, improve models, and, when combined with other data, generate new syntheses.
The open science movement has been growing for over a decade and all of that interest has surfaced numerous databases and repositories. The growing interest in reproducibility has also led to the creation of a plethora of open source software to access such data.
rOpenSci’s core mission is to develop such tools and to date we have built over 120 robust data-access packages.
These packages provide access to an impressive variety and quantity of data:
eBird offers up 700 million observations, Crossref has 108 million records of scholarly works which include articles and books, Dryad makes available 13 terabytes of data associated with published papers, and GBIF has over 1.3 billion records of species worldwide.
We hope that this post and these tools provide inspiration for you to explore new data sources and research topics.
Data sources for your research
Many of rOpenSci’s tools are developed by practicing scientists and have strong communities behind them. We invited university faculty from our community of developer-researchers to highlight sources of open data for research in their fields.
Climate and weather
Brooke Anderson, Colorado State University
Research on weather and climate, and their impacts on humans and the environment, can draw on numerous excellent open data sources, including many made available through programmatic access to data collected and shared by institutions and monitoring networks. The US Geological Survey offers a particularly exciting example, providing not only APIs for accessing their data, but also a full suite of R packages developed and shared through the USGS-R community. rOpenSci’s own rnoaa package provides access to data through a number of the US National Oceanic and Atmospheric Administration’s open data APIs, allowing for fast and convenient access from R to national or worldwide data on, among others, meteorological observations, sea ice, and tides and currents, while its bomrang package offers similar access to data from the Australian Government Bureau of Meteorology. Other rOpenSci packages provide access to weather- and climate-related data from the Iowa Environment Mesonet (riem), New Zealand’s National Climate Database (clifro), the US National Aeronautics and Space Administration’s Prediction of Worldwide Energy Resource (POWER) dataset (nasapower), the US National Centers for Environmental Information’s Global Surface Summary of the Day (GSOD) dataset (GSODR), the US National Hurricane Center (rrricanes), the Flanders Environment Agency and Flanders Hydraulics Research’s waterinfo.be dataset (wateRinfo), and Environment and Climate Change Canada (ECCC) (weathercan). bowerbird is a general-purpose package for maintaining local copies of a range of satellite- and model-derived environmental and climate data.
Hydrology
Louise Slater, University of Oxford, Sam Zipper, University of Kansas, Ilaria Prosdocimi, Ca’ Foscari University, Sam Albers, Government of British Columbia, and Claudia Vitolo, European Centre for Medium Range Weather Forecasts
In hydrology, there has been rapid growth in the number of streamflow data archives made publicly available online by countries such as the UK (rnrfa package), USA (dataRetrieval package), Greece (rOpenSci’s hydroscoper package), and Canada (rOpenSci’s tidyhydat package), although most countries sadly do not yet apply an open policy to their hydrological data. The Task View on Hydrological Data and Modelling and the accompanying blog post Getting your toes wet in R: Hydrology, meteorology, and more provide an overview of the most up-to-date R packages that are available for downloading, analysing, and modelling these data. For an overview of the many advantages of using R for hydrological research, see the paper “Using R in Hydrology” 2, which describes approaches to retrieve, analyse, map, model, and visualise hydrological data.
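As a rough illustration of what programmatic access to one of these streamflow archives can look like, here is a minimal sketch using the dataRetrieval package mentioned above. The wrapper name `fetch_daily_flow` is our own invention for illustration, and the site number and dates in the usage comment are only examples. The sketch assumes dataRetrieval is installed, that the USGS web services are reachable, and that the function names shown are still current in your installed version.

```r
# Sketch only: assumes the dataRetrieval package is installed and the USGS
# web services are reachable. Function names may differ in newer releases.
# fetch_daily_flow() is a hypothetical helper, not part of dataRetrieval.
fetch_daily_flow <- function(site_id, start_date, end_date) {
  if (!requireNamespace("dataRetrieval", quietly = TRUE)) {
    stop("Please install dataRetrieval first: install.packages('dataRetrieval')")
  }
  # "00060" is the USGS parameter code for discharge (streamflow)
  flow <- dataRetrieval::readNWISdv(
    siteNumbers = site_id,
    parameterCd = "00060",
    startDate = start_date,
    endDate = end_date
  )
  # Replace coded column names with more readable ones
  dataRetrieval::renameNWISColumns(flow)
}

# Example call (needs an internet connection); the site number is illustrative:
# flow <- fetch_daily_flow("01491000", "2019-01-01", "2019-12-31")
# head(flow)
```

Wrapping the request in a small function keeps the site and date range as explicit arguments, which makes it easier to rerun the same query for several gauges.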
Antarctic and Southern Ocean
Ben Raymond, Australian Antarctic Division and Anton Van de Putte, Royal Belgian Institute for Natural Science
Antarctic science has a strong culture of open data: the Antarctic Treaty itself states that scientific observations and results from Antarctica should be openly shared, and the Scientific Committee on Antarctic Research has had an active data management group since the late 1980s. To find Antarctic and Southern Ocean data, search the Antarctic Master Directory (metadata catalogue) or portals such as the Antarctic Biodiversity portal or the Southern Ocean Observing System.
The Antarctic rOpenSci community is developing R resources to support Antarctic and Southern Ocean science, with a particular emphasis on simplifying data access and performing common analytical tasks. See this blog post and task view for an overview of some of the packages in development, and the types of analyses that we are aiming to support.
Archaeology
Ben Marwick, University of Washington
Research shuddered to a stop in the Geoarchaeology Lab in early March, with UW being one of the first US campuses to switch to remote work. No longer able to go to campus, we turned our attention to computational text analysis of a large corpus of archaeological conference abstracts to look at questions about gender imbalance and theory change in our field. Our quick pivot to this new area was only possible thanks to high-quality and well-documented software such as rOpenSci’s tesseract, pdftools and magick packages. These enabled us to generate data rapidly, giving us more time for exploring and testing hypotheses, and ensuring our students could get to the end of the term ready to share some really interesting results.
We’ve been keeping up with the literature through in-depth study of new journal articles, especially those that include open data. Archaeologists use specialised repositories such as the Digital Archaeological Record (tDAR) and Open Context, as well as several generic repositories, to share data (e.g. Zenodo, Figshare, Dataverse; each of these has an R package to access data). There are R packages for accessing data hosted by those archaeology repositories (tdar, opencontext), but many of our favourite recent articles (we keep a list here) had their data openly archived on the Open Science Framework data repository. While studying these articles we have enjoyed using rOpenSci’s osfr package to quickly and reproducibly access these materials for in-depth exploration. A favourite type of data for many archaeologists is radiocarbon ages, and our group has been working with these with ease thanks to the c14bazAAR package. We’ve been using this package to get data to study radiocarbon dates from hundreds of archaeological sites in Australia. While we’re missing the lab, rOpenSci’s packages for acquiring archaeological data have been invaluable tools for keeping us active and engaged in our research.
Our task view for archaeological science shows the full range of tools we use, from data acquisition through environmental and geological analysis to writing reproducible manuscripts.
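To give a feel for the osfr workflow described above, here is a minimal sketch of retrieving the files attached to an Open Science Framework project. The helper name `fetch_osf_files` and the project identifier in the usage comment are hypothetical placeholders, not taken from any of the articles mentioned. The sketch assumes osfr is installed and that the project is public (private projects need authentication).

```r
# Sketch only: assumes the osfr package is installed and the OSF project is
# public. fetch_osf_files() is a hypothetical helper, not part of osfr.
fetch_osf_files <- function(project_id, dest_dir = tempdir()) {
  if (!requireNamespace("osfr", quietly = TRUE)) {
    stop("Please install osfr first: install.packages('osfr')")
  }
  # Look up the project (an OSF "node") by its short identifier
  project <- osfr::osf_retrieve_node(project_id)
  # List files and folders at the top level of the project's storage
  # (osf_ls_files() returns only a limited number of entries by default)
  files <- osfr::osf_ls_files(project)
  # Download the listed entries, skipping anything already present locally
  osfr::osf_download(files, path = dest_dir, conflicts = "skip")
}

# Example call (needs an internet connection); "abcde" is a placeholder ID:
# downloaded <- fetch_osf_files("abcde")
```

Keeping the download step in a script rather than clicking through the website means the exact materials used in an analysis can be fetched again later.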
Transport
Robin Lovelace, University of Leeds
There has never been a better time for data-driven and reproducible transport research. The COVID-19 pandemic has disrupted transport patterns worldwide. This has led to changes, such as the construction of ‘pop-up’ active transport infrastructure, the prioritisation of which can be supported by reproducible and open data analysis, as outlined in a preprint on the topic (the analysis for which was undertaken in R) 3. There is a wealth of data out there that can be found with careful search queries, and many new datasets are appearing (like Uber’s micromobility datasets, released on May 6th of this year).
For open origin-destination data there are many resources, but the PCT package provides a way to access national-scale datasets quickly from the R command line, as outlined in stplanr’s origin-destination vignette.
For road safety data there is a lack of open data in many countries, but with the stats19 package you can access Great Britain’s national road casualty data, with 60+ variables and 100,000+ records each year.
For inspiration, I recommend checking out the Propensity to Cycle Tool, an interactive free and open web app that is being used to inform active transport investment plans in dozens of cities across the UK (it also has many data download options at zone, route and route network levels).
Taxonomy, biodiversity, ecology
rOpenSci has its roots in software for biodiversity research, with many packages in the areas of taxonomy, biological occurrences, and natural history/traits.
taxonomy: A good place to start is the taxonomy task view, covering many options for working with online taxonomy data
occurrences: Occurrence data forms the basis of much ecological research. The largest source of occurrence data, GBIF, can be accessed with the rgbif package. Many more are listed in the README for the package spocc.
natural history/traits: Conservation researchers may want to fetch data from the IUCN Red List via rredlist, Fishbase life history data from rfishbase, bird data from auk or rebird, or trait data from various marine taxa in WoRMS (called “attributes” by WoRMS; worrms).
A good general resource for rOpenSci packages on biodiversity is the rOpenSci Community Call from March 2019: Research Applications of rOpenSci Taxonomy and Biodiversity Tools.
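As a small illustration of the occurrence-data access described above, the following sketch queries GBIF through the rgbif package. The wrapper name `fetch_occurrences` is ours, and the species in the usage comment is only an example. The sketch assumes rgbif is installed and that the GBIF API is reachable. Small searches like this one do not need a GBIF account, although larger downloads do.

```r
# Sketch only: assumes the rgbif package is installed and the GBIF API is
# reachable. fetch_occurrences() is a hypothetical helper, not part of rgbif.
fetch_occurrences <- function(species_name, n = 50) {
  if (!requireNamespace("rgbif", quietly = TRUE)) {
    stop("Please install rgbif first: install.packages('rgbif')")
  }
  res <- rgbif::occ_search(
    scientificName = species_name,
    hasCoordinate = TRUE,  # keep only records with latitude and longitude
    limit = n              # cap the number of records returned
  )
  # The occurrence records themselves are in the `data` element
  res$data
}

# Example call (needs an internet connection); the species is illustrative:
# occ <- fetch_occurrences("Puma concolor", n = 20)
# head(occ)
```

The same pattern, a thin function around a single package call, carries over to the other occurrence sources listed in the spocc README.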
Browse our table of more than 100 data-access packages below, or jump ahead to see where you come in.
rOpenSci data-access packages
The table below shows a subset of our full suite of R packages. Click on the package name to see a list of scientific use cases.
|Data and source
|Antarctic geographic names. Composite Gazetteer of Antarctica
|Ant data. AntWeb database from the California Academy of Sciences
|bird sighting records. http://ebird.org
|Historic ride data from public hire bicycle systems. London, U.K.; San Francisco CA, New York City NY, Chicago IL, Washington DC, Boston MA, Los Angeles CA, Philadelphia PA, and Minnesota, U.S.A.; Montreal, Canada; and Guadalajara, Mexico.
|genomic data retrieval. ‘NCBI RefSeq’, ‘NCBI Genbank’, ‘ENSEMBL’, and ‘UniProt’ databases, plus interface to ‘BioMart’ database
|Bittrex crypto-currency exchange. https://bittrex.com
|Bold Systems for genetic barcode data. http://www.boldsystems.org
|phylogenetic data. ‘Phylomatic’ http://phylodiversity.net/phylomatic, and ‘Phylocom’ https://github.com/phylocom/phylocom
|Time series of global, direct, and diffuse irradiations on horizontal surface. Copernicus Atmosphere Monitoring Service (CAMS)
|Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models.
|Chromosome Counts Database. http://ccdb.tau.ac.il |Paula Andrea Martinez
|New Zealand National Climate Database. https://cliflo.niwa.co.nz
|United Nations Comtrade data. https://comtrade.un.org/data
|transcription factor/microRNA-gene correlations (co-expression) in cancer. Cistrome Cancer Liu et al. (2011) doi:10.1186/gb-2011-12-8-r83 and ‘miRCancerdb’ databases (in press).
|OPeNDAP servers. https://www.opendap.org
|South Florida Water Management District’s ‘DBHYDRO’ database. https://www.sfwmd.gov/science-data/dbhydro
|Drosophila odorant response data for DoOR.functions.
|Georeferenced specimen records from the University of California, Berkeley’s Natural History Museums. https://ecoengine.berkeley.edu
|reading and parsing of internal e-book content from EPUB files. EPUB e-books.
|European Social Survey data. http://www.europeansocialsurvey.org
|Geospatial data from several federated data sources (mainly sources maintained by the US federal government). National Elevation Dataset and National Hydrography Dataset (USGS), the Soil Survey Geographic (SSURGO) database, the Global Historical Climatology Network (GHCN), the Daymet gridded estimates of daily weather parameters, the International Tree Ring Data Bank, and the National Land Cover Database (NLCD). |R. Kyle Bocinsky
|Data for many indicators of public health in England. http://fingertips.phe.org.uk
|Historical datasets of first names and dates of birth.
|University of East Anglia Climate Research Unit gridded climatology of monthly means. https://crudata.uea.ac.uk/cru/data/hrg/tmc/readme.txt
|Landsat 8 Data. https://registry.opendata.aws/landsat-8
|Global Surface Summary of the Day (GSOD) weather data from USA National Centers for Environmental Information (NCEI). http://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt
|public GTFS feeds.
|Project Gutenberg collection. http://www.gutenberg.org
|HathiTrust bibliographic API. https://www.hathitrust.org
|hydrological data. various data providers
|London Natural History Museum’s host-parasite database. http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites
|sample data sets for historians on population, institutional, religious, military, and prosopographical data.
|Greek National Data Bank for Hydrological and Meteorological Information. http://www.hydroscope.gr
|Internet Archive. https://archive.org/
|NOAA Integrated Surface Data. https://www.ncdc.noaa.gov/isd
|Directory of Open Access Journals. https://doaj.org
|time series of rasters from MODIS Satellite Land Products data.
|museum metadata. Many different museums, including the MET, Getty Museum, and more
|NASA POWER (Prediction Of Worldwide Energy Resource) global meteorology and surface solar energy climatology data. https://power.larc.nasa.gov |Adam H. Sparks
|paleoecological datasets from the Neotoma Paleoecological Database. http://api.neotomadb.org |Simon J. Goring
|UK official statistics from the Nomis database, including data from the Census, the Labour Force Survey, DWP benefit statistics and other economic and demographic data from the Office for National Statistics. https://www.nomisweb.co.uk/api/v01/help
|Transcriptomes of over 1000 plant species. The 1000 Plants Initiative (www.onekp.com)
|Open Context data. https://opencontext.org
|Species origin data from multiple sources. Encyclopedia of Life (http://eol.org), Flora ‘Europaea’ (http://rbg-web2.rbge.org.uk/FE/fe.html), Global Invasive Species Database (http://www.iucngisd.org/gisd), the Native Species Resolver (http://bien.nceas.ucsb.edu/bien/tools/nsr/), Integrated Taxonomic Information Service (http://www.itis.gov/), and Global Register of Introduced and Invasive Species (http://www.griis.org/).
|OpenStreetMap data. https://openstreetmap.org
|Ocean time series datasets, including BATS, HOT, and more.
|PaleobioDB fossil data. http://paleobiodb.org/data1.1
|Pangaea Database. https://www.pangaea.de
|Orthologous sequence clusters within taxonomic groups from GenBank. https://www.ncbi.nlm.nih.gov/genbank
|Pleiades data. https://pleiades.stoa.org
|Oregon State Prism climate data. http://www.prism.oregonstate.edu/
|Survey results from the Qualtrics API. https://www.qualtrics.com/about
|proyectoavis database. http://proyectoavis.com
|Bielefeld Academic Search Engine (BASE) of more than 150 million scholarly documents from more than 7000 sources. https://www.base-search.net
|Biodiversity Heritage Library (BHL) of digitized literature on biodiversity studies. https://www.biodiversitylibrary.org
|USGS BISON database for species occurrence data from the United States. https://bison.usgs.gov
|Libraries.io data from 36 different package managers for programming languages. https://libraries.io/api
|CORE API aggregates open access research outputs from repositories and journals. https://core.ac.uk/docs
|DataCite metadata. https://www.datacite.org
|Data Retriever. http://data-retriever.org
|DEFRA’s UK-AIR website. https://uk-air.defra.gov.uk
|DOPA (Digital Observatory for protected Areas) by the European Union Joint Research Centre.
|Digital Public Library of America. https://dp.la
|Dryad Solr data underlying scientific publications. https://datadryad.org
|eBird database of bird observations and locations. https://ebird.org/home
|NCBI’s EUtils API for databases like GenBank and PubMed. https://www.ncbi.nlm.nih.gov/genbank https://www.ncbi.nlm.nih.gov/pubmed
|ERDDAP servers. https://upwell.pfeg.noaa.gov/erddap/information.html
|Europeana web services. http://labs.europeana.eu/api
|Fishbase data on over 30,000 species of fish, their biology, ecology, morphology and more. http://www.fishbase.org http://www.sealifebase.org
|Flora of North America website data. http://www.efloras.org
|Global Biodiversity Information Facility (GBIF) data of species occurrence. https://www.gbif.org/developer/summary
|Global Biotic Interactions (GloBI) data on spatial-temporal species interactions. https://www.globalbioticinteractions.org/
|Global Population Dynamics Database. https://ecologicaldata.org/wiki/global-population-dynamics-database
|Weather data from Automated Surface Observing System (ASOS) stations. Iowa Environment Mesonet website.
|Neuroscience Information Framework (NIF) data. https://neuinfo.org
|iNaturalist website of species occurrence data submitted by citizen scientists. http://inaturalist.org
|Vector map data. http://www.naturalearthdata.com
|Many NOAA data sources including NCDC climate data, and data on sea ice, severe weather, historical metadata, storm and tornado data. https://www.ncdc.noaa.gov/cdo-web/webservices/v2
|National Phenology Network data on various life history events that occur at specific times. https://usanpn.org
|air quality data from the OpenAQ platform. https://docs.openaq.org
|Open Tree of Life data on phylogenetic trees. https://tree.opentreeoflife.org/
|Perseus Digital Library collection of classical texts. http://cts.perseids.org
|Global Plant Phenology Data Portal. https://www.plantphenology.org
|IUCN Red List of threatened and endangered species. http://apiv3.iucnredlist.org/api/v3/docs
|Data on past and current hurricanes and tropical storms for the Atlantic and eastern Pacific oceans. https://www.nhc.noaa.gov/archive/1998/1998archive.shtml
|Storm discussions, forecast/advisories, public advisories, wind speed probabilities, strike probabilities and more. National Hurricane Center
|SNP datasets for SNPs, genotypes, and phenotypes. https://opensnp.org https://www.ncbi.nlm.nih.gov/projects/SNP
|United States Department of Agriculture (USDA) data from the Systematic Mycology and Microbiology Laboratory (SMML).
|VertNet.org archives including taxonomic names, places, and dates. http://vertnet.org
|Model predictions from 15 different global circulation models in 20 years.
|air transport statistics from the Bureau of Transport Statistics (BTS) in the United States. https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0
|NASA Soil Moisture Active Passive (SMAP) data. https://smap.jpl.nasa.gov/
|data from Solr. https://lucene.apache.org/solr
|species occurrence data sources, including the Global Biodiversity Information Facility (GBIF).
|Supplementary materials from published manuscripts. |William D. Pearse
|Historical and real-time national hydrometric data from Water Survey of Canada data sources. http://dd.weather.gc.ca/hydrometric/csv http://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www
|International trade data from the Open Trade Statistics API.
|Species trait data from many different sources, including sequence data from NCBI, plant trait data from BETYdb, plant data from the USDA plants database, data from EOL Traitbank, Coral traits data, Birdlife International, and more.
|TreeBASE repository of phylogenetic trees (of species, population, or genes). http://treebase.org
|Boundaries for geographical units in the United States of America. U.S. Census Bureau, Newberry Library’s ‘Atlas of Historical County Boundaries’
|Higher resolution boundary data, for use in the USAboundaries package. U.S. Census Bureau, the Newberry Library’s ‘Historical Atlas of U.S. County Boundaries’, and Erik Steiner’s ‘United States Historical City Populations, 1790-2010’.
|Historical weather data from Environment and Climate Change Canada. http://climate.weather.gc.ca/historical_data/search_historic_data_e.html
|Chemical information from around the web.
This is where you come in!
Have you successfully used one or more of these data sources in your research? We want others to imagine what’s possible by seeing examples. Share your story in the comments and cite your paper or preprint if it’s published.
Is there a data source you want to access programmatically but there’s no R package to do that? Tell us about it in the comments.
Need help? Ask in our discussion forum and we’ll do our best to get you answers.
Slater, L. J., Thirel, G., Harrigan, S., Delaigue, O., Hurley, A., Khouakhi, A., Prosdocimi, I., Vitolo, C., & Smith, K. (2019). Using R in hydrology: a review of recent developments and future directions. Hydrology and Earth System Sciences, 23(7), 2939-2963. https://www.hydrol-earth-syst-sci.net/23/2939/2019/