API Resources for the Scientific Literature in R and Python

(This article was first published on Paul Oldham's Analytics Blog, and kindly contributed to R-bloggers)

This short post lists some of the main APIs (web services) that can be used to monitor and retrieve data from the scientific literature in either R or Python. We are using these packages and libraries as part of a GIZ-supported project with the authorities in Kenya who are responsible for providing research permits. Kenya is famous for its biodiversity and the diversity of its communities. However, there is no single repository of publications arising from research in Kenya. We are looking to use APIs to automate the retrieval of publications about Kenya and its biodiversity. We hope this will allow us to build an open access virtual repository of publications on Kenya to serve the needs of researchers and the wider community.

We plan to use three main APIs for the Kenya project. There are many APIs out there but we will focus on those that aggregate data from different sources. I’ll add a few more that are interesting mainly for biodiversity topics.

Main APIs

Crossref

Crossref provides access to metadata on over 96 million scientific publications. It is not a full text search engine, although abstracts are increasingly available, as are links to full text versions of articles (which may well be paywalled).

  1. The Crossref API: https://github.com/CrossRef/rest-api-doc
  2. R (rcrossref): https://github.com/ropensci/rcrossref
  3. Python (habanero): https://pypi.org/project/habanero/
  4. Ruby (for lovers of all things Ruby): the Serrano gem https://github.com/sckott/serrano and rubydoc version

The rcrossref, Python and Ruby wrappers were all created by Scott Chamberlain and collaborators at the fantastic rOpenSci. Note that searching on Crossref is rather limited and so cannot really be used for statistical purposes (a search only covers the metadata Crossref has available, and that may be quite mixed), but Crossref is still really useful. In particular, it can be used to search for the names of researchers and retrieve their publication details, or to look up the records for a list of DOIs.

A walkthrough on using rcrossref to access the scientific literature for Kenya is available here.
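For Python users, the two uses mentioned above (searching by researcher name and looking up DOIs) can be sketched directly against the Crossref REST API using only the standard library. This is a minimal sketch rather than the approach taken in the walkthrough: the helper names are my own, the DOI in the usage note is a made-up placeholder, and I am assuming the `query.author`, `rows` and `mailto` parameters and the `message` / `items` response layout described in the Crossref API documentation.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

CROSSREF_API = "https://api.crossref.org"


def works_query_url(author=None, query=None, rows=20, mailto=None):
    """Build a /works search URL (author maps to the query.author field)."""
    params = {}
    if query:
        params["query"] = query
    if author:
        params["query.author"] = author
    params["rows"] = rows
    if mailto:
        # Assumption: identifying yourself with an email address is optional.
        params["mailto"] = mailto
    return f"{CROSSREF_API}/works?{urlencode(params)}"


def doi_url(doi):
    """Build the URL for the metadata record of a single DOI."""
    return f"{CROSSREF_API}/works/{quote(doi, safe='/')}"


def fetch_json(url):
    """Download a URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def summarise(payload):
    """Return (DOI, first title) pairs from a decoded /works search response."""
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [None])[0]) for item in items]
```

Something like `summarise(fetch_json(works_query_url(author="Paul Oldham", rows=5)))` would then list a handful of matching records, and looping `doi_url()` over a list of DOIs (for example the placeholder `10.5555/12345678`) gives the addresses of the individual records.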

For text retrieval and text mining, the crminer package by Scott Chamberlain is intended to facilitate access to full texts for text mining purposes from Crossref. You will also very probably want to check out Scott’s fulltext package for text retrieval from a range of different APIs including some of those listed here.

ORCID

ORCID provides persistent unique identifiers for researchers and access to their public profiles. Where a researcher publishes an article with a DOI that is covered by Crossref, that DOI should automatically (with luck) be added to the researcher’s public profile. Note that you can only access the parts of an ORCID profile that a researcher chooses to make public.

An example of an ORCID public profile is mine: https://orcid.org/0000-0002-1013-4390

Lists of publications can be retrieved using the API and can therefore be used to automate the creation of a repository of publications for a country without needing to chase the researcher through email.

  1. ORCID API home page for creating an app: https://orcid.org/organizations/integrators/API
  2. ORCID Python library: https://github.com/ORCID/python-orcid
  3. ORCID R Package: https://github.com/ropensci/rorcid

Note that when using a remote server the OAuth process (using the rorcid package) can be difficult because the API triggers a browser login. A way around this needs to be found.
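As a rough illustration of using ORCID records to build a publication list, the sketch below checks that an ORCID iD is well formed (using the published ISO 7064 MOD 11-2 check digit) and pulls the DOIs out of a works listing that has already been downloaded and decoded from JSON. The function names are hypothetical, and both the `pub.orcid.org/v3.0/{id}/works` address and the `group` / `external-ids` layout are assumptions based on my reading of the public API documentation, so check them against a real response before relying on this.

```python
import re

ORCID_PATTERN = re.compile(r"(\d{4}-){3}\d{3}[\dX]")


def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the iD has the 0000-0000-0000-000X shape and a correct check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def works_url(orcid):
    """Assumed public API address for the list of works on a profile."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid!r}")
    return f"https://pub.orcid.org/v3.0/{orcid}/works"


def dois_from_works(payload):
    """Collect unique DOIs from a decoded works listing (assumed layout)."""
    dois = set()
    for group in payload.get("group", []):
        ids = (group.get("external-ids") or {}).get("external-id") or []
        for ext in ids:
            if ext.get("external-id-type") == "doi":
                dois.add(ext["external-id-value"].lower())
    return sorted(dois)
```

The resulting DOIs could then be passed to Crossref to retrieve the full publication details for each researcher.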

CORE (core.ac.uk)

CORE is a full text database that aggregates scientific publications from open access repositories. The name can make it difficult to find, but it provides access to over 131 million open access articles. Taking Kenya as an example, a quick search reveals 103,310 publications that contain Kenya somewhere in the text. The services page provides details of the web service, what you can do and how to get started. You will need a free API key from here. Note the quotas and throttle your requests accordingly.

  1. Python notebook with examples: https://github.com/oacore/or2016-api-demo
  2. R Package rcoreoa: https://github.com/ropensci/rcoreoa
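On throttling, one simple way to stay inside a quota is to enforce a minimum gap between requests. The sketch below is a generic helper rather than anything specific to CORE: the `Throttle` class is my own naming, and the figure of 10 calls per 60 seconds in the usage note is purely illustrative, so substitute the quota attached to your own API key.

```python
import time


class Throttle:
    """Space out calls so that no more than max_calls start per period seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        # Spreading calls evenly keeps us under the quota without bursts.
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

In use, create `throttle = Throttle(max_calls=10, period=60)` once and call `throttle.wait()` immediately before each request in a loop. The `clock` and `sleep` arguments exist only so the timing can be tested without actually waiting.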

Other APIs

The resources above should capture a lot. But here are some other major APIs that you may want to use.

Springer BioMed Central API

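  1. BMC R package: https://github.com/ropensci/bmc. This package is not on CRAN. To install it from GitHub use:
     # devtools provides install_github(); install it from CRAN first
     install.packages("devtools")
     # then install bmc from its GitHub repository
     devtools::install_github("ropensci/bmc")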

I couldn’t easily identify a Python library or gist. If you know of one please add to the comments below.

NCBI PubMed

  1. The rentrez package and walkthrough
  2. The easyPubMed package in R: see the walkthrough by Damiano Fantini
  3. For Python there is pubmed-lookup and a gist for searching PubMed with Biopython is here
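If you prefer to avoid an extra dependency in Python, PubMed can also be queried directly through the NCBI E-utilities. The sketch below builds an ESearch request and reads the PubMed IDs out of a JSON response that has already been downloaded and decoded. The function names are mine, the search term is only an example, and I am assuming the `esearchresult` / `idlist` layout that ESearch returns with `retmode=json`.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20, retstart=0):
    """Build an ESearch URL that asks PubMed for IDs matching a search term."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,      # number of IDs to return in this page
        "retstart": retstart,  # offset of the first ID, for paging
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(payload):
    """Return (total hit count, list of PubMed IDs) from a decoded response."""
    result = payload["esearchresult"]
    return int(result["count"]), list(result["idlist"])
```

Increasing `retstart` in steps of `retmax` pages through a larger result set, and the returned IDs can be handed to rentrez, easyPubMed or Biopython to fetch the full records.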

Public Library of Science

  1. rplos R package: https://github.com/ropensci/rplos
  2. For Python a gist is available providing examples of the use of the sunburnt library

One of my walkthroughs for rplos, now a bit old but still working, is available here.

bioRxiv

  1. For R, the fulltext package provides access to the texts of bioRxiv, which has an RSS feed but does not appear to have an API.

I wasn’t able to spot anything for Python, and maybe it’s a matter of wrangling the RSS feed, so if you know of anything please add a comment.

The Alerts/RSS page provides details of the most recent 30 posts across categories, and there is a Twitter feed by subject that people have tried to do interesting things with by creating Twitter bots.
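Wrangling the RSS feed in Python need not be much work. The sketch below uses only the standard library to turn the text of a feed into one dictionary per item. It assumes nothing about the bioRxiv feed beyond it being XML with `item` elements: namespaces are stripped so the same code should cope with RSS 1.0 (RDF) and RSS 2.0 layouts, but the element names you get back depend on the feed, and `parse_feed` and `local_name` are simply my own helper names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_feed(xml_text):
    """Return one dict per <item>, keyed by the local names of its children."""
    root = ET.fromstring(xml_text)
    entries = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        entry = {}
        for child in element:
            text = (child.text or "").strip()
            if text:
                # Keep the first value if a child name is repeated.
                entry.setdefault(local_name(child.tag), text)
        entries.append(entry)
    return entries
```

The input can be the bytes returned by `urllib.request.urlopen(feed_url).read()`, where `feed_url` is whichever feed address you copy from the Alerts/RSS page.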

Round Up

I hope you found this quick list useful. If you know of any other good resources in either R or Python please feel welcome to add a comment.

References

Chamberlain, Scott. 2017a. Crminer: Fetch ’Scholary’ Full Text from ’Crossref’. https://CRAN.R-project.org/package=crminer.

———. 2017b. Rcoreoa: Client for the Core Api. https://CRAN.R-project.org/package=rcoreoa.

———. 2018a. Fulltext: Full Text of ’Scholarly’ Articles Across Many Data Sources. https://CRAN.R-project.org/package=fulltext.

———. 2018b. Rorcid: Interface to the ’Orcid.org’ ’Api’. https://CRAN.R-project.org/package=rorcid.

Chamberlain, Scott, Carl Boettiger, Ted Hart, and Karthik Ram. 2018. Rcrossref: Client for Various ’Crossref’ ’Apis’. https://github.com/ropensci/rcrossref.

Chamberlain, Scott, Carl Boettiger, and Karthik Ram. 2017. Rplos: Interface to the Search ’Api’ for ’Plos’ Journals. https://CRAN.R-project.org/package=rplos.

Fantini, Damiano. 2018. EasyPubMed: Search and Retrieve Scientific Publication Records from Pubmed. https://CRAN.R-project.org/package=easyPubMed.

Winter, David. 2018. Rentrez: ’Entrez’ in R. https://CRAN.R-project.org/package=rentrez.
