This short post provides details on some of the main APIs (web services) that can be used to monitor and retrieve data from the scientific literature in either R or Python. We are using these packages and libraries as part of a GIZ-supported project with the authorities in Kenya who are responsible for issuing research permits. Kenya is famous for its biodiversity and the diversity of its communities. However, there is no single repository of publications arising from research in Kenya. We are looking to use APIs to automate the retrieval of publications about Kenya and its biodiversity. We hope this will allow us to build an open access virtual repository of publications on Kenya to serve the needs of researchers and the wider community.
We plan to use three main APIs for the Kenya project. There are many APIs out there but we will focus on those that aggregate data from different sources. I’ll add a few more that are interesting mainly for biodiversity topics.
Crossref provides access to metadata on over 96 million scientific publications. It is not a full text search engine although abstracts are increasingly available as are links to full text versions of articles (which may well be paywalled).
- The Crossref API: https://github.com/CrossRef/rest-api-doc
- R: rcrossref https://github.com/ropensci/rcrossref
- Python: habanero https://pypi.org/project/habanero/
- Ruby: for lovers of all things Ruby, try the Serrano gem https://github.com/sckott/serrano (a rubydoc version is also available)
The rcrossref, Python and Ruby wrappers were all created by Scott Chamberlain and collaborators at the fantastic rOpenSci. Note that searching on Crossref is rather limited: the search runs over whatever metadata Crossref has available for each record, and that can be quite mixed, so the results cannot really be used for statistical purposes. Crossref is still really useful, though. In particular, it can be used to search for the names of researchers, to retrieve publication details, or to look up a list of DOIs.
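As a rough illustration of the kind of query described above, here is a minimal Python sketch that talks to the Crossref REST API directly using only the standard library (the habanero library wraps the same service). The helper names `build_works_url`, `build_doi_url`, `summarise` and `search_works` are my own for illustration and are not part of any package. The sketch assumes the `/works` endpoint with the `query`, `rows` and `mailto` parameters and a JSON response whose records sit under `message.items`.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_works_url(query, rows=5, mailto=None):
    """Build a Crossref /works search URL for a free-text query."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Optional contact email, passed along so Crossref can identify the caller.
        params["mailto"] = mailto
    return CROSSREF_WORKS_URL + "?" + urllib.parse.urlencode(params)


def build_doi_url(doi):
    """Build the URL for looking up the metadata of a single DOI."""
    return CROSSREF_WORKS_URL + "/" + urllib.parse.quote(doi, safe="/")


def summarise(item):
    """Reduce one Crossref work record to its DOI, first title and year."""
    titles = item.get("title") or [None]
    parts = (item.get("issued") or {}).get("date-parts") or [[]]
    year = parts[0][0] if parts[0] else None
    return {"doi": item.get("DOI"), "title": titles[0], "year": year}


def search_works(query, rows=5, mailto=None):
    """Run the search over the network and return summarised records."""
    url = build_works_url(query, rows, mailto)
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return [summarise(item) for item in payload["message"]["items"]]


# Example (needs network access):
# for record in search_works("Kenya biodiversity", rows=3):
#     print(record)
```

Looping `build_doi_url` over a list of DOIs gives the "enter a list of DOIs" use case. Remember that the search results reflect whatever metadata Crossref holds, so treat counts with caution.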
A walkthrough on using rcrossref to access the scientific literature for Kenya is available here.
For text retrieval and text mining, the crminer package by Scott Chamberlain is intended to facilitate access to full texts for text mining purposes from Crossref. You will also very probably want to check out Scott’s fulltext package for text retrieval from a range of different APIs including some of those listed here.
ORCID provides persistent unique identifiers for researchers and access to their public profiles. Where a researcher publishes an article with a DOI that is covered by Crossref, that DOI should automatically (with luck) be added to the researcher’s public profile. Note that you can only access the parts of an ORCID profile that a researcher chooses to make public.
An example of an ORCID public profile is mine: https://orcid.org/0000-0002-1013-4390
Lists of publications can be retrieved using the API and can therefore be used to automate the creation of a repository of publications for a country without needing to chase the researcher through email.
- ORCID API home page for creating an app: https://orcid.org/organizations/integrators/API
- ORCID Python library: https://github.com/ORCID/python-orcid
- ORCID R Package: https://github.com/ropensci/rorcid
Note that when using a remote server the OAuth process (using the rorcid package) can be difficult because the API triggers a browser login. A way around this needs to be found.
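To give a feel for retrieving a list of publications from a public profile, here is a hedged Python sketch using only the standard library. It assumes the public API lives at `https://pub.orcid.org/v3.0`, that the `/works` endpoint returns JSON when asked via the `Accept` header, and that each work group carries `external-ids` and `work-summary` entries. Check these against the current ORCID documentation before relying on them. The function names are hypothetical. The optional `token` argument is for a bearer token you have obtained separately. How you obtain one depends on your ORCID API credentials and is not shown here.

```python
import json
import urllib.request

ORCID_PUBLIC_API = "https://pub.orcid.org/v3.0"  # assumed API version


def build_works_request(orcid_id, token=None):
    """Prepare a request for the public works list of one ORCID iD."""
    headers = {"Accept": "application/json"}
    if token:
        # Bearer token obtained elsewhere (assumption: not always required).
        headers["Authorization"] = "Bearer " + token
    url = f"{ORCID_PUBLIC_API}/{orcid_id}/works"
    return urllib.request.Request(url, headers=headers)


def extract_works(payload):
    """Pull a title and (where present) a DOI out of each work group."""
    works = []
    for group in payload.get("group", []):
        ids = (group.get("external-ids") or {}).get("external-id") or []
        doi = next(
            (i.get("external-id-value") for i in ids
             if i.get("external-id-type") == "doi"),
            None,
        )
        summaries = group.get("work-summary") or [{}]
        title = ((summaries[0].get("title") or {}).get("title") or {}).get("value")
        works.append({"title": title, "doi": doi})
    return works


def fetch_works(orcid_id, token=None):
    """Fetch and parse the works list over the network."""
    request = build_works_request(orcid_id, token)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return extract_works(json.load(resp))


# Example (needs network access):
# print(fetch_works("0000-0002-1013-4390"))
```

The DOIs returned this way can then be looked up in Crossref to fill in the full publication details.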
CORE is a full text database that aggregates scientific publications from open access repositories. The name makes it difficult to find through a search engine, but it provides access to over 131 million open access articles. Taking Kenya as an example, a quick search reveals 103,310 publications that contain Kenya somewhere in the text. The services page provides details of the web service, what you can do with it and how to get started. You will need a free API key from here. Note the quotas and throttle your requests accordingly.
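Throttling can be sketched in a few lines of Python. The example below is only an illustration. The endpoint `https://api.core.ac.uk/v3/search/works`, the `q`, `limit` and `offset` parameters and the bearer-token header are my assumptions about the current CORE API and should be checked against the services page. `Throttle` and `build_core_request` are hypothetical names, and the right interval depends on the quota attached to your key.

```python
import time
import urllib.parse
import urllib.request

# Assumed search endpoint; confirm against the CORE documentation.
CORE_SEARCH_URL = "https://api.core.ac.uk/v3/search/works"


def build_core_request(query, api_key, limit=10, offset=0):
    """Prepare a search request carrying the API key as a bearer token."""
    params = urllib.parse.urlencode({"q": query, "limit": limit, "offset": offset})
    return urllib.request.Request(
        CORE_SEARCH_URL + "?" + params,
        headers={"Authorization": "Bearer " + api_key},
    )


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Example (needs network access and a real key):
# throttle = Throttle(min_interval=6.0)  # interval chosen arbitrarily
# for offset in range(0, 30, 10):
#     throttle.wait()
#     request = build_core_request("Kenya", "YOUR_API_KEY", offset=offset)
#     with urllib.request.urlopen(request, timeout=30) as resp:
#         print(resp.status)
```

Calling `wait()` before every request keeps paging through a large result set, such as the Kenya search above, inside the quota without having to scatter `sleep` calls through the code.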
The resources above should capture a lot. But here are some other major APIs that you may want to use.
- BMC R package https://github.com/ropensci/bmc. This package is not on CRAN, so it needs to be installed from GitHub, for example with devtools::install_github("ropensci/bmc").
I couldn’t easily identify a Python library or gist for BMC. If you know of one please add it to the comments below.
- Rplos package https://github.com/ropensci/rplos
- For Python a gist is available providing examples of the use of the sunburnt library
One of my walkthroughs, now a bit old but still working, for rplos is available here.
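For Python users who would rather not depend on a Solr client library, the PLOS search API can also be queried directly. The sketch below is a minimal illustration using the standard library. It assumes the search endpoint at `https://api.plos.org/search`, Solr-style `q`, `fl`, `rows` and `wt` parameters, and a Solr-style JSON response with `numFound` and `docs` under `response`. The function names and the default field list are my own choices.

```python
import json
import urllib.parse
import urllib.request

PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_url(query, fields=("id", "title_display", "publication_date"), rows=10):
    """Build a search URL asking for a few fields in JSON format."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return PLOS_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_plos(payload):
    """Return the total number of hits and the list of matching records."""
    response = payload["response"]
    return response["numFound"], response["docs"]


def search_plos(query, rows=10):
    """Run the search over the network."""
    with urllib.request.urlopen(build_plos_url(query, rows=rows), timeout=30) as resp:
        return parse_plos(json.load(resp))


# Example (needs network access):
# total, docs = search_plos("Kenya", rows=5)
# print(total, [doc.get("id") for doc in docs])
```

In the PLOS index the `id` field holds the DOI of the article, so the results can be joined with records retrieved from the other services.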
- For R the fulltext package provides access to the texts of bioRxiv which has an RSS feed but does not appear to have an API.
I wasn’t able to spot anything for Python, and maybe it’s a matter of wrangling the RSS feed, so if you know of anything please add a comment.
The Alerts/RSS page provides details of the most recent 30 posts across categories, and there is a Twitter feed by subject that people have tried to do interesting things with by creating Twitter bots.
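Wrangling the RSS feed in Python does not need much. The sketch below is a rough illustration using only the standard library, and `parse_feed` and `local_name` are hypothetical names. It assumes nothing about the feed beyond it being XML with `item` elements containing `title` and `link` children, plus an optional `identifier` child that I am assuming carries the DOI. Namespaces are ignored so that it copes with both RSS 1.0 (RDF) and RSS 2.0 layouts. Take the feed URL for your subject from the Alerts/RSS page.

```python
import urllib.request
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_feed(xml_text):
    """Return title, link and identifier for every item in the feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Map each child's bare tag name to its stripped text content.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        entries.append({
            "title": fields.get("title"),
            "link": fields.get("link"),
            "identifier": fields.get("identifier"),
        })
    return entries


def fetch_feed(feed_url):
    """Download a feed and parse it (feed_url comes from the Alerts/RSS page)."""
    with urllib.request.urlopen(feed_url, timeout=30) as resp:
        return parse_feed(resp.read())
```

Because the feed only covers the most recent posts, a script like this would need to run on a schedule and store what it finds in order to build up a record over time.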
I hope you found this quick list useful. If you know of any other good resources in either R or Python please feel welcome to add a comment.
Chamberlain, Scott. 2017a. Crminer: Fetch ’Scholary’ Full Text from ’Crossref’. https://CRAN.R-project.org/package=crminer.
———. 2017b. Rcoreoa: Client for the Core Api. https://CRAN.R-project.org/package=rcoreoa.
———. 2018a. Fulltext: Full Text of ’Scholarly’ Articles Across Many Data Sources. https://CRAN.R-project.org/package=fulltext.
———. 2018b. Rorcid: Interface to the ’Orcid.org’ ’Api’. https://CRAN.R-project.org/package=rorcid.
Chamberlain, Scott, Carl Boettiger, Ted Hart, and Karthik Ram. 2018. Rcrossref: Client for Various ’Crossref’ ’Apis’. https://github.com/ropensci/rcrossref.
Chamberlain, Scott, Carl Boettiger, and Karthik Ram. 2017. Rplos: Interface to the Search ’Api’ for ’Plos’ Journals. https://CRAN.R-project.org/package=rplos.
Fantini, Damiano. 2018. EasyPubMed: Search and Retrieve Scientific Publication Records from Pubmed. https://CRAN.R-project.org/package=easyPubMed.
Winter, David. 2018. Rentrez: ’Entrez’ in R. https://CRAN.R-project.org/package=rentrez.