
[This article was first published on biochemistries, and kindly contributed to R-bloggers].



The National Center for Biotechnology Information (NCBI) is part of the National Institutes of Health’s National Library of Medicine, and is best known for hosting PubMed, the go-to search engine for biomedical literature: every MEDLINE-indexed publication goes up there.

On a separate but related note, one thing I’m constantly looking to do is get DOIs for papers on demand. Most recently I found an R package, knitcitations, that generates bibliographies automatically from DOIs in the text, which worked quite well for a 10-page literature review chock-full of references (I’m a little allergic to Mendeley and other clunky reference managers).

The “Digital Object Identifier”, as the name suggests, uniquely identifies a research paper (and more recently has been co-opted to reference associated datasets). There are lots of interesting and troublesome exceptions, which I’ve mentioned previously, but in the vast majority of cases any paper published in at least the last 10 years or so will have one.

Although NCBI PubMed does a great job of cataloguing biomedical literature, another site, doi.org, provides a consistent gateway to the original source of the paper. You only need to append the DOI to “dx.doi.org/” to generate a working redirection link.
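The redirect rule is simple enough to wrap in a one-line helper. This is a minimal sketch rather than anything from the original scripts: the function name doi_url, the https:// scheme, and the handling of a leading "doi:" label are all assumptions made for illustration, and 10.1000/182 is just a sample DOI.

```bash
#!/usr/bin/env bash
# doi_url (hypothetical helper): build a resolver link by appending a DOI
# to the dx.doi.org prefix mentioned in the post.
doi_url() {
  local doi="$1"
  # Drop an optional leading "doi:" label (an assumed convenience).
  doi="${doi#doi:}"
  # The https:// scheme is an assumption; the post only gives "dx.doi.org/".
  printf 'https://dx.doi.org/%s\n' "$doi"
}

doi_url "10.1000/182"
```

Pasting the printed link into a browser should redirect to wherever the publisher hosts the paper.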

Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, its command-line interface for Unix-like systems (GNU/Linux and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians), with subtle ‘switches’ to tailor the output just as you would from the web service (albeit with a fair portion more of the inner workings on show).

I’ve been in a bit of a programming phase recently, starting to read some linear/integer and dynamic programming texts (alongside this course from UCB), so I was in the mood to get something working with this.

I’ve pieced together a basic pipeline. One function generates citations for knitcitations from files listing basic bibliographic information. The final piece of the puzzle is a custom function (or several) that does its best to find the single article matching a paper’s author, publication year, and title, and so systematically finds DOIs for the entries in such a table.

It’s the sort of thing that’s better shown than described, really. The animation above is the function I finished today in action, sequentially adding a fourth column, the DOI, to a table of file info, all automated through Entrez Direct. It’ll probably be something I can improve upon iteratively, but for now I thought I’d share the code for anyone else starting to play around with the service.

For those not familiar with a development system setup, this code is in the .bashrc file found in the home directory, which contains useful shortcuts, called “aliases”, as well as more elaborate functions which allow all sorts of file, text and program manipulation.

The “Unix philosophy” dictates that thou shalt pipe – connect simple functions together in a mix-and-match manner, so that’s my excuse for what seems like an overcomplicated .bashrc file…

Entrez Direct has nice, concise documentation that will explain this a whole lot better than I can. It’s well worth pointing out that PubMed is just one of the NCBI’s databases. You can also access genetic, inherited disease (OMIM), protein, and other types of data through this very same interface.

One technical note not on the NCBI site: during installation, the setup script added a ‘source’ command to my .bashrc that ‘sourced’ my .bash_profile, which was already in turn ‘sourcing’ my .bashrc, effectively putting every new terminal command prompt in an infinite loop. Watch out for this if your terminals freeze then quit after installation!
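One way to make start-up files immune to this kind of circular sourcing is a guard variable at the top of the file. The sketch below is not part of the Entrez Direct installer or of the original scripts: it builds two throwaway stand-in files in a temporary directory (so your real .bashrc is untouched), and the names RC_LOADED and rc_runs are made up for the demonstration.

```bash
#!/usr/bin/env bash
# Demonstrate a guard against two start-up files sourcing each other.
demo_dir="$(mktemp -d)"

# Stand-in for ~/.bash_profile: it sources the rc file.
cat > "$demo_dir/profile" <<EOF
source "$demo_dir/rc"
EOF

# Stand-in for ~/.bashrc: the guard returns early on a second pass,
# so sourcing the profile from here can no longer loop forever.
cat > "$demo_dir/rc" <<EOF
if [ -n "\${RC_LOADED:-}" ]; then
  return 0
fi
RC_LOADED=1
rc_runs=\$(( \${rc_runs:-0} + 1 ))
source "$demo_dir/profile"
EOF

source "$demo_dir/rc"
echo "rc body ran $rc_runs time(s)"
rm -rf "$demo_dir"
```

The guard variable is deliberately not exported, so a new terminal (a fresh shell process) still loads the file once as normal.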

The scripts below are available here; I’ll update them on the GitHub Gist if I make amendments.

The main functions in the script are AddPubDOI and AddPubTableDOIs (I renamed it from the less descriptive title in the screenshot animation above), the former being executed for every line in the input table within the latter. There’s a weird bug, or programming language feature, lurking who knows where: you can’t use the traditional while read variable; do function(variable); done < inputfile construction to handle a file line by line, so I resorted to cat trickery. I blame Perl.
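A plausible explanation for the misbehaving loop (an assumption, since the post does not diagnose it) is that a command inside the loop body reads standard input itself and swallows the rest of the file; tools built to be piped together tend to do this. The sketch below reproduces the effect with a hypothetical stand-in called greedy rather than a real Entrez Direct command, then shows one common workaround: feed the file to the loop on a separate file descriptor and give the inner command an empty stdin.

```bash
#!/usr/bin/env bash
# greedy is a hypothetical stand-in for any command that reads all of stdin.
greedy() { cat > /dev/null; }

input="$(mktemp)"
printf 'one\ntwo\nthree\n' > "$input"

# Naive loop: greedy eats the remaining lines, so the loop body runs once.
naive_count=0
while read -r line; do
  greedy
  naive_count=$((naive_count + 1))
done < "$input"

# Workaround: read the file on descriptor 3, and hand greedy an empty stdin.
fixed_count=0
while read -r line <&3; do
  greedy < /dev/null
  fixed_count=$((fixed_count + 1))
done 3< "$input"

echo "naive: $naive_count line(s), fixed: $fixed_count line(s)"
rm -f "$input"
```

If this is indeed the cause, adding < /dev/null to the first command inside the loop may be enough on its own, without any cat trickery.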

  • cutf is my shorthand to tell the cut command I want a specific column in a tab-separated file or variable.
  • striptoalpha is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas etc. that might get in the way of text comparison) as a really crude way of checking one name against another. This part of the script could easily be improved, but I was just sorting out one funny case; usually matching author and year and using a loose title match will be sufficient to find the matching PubMed entry, for which a DOI can be found.
  • pubmed chains together: esearch to search PubMed for the query; efetch to get the document (i.e. article) summaries as XML; and xtract to get the basic info. I don’t use this in my little pipeline setup; rather, I kept my options open and chose to get more information, and match within blocks of the XML for the DOI. It’s not so complicated to follow; as well as my code, there’s this example on Biostars.
  • pubmeddocsum just does the first 2 of the steps above, providing full unparsed XML ‘docsums’.
  • pubmedextractdoi gets date and DOI information as columns, then uses GNU awk to rearrange the columns in the output
  • pubmeddoi gives just the DOI column from said rearranged output
  • pubmeddoimulti has ‘multiple’ ways to try and get the DOI for an article matched from searching Pubmed: firstly from the DOI output, then attempting to use the pmid2doi service output.
  • pmid2doimulti does as for pubmeddoimulti but from a provided PMID
  • pmid2doi handles the pmid2doi.org response and pmid2doincbi the Entrez Direct side; both feed into pmid2doimulti.
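To give a flavour of the smaller helpers in the list, here is a rough sketch of what cutf, striptoalpha and pubmed might look like. These are reconstructions from the descriptions above, not the code from the Gist, so treat the bodies as assumptions. The pubmed function uses the esearch, efetch and xtract commands from Entrez Direct with flags as I understand them from its documentation (the choice of Id, PubDate and Title as output fields is mine); it needs Entrez Direct installed plus network access, so it is only defined here, not run.

```bash
#!/usr/bin/env bash
# cutf: select tab-separated column(s); further arguments (e.g. a file
# name) are passed straight through to cut.
cutf() { cut -d $'\t' -f "$@"; }

# striptoalpha: lowercase the text, then delete everything except a-z
# (and the trailing newline), leaving one squished-together string.
striptoalpha() {
  printf '%s\n' "$*" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z\n'
}

# pubmed: search, fetch document summaries, pull out a few basic fields.
# Requires Entrez Direct; defined but not called in this sketch.
pubmed() {
  esearch -db pubmed -query "$*" |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element Id PubDate Title
}

# Offline demonstration of the two text helpers.
printf 'Rolland\t2014\tA Proteome-Scale Map\n' | cutf 2
striptoalpha "A Proteome-Scale Map of the Human Interactome Network"
```

Two differently punctuated versions of the same title reduce to the same string under striptoalpha, which is all the crude title comparison needs.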

Rookie’s disclaimer: I’m aware pipelines are supposed to contain more, um, pipes, but I can’t quite figure out an easy way to make these functions ‘pipe’ to one another, so I’m sticking with passing the output to the next as input ("$@" in bash script).
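One common bash idiom for making a function work both ways is to use "$@" when arguments are given and fall back to reading standard input otherwise. The sketch below uses a made-up function, shout, that simply upper-cases its input; it is an illustration of the idiom under that assumption, not a rewrite of any function in the Gist.

```bash
#!/usr/bin/env bash
# shout (hypothetical example): upper-case the input.
# With arguments it works on them; with none it reads standard input,
# so the same function can sit in the middle of a pipe.
shout() {
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$*"
  else
    cat
  fi | tr '[:lower:]' '[:upper:]'
}

shout "passed as an argument"
printf 'passed through a pipe\n' | shout
```

The same if/else wrapper could go at the top of each function in a chain, letting them be called with arguments or strung together with pipes.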

Update: the second file added to the GitHub gist has the code needed to tie this to a keyboard shortcut (I’m using Alt+Windows Key+P):

E.g. with Rolland et al. (2014) A Proteome-Scale Map of the Human Interactome Network:

Update 2: the keybinding now lets you hit the spacebar to open the article in a web browser, and I rejigged the main function to write to a file (see the comments in the code). The end of the GitHub gist also has some functions I use to copy a DOI from one of the references in the table file to the clipboard, including one to cite it for knitcitations in R Markdown, as mentioned above.

