The National Center for Biotechnology Information (NCBI) is part of the National Institutes of Health’s National Library of Medicine, and is best known for hosting Pubmed, the go-to search engine for biomedical literature – every (Medline-indexed) publication goes up there.
On a separate but related note, one thing I’m constantly looking to do is get DOIs for papers on demand. Most recently I found a package for R, knitcitations, that generates bibliographies automatically from DOIs in the text, which worked quite well for a 10 page literature review chock full of references (I’m a little allergic to Mendeley and other clunky reference managers).
The “Digital Object Identifier”, as the name suggests, uniquely identifies a research paper (and recently it’s being co-opted to reference associated datasets). There’re lots of interesting and troublesome exceptions which I’ve mentioned previously, but in the vast majority of cases any paper published in at least the last 10 years or so will have one.
Although NCBI Pubmed does a great job of cataloguing biomedical literature, another site, doi.org provides a consistent gateway to the original source of the paper. You only need to append the DOI to “dx.doi.org/” to generate a working redirection link.
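As a minimal sketch of that redirection trick: the function name doi_link is made up for illustration, the https:// scheme is my assumption (the text above only gives the host), and 10.1000/182 is just a placeholder DOI.

```shell
# Hypothetical helper: turn a DOI into a resolver link by appending it to dx.doi.org/
doi_link() {
    printf 'https://dx.doi.org/%s\n' "$1"
}

# 10.1000/182 is used purely as a placeholder DOI
doi_link "10.1000/182"
```

Opening the printed link in a browser should then redirect to wherever the publisher hosts the paper.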
Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command line interface for Unix computers (GNU/Linux, and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians) encoding subtle ‘switches’ to tailor the output just as you would from the web service (albeit with a fair portion more of the inner workings on show).
I’ve been in a bit of a programming phase recently, starting to read some linear/integer and dynamic programming texts (alongside this course from UCB) so was in the mood to get something working with this.
I’ve pieced together a basic pipeline. It has a function to generate citations for knitcitations from files listing basic bibliographic information and, as the final piece of the puzzle, a custom function (or several) that does its best to find a single unique article matching the author, publication year, and title of a paper, so as to find DOIs for the entries in such a table.
It’s the sort of thing that’s better shown than described really. The animation above is the function I finished today in action, sequentially adding a fourth column to a table of file info with DOI, all automated through Entrez Direct. It’ll probably be something I can improve upon iteratively, but for now I thought I’d share the code for anyone else starting to play around with the service.
For those not familiar with a development system setup, this code is in the .bashrc file found in the home directory, which contains useful shortcuts, called “aliases”, as well as more elaborate functions which allow all sorts of file, text and program manipulation.
The “Unix philosophy” dictates that thou shalt pipe – connect simple functions together in a mix-and-match manner, so that’s my excuse for what seems like an overcomplicated .bashrc file…
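To make the "small functions joined by pipes" idea concrete, here is a tiny sketch of the sort of thing that lives in a .bashrc. The name tabcol is hypothetical and is not one of the functions from my script; it relies on cut splitting on tabs by default.

```shell
# Hypothetical function: print one column of tab-separated input read from stdin
tabcol() {
    cut -f "$1"    # cut uses the tab character as its default delimiter
}

# Pipe a line of author, year and title through it to pick out the year
printf 'Rolland\t2014\tA Proteome-Scale Map\n' | tabcol 2
```

Because it reads standard input and writes standard output, a function like this can be dropped between any two other commands in a pipe.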
Entrez Direct has nice, concise documentation that will explain this a whole lot better than I can. It’s well worth pointing out that Pubmed is just one of the NCBI’s libraries. You can also access genetic, Mendelian disorder (OMIM), protein, and other types of data through this very same interface.
One technical note not on the NCBI site: when installing, the setup script added a “source .bashrc” command to my .bashrc, ‘sourcing’ my .bash_profile, which was already in turn ‘sourcing’ my .bashrc, effectively putting every new terminal command prompt in an infinite loop – watch out for this if your terminals freeze then quit after installation!
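One way to make two startup files safe to source from each other is a guard variable. This is only a sketch of the general idea, not what the installer does: the two temporary files stand in for .bashrc and .bash_profile, and the variable name RC_ALREADY_SOURCED is made up.

```shell
# Build two throwaway files that source each other, like the situation above
demo_dir=$(mktemp -d)

cat > "$demo_dir/profile" <<EOF
# Stand-in for ~/.bash_profile
profile_loads=\$((profile_loads + 1))
source "$demo_dir/rc"
EOF

cat > "$demo_dir/rc" <<EOF
# Stand-in for ~/.bashrc
if [ -z "\$RC_ALREADY_SOURCED" ]; then
    RC_ALREADY_SOURCED=1    # set the guard before sourcing anything else
    rc_loads=\$((rc_loads + 1))
    source "$demo_dir/profile"
fi
EOF

profile_loads=0
rc_loads=0
unset RC_ALREADY_SOURCED
source "$demo_dir/rc"    # without the guard this would never return
echo "rc loaded $rc_loads time(s), profile loaded $profile_loads time(s)"
rm -rf "$demo_dir"
```

The simpler fix, of course, is just to delete the offending line so that only one of the two files sources the other.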
The scripts below are available here; I’ll update them on the GitHub Gist if I make amendments.
The main functions in the script are AddPubTableDOIs (I renamed it from the less descriptive title in the screenshot animation above) and a companion function that it executes for every line in the input table. Weird bug/programming language feature, who knows where – you can’t use the traditional while read variable; done < inputfile construction to handle a file line by line, so I resorted to cat trickery. I blame Perl.
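One plausible explanation, which I haven’t confirmed for this script, is that a command inside the loop reads standard input and so swallows the rest of the file that the while loop was meant to read. The sketch below reproduces that effect with a made-up stand-in function (greedy) rather than the real Entrez Direct tools, and shows the usual workaround of pointing the inner command’s stdin at /dev/null.

```shell
# Stand-in for any command that reads stdin (hypothetical, not an Entrez Direct tool)
greedy() { cat > /dev/null; echo "processed: $1"; }

input=$(mktemp)
printf 'first\nsecond\nthird\n' > "$input"

# Naive loop: greedy swallows the remaining lines, so the loop stops after one pass
naive_count=0
while IFS= read -r line; do
    greedy "$line"
    naive_count=$((naive_count + 1))
done < "$input"

# Guarded loop: the inner command gets its own, empty stdin
guarded_count=0
while IFS= read -r line; do
    greedy "$line" < /dev/null
    guarded_count=$((guarded_count + 1))
done < "$input"

echo "naive: $naive_count line(s), guarded: $guarded_count line(s)"
rm -f "$input"
```

If that is indeed the cause, the same < /dev/null redirection on the first command of the inner pipeline would let the while read construction work as usual.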
cutf is my shorthand to tell the cut command I want a specific column in a tab-separated file or variable.
striptoalpha is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas etc. that might get in the way of text comparison), a really crude way of checking one name against another. This part of the script could easily be improved, but I was just sorting out one funny case - usually matching author and year and using a loose title match will be sufficient to find the matching Pubmed entry, for which a DOI can be found.
esearch to search pubmed for the query;
efetch to get the document (i.e. article) summaries as XML; and
xtract to get the basic info. I don’t use this in my little pipeline setup, rather I kept my options open and chose to get more information, and match within blocks of the XML for the DOI. It’s not so complicated to follow: as well as my code, there’s this example on Biostars.
pubmeddocsum just does the first 2 of the steps above, providing full unparsed XML ‘docsums’
pubmedextractdoi gets date and DOI information as columns, then uses GNU awk to rearrange the columns in the output
pubmeddoi gives just the DOI column from said rearranged output
pubmeddoimulti has ‘multiple’ ways to try and get the DOI for an article matched from searching Pubmed: firstly from the DOI output, then attempting to use the pmid2doi service output.
pmid2doimulti does as for pubmeddoimulti but from a provided PMID
pmid2doi handles the pmid2doi.org response,
pmid2doincbi the Entrez Direct side, both feed into
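For anyone who wants the three-step esearch, efetch, xtract version listed above, here is a rough sketch. It is not one of the functions from my script (which matches within the XML rather than using xtract): the function name is made up, and the element names (DocumentSummary, ArticleId, IdType, Value) are my assumption about the layout of the Pubmed docsum XML, so check them against real output before relying on this.

```shell
# Hypothetical helper: print PMID and DOI (tab-separated) for each Pubmed hit
pubmed_doi_sketch() {
    # $1 is a Pubmed query string, e.g. 'Rolland[Author] AND 2014[dp]'
    esearch -db pubmed -query "$1" |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element Id \
               -block ArticleId -if IdType -equals doi -element Value
}
```

Something like pubmed_doi_sketch 'Rolland[Author] AND 2014[dp]' would then list candidate articles, and it’s still up to you to pick the right one by title.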
Rookie’s disclaimer: I’m aware pipelines are supposed to contain more um, pipes, but I can’t quite figure out an easy way to make these functions ‘pipe’ to one another, so I’m sticking with passing the output to the next as input ("$@" in bash script).
Update: the second file added to the GitHub gist has the code needed to tie this to a keyboard shortcut (I’m using Alt+Windows Key+P) :
E.g. with Rolland et al. (2014) A Proteome-Scale Map of the Human Interactome Network:
Update 2: the keybinding now lets you hit the spacebar to open the article in a web browser, and I rejigged the main function to write to a file, see comments in code. The end of the GitHub gist also has some functions I use to copy a DOI from one of the references in the table file created to the clipboard, including one to cite it for knitcitations Rmarkdown as mentioned above.
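A footnote on the rookie’s disclaimer about pipes: one common pattern is to let a function use its arguments when it gets any and fall back to reading standard input otherwise. The sketch below applies that to a crude title normaliser in the spirit of striptoalpha; the name normalise_title is made up, and this is not the code from the gist.

```shell
# Hypothetical function: lowercase a title and drop everything except the letters a-z.
# It uses its arguments if there are any, otherwise it reads from a pipe.
normalise_title() {
    if [ "$#" -gt 0 ]; then
        printf '%s\n' "$*"
    else
        cat
    fi | tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -cd 'a-z'
    echo    # tr -cd also strips the newline, so add one back
}

normalise_title "A Proteome-Scale Map of the Human Interactome Network"
echo "A Proteome-Scale Map of the Human Interactome Network" | normalise_title
```

Both calls print the same squished string, so the function can sit in the middle of a pipe or be called the way I’ve been doing it with "$@".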