An Analysis of Contributions to PubMed Commons

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers.]

I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.

The tl;dr version: here’s the GitHub repository and the RPubs report.

For further details and some charts, read on.

Currently, there is no access to PubMed Commons data via the NCBI Entrez API aside from a PubMed search filter to return articles that have comments. However, a Google search for “pubmed commons api” returns this useful Gist. It shows how to construct a URL which returns JSON-formatted PubMed Commons data for a given PMID. If Alf is reading this, I’d like to know how he discovered this information gem!

Armed with this I was able to write Ruby code to return all PMIDs with comments, fetch the comment data, parse it and output a summary to a CSV file. I used to be an XPath guy. This experience changed me into a CSS selector guy.
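The last step of that pipeline, writing parsed comments to a CSV file, can be sketched roughly as follows. This is not the code from the repository: the field names (pmid, author, date, text, up_votes, down_votes) and the sample records are assumptions made up for illustration, and only Ruby's standard csv library is used.

```ruby
require "csv"

# Column order for the summary file. These field names are hypothetical.
FIELDS = %w[pmid author date text up_votes down_votes].freeze

# Turn an array of comment hashes (string keys) into CSV text.
# CSV.generate takes care of quoting commas and quotes inside the comment text.
def comments_to_csv(comments)
  CSV.generate do |csv|
    csv << FIELDS
    comments.each { |c| csv << c.values_at(*FIELDS) }
  end
end

# Made-up records standing in for parsed comment data.
sample = [
  { "pmid" => "111", "author" => "A. Author", "date" => "2016-08-01",
    "text" => "Useful dataset.", "up_votes" => 1, "down_votes" => 0 },
  { "pmid" => "222", "author" => "B. Writer", "date" => "2016-08-02",
    "text" => "See also, the erratum.", "up_votes" => 0, "down_votes" => 0 }
]

puts comments_to_csv(sample)
```

Building the text in memory keeps the sketch easy to test; File.write("comments.csv", comments_to_csv(sample)) would save it to disk (the file name is also just an example).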

Analysis and visualisation can then be performed using this RMarkdown file. Here are some of the highlights; the RPubs report contains the complete analysis.

At the time of writing 5 877 “real” comments have been written, for 4 703 articles, authored by 1 504 people. By “real comments”, I mean those with an author name and comment text. This excludes automatically-generated notes and moderated comments (more on those later).
According to the PubMed Commons blog, the service has over 10 500 members, so roughly 14% of members (1 504 of 10 500) have written a comment, which is about what we’d expect from other forums. The fraction of articles with comments is obviously very small, about 0.017%, given that there are now close to 27 000 000 PubMed articles.
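For anyone who wants to reproduce these headline numbers from the CSV data, here is a minimal sketch of the “real comment” filter and the two ratios. The hash keys (pmid, author, text) are assumed names rather than the actual column names, and the sample records are invented.

```ruby
# A "real" comment has both an author name and comment text.
def real_comment?(c)
  !c["author"].to_s.strip.empty? && !c["text"].to_s.strip.empty?
end

# Count real comments, distinct articles and distinct authors.
def summarise(comments)
  real = comments.select { |c| real_comment?(c) }
  {
    comments: real.size,
    articles: real.map { |c| c["pmid"] }.uniq.size,
    authors: real.map { |c| c["author"] }.uniq.size
  }
end

# Percentage of part in whole, rounded to the given number of decimal places.
def percent(part, whole, digits = 1)
  (100.0 * part / whole).round(digits)
end

# Invented records: the last two are not "real" (no author, blank text).
sample = [
  { "pmid" => "111", "author" => "A. Author", "text" => "First comment" },
  { "pmid" => "111", "author" => "B. Writer", "text" => "A reply" },
  { "pmid" => "222", "author" => nil, "text" => "Automatically generated note" },
  { "pmid" => "333", "author" => "C. Reader", "text" => "   " }
]

p summarise(sample)
puts percent(1_504, 10_500)          # share of members who have commented
puts percent(4_703, 27_000_000, 3)   # share of PubMed articles with a comment
```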

The chart of comments by month shows the closed trial period, the opening (October 2013) and some peaks in activity around the end of 2014 and in August 2016. The peaks often correspond to an individual annotating many articles in one sitting with a short comment, as in this example.
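The counts behind a chart like this are a simple group-by on the comment date. The charts themselves come from the RMarkdown file; the following is only a rough Ruby sketch of the counting step, assuming each comment has a date field in ISO format (YYYY-MM-DD), which is my assumption about the data rather than a documented format.

```ruby
require "date"

# Count comments per calendar month, keyed by "YYYY-MM" in chronological order.
def comments_by_month(comments)
  comments
    .map { |c| Date.iso8601(c["date"]).strftime("%Y-%m") }
    .tally
    .sort
    .to_h
end

# The month with the most comments, as a [month, count] pair.
def peak_month(counts)
  counts.max_by { |_month, n| n }
end

# Invented dates for illustration.
sample = [
  { "date" => "2013-10-22" },
  { "date" => "2016-08-03" },
  { "date" => "2016-08-19" }
]

counts = comments_by_month(sample)
p counts
p peak_month(counts)
```

Enumerable#tally needs Ruby 2.7 or later.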

More recent articles get more comments. What’s more, it seems that this trend is shifting year on year: that is, comments posted each year tend to be on articles published more recently. I think what’s happening here is that users tend to comment on articles as they are published, which is interesting.
By contrast, the oldest article with a comment currently comes from 1945.
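One way to check a trend like this is to compare, for each year in which comments were posted, the median publication year of the articles commented on. The sketch below assumes two hypothetical integer fields, comment_year and pub_year; it is a simple illustration of the idea, not the analysis used for the chart.

```ruby
# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median article publication year for each year in which comments were posted.
def median_pub_year_by_comment_year(comments)
  comments
    .group_by { |c| c["comment_year"] }
    .transform_values { |cs| median(cs.map { |c| c["pub_year"] }) }
    .sort
    .to_h
end

# Invented records for illustration.
sample = [
  { "comment_year" => 2014, "pub_year" => 2010 },
  { "comment_year" => 2014, "pub_year" => 2013 },
  { "comment_year" => 2014, "pub_year" => 2014 },
  { "comment_year" => 2016, "pub_year" => 2015 },
  { "comment_year" => 2016, "pub_year" => 2016 }
]

p median_pub_year_by_comment_year(sample)
```

If the gap between comment year and median publication year shrinks over time, that would be consistent with users commenting on articles as they are published.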

PubMed Commons has a system to rate comments (up/down vote) and around 44% of them have received at least one vote. The most frequent response is that one user finds the comment useful and gives it one up vote. There are a couple of interesting outliers with many more up votes than most comments. These correspond to comments made on the rather-infamous “heroes of CRISPR” review.
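Both vote statistics are easy to derive once each comment carries its up and down vote counts. A minimal sketch, assuming integer fields named up_votes and down_votes (hypothetical names) and using invented data:

```ruby
# Percentage of comments that have received at least one vote.
def voted_share(comments)
  return 0.0 if comments.empty?

  voted = comments.count { |c| c["up_votes"] + c["down_votes"] > 0 }
  (100.0 * voted / comments.size).round(1)
end

# Most common [up, down] pattern among comments with at least one vote,
# or nil when no comment has any votes.
def most_common_vote_pattern(comments)
  patterns = comments
             .map { |c| [c["up_votes"], c["down_votes"]] }
             .reject { |up, down| (up + down).zero? }
             .tally
  top = patterns.max_by { |_pattern, n| n }
  top && top.first
end

# Invented vote counts for illustration.
sample = [
  { "up_votes" => 1, "down_votes" => 0 },
  { "up_votes" => 1, "down_votes" => 0 },
  { "up_votes" => 3, "down_votes" => 1 },
  { "up_votes" => 0, "down_votes" => 0 },
  { "up_votes" => 0, "down_votes" => 0 }
]

puts voted_share(sample)
p most_common_vote_pattern(sample)
```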

Last one. An important aspect of any online forum is moderation and it’s possible to extract this information from the HTML (class="not_appr"). To date, moderators have removed 66 comments and users have deleted 110 (one presumes either after some thought or some prompting). That is 176 in total, equivalent to about 3% of the 5 877 “real” comments, which I’d suggest is a very small proportion in comparison to other forums.

Included in the RPubs report, but not here, are some density plots to show distributions of comments per article and comments per author. As you might expect, what’s observed most frequently is one comment (per article or author), followed by a “long tail”. You may be interested in the article with the most comments. Currently it’s an editorial titled “When Is Science Ultimately Unreliable?”: you can decide for yourself why it is creating debate. You may also be interested in the most prolific comment authors; I was not, so that’s left as an exercise for those interested – the CSV file is available.
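For those who do take up the exercise, here is one possible starting point. It counts comments per key (article or author), summarises how many keys have one comment, two comments and so on, and picks out the key with the most. The field names pmid and author are assumptions about the CSV columns, and the data is invented.

```ruby
# Map "number of comments" => "number of keys with that many comments",
# where key is a field such as "pmid" or "author".
def comment_count_distribution(comments, key)
  comments.map { |c| c[key] }.tally.values.tally.sort.to_h
end

# The key value with the most comments, as a [value, count] pair.
def most_commented(comments, key)
  comments.map { |c| c[key] }.tally.max_by { |_value, n| n }
end

# Invented records for illustration.
sample = [
  { "pmid" => "111", "author" => "A. Author" },
  { "pmid" => "111", "author" => "B. Writer" },
  { "pmid" => "111", "author" => "A. Author" },
  { "pmid" => "222", "author" => "A. Author" },
  { "pmid" => "333", "author" => "C. Reader" }
]

p comment_count_distribution(sample, "pmid")
p most_commented(sample, "pmid")
p most_commented(sample, "author")
```

With real data, the distribution would be expected to show the “long tail” described above: many keys with a single comment and a few with many.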

In my opinion, PubMed Commons is a valuable and reasonably-successful service. It’s obviously something of a “niche” online forum and is never going to set the world alight. However, monthly activity has remained relatively consistent, with more activity in 2016 compared with 2015. Users seem to find many of the comments valuable and community standards are high. It’s interesting that a lot of discussion is around articles as they are published. This is good, but I think we also need maintenance annotation of older articles to point out issues such as broken URLs.

All it needs now is more active users, more comments per user and a real API.

Filed under: publications, R, statistics Tagged: comments, ncbi, pubmed commons
