Ten Years vs The Spread: Calculating publication lag times in R

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There have been several posts on this site about publication lag times. You can read them here. Lag times are the delays in the dissemination of scientific data introduced by the process of publishing the paper in a journal. Nowadays, your paper can be online in a few hours using a preprint server. However, this work is not peer reviewed. Journals organise a formal peer review and provide some sort of certification of the work. They typeset the work and all of this adds delays the dissemination of work in a journal.

To look at publication delays, you can use PubMed data, which is incomplete but can give insight into how long these delays can be. Previous posts have involved the use of a ruby script to make a csv file from PubMed XML output and then use this in Igor to calculate the publication lag times. There is another method detailed in this excellent post by Daniel Himmelstein.

I recently posted a figure for Nature Communications lag times on Twitter and was asked to generate others. I figured that I should write an R script and people can make their own!

The PubMedLagR code is available here with instructions for use.

A query for Nature Communications data at PubMed, such as:

nat commun[ta] AND 2000 : 2018[pdat] AND journal article[pt]

Retrieves all paper for this journal. The range from 2010 to 2018 is for illustration, this journal has only been in operation for these years. Filtering for journal articles rather and attempting to get rid of reviews and front matter is wise, but doesn’t always work. Again this journal doesn’t carry this material so this is for illustration. Getting your query right is very important.

Save the results in XML format and then run the R script as directed. This should give a csv of the data and a png of the lag times.

This is data from Nature Communications. Colleagues had two separate papers accepted at this journal and experienced long delays. I was interested to see if papers were generally taking longer to publish here. Of course we do not know why. Delays are partly the fault of the authors, the reviewers and the journal and it is not possible to say why publication lag times are increasing for this journal year-on-year. The journal has grown in terms of number of papers published, has this introduced inefficiencies? Are reviewers being slow to review? Are they being more demanding? Are Editors not marshalling the referee reports and providing clear guidance to authors? Allowing too much time and too many rounds of revision? Are authors being too slow to do further experimental work? The answer will be yes to some of these questions for some of the papers.

This is not to focus on Nature Communications, it’s one of a few journals that many colleagues complain is too slow to publish their work. With this code you can have a look at the journal you are interested in submitting to and consider whether there is a more rapid venue for your work.

Update:

I changed the code slightly and prettified the plots just a little. Below are some plots for Nature Cell Biology, Nature Neuroscience. I also did a search for clathrin or CRISPR papers over the same time period. These keyword searches are fairly flat, whereas the journal-specific increase in publication lag time can be seen.

The lag times at Nature Neuroscience look artificially low and then seem to have jumped up in 2016 to be something similar to Nature Cell Biology or Nature Communications.

Edit

I neglected to point out that the code truncates the y-axis in the bottom right plot to 1000 days or the maximum lag time, whichever is smaller. This is because it gets difficult to see the data points if there is an outlier, which might be due to an error in PubMed data.

A reader commented on Twitter that some poor paper had a lag time almost 1000 days. Well, due to the y-axis truncation we don’t see that 9 papers in Nature Communications since 2010 have lag times (RecAcc) of > 1000 days. The record holder has a lag time of 1561 days! I checked that this was not a PubMed error by looking at the dates on the paper.

Notes

Date information is not available in PubMed for every paper unfortunately. This is especially true of older papers.

The date information is supplied to PubMed from the journal. These dates are not necessarily accurate: 1) you can see occasional errors in the data, 2) journals sometimes “reset the clock” on papers and treat resubmissions as new submissions.

The post title is taken from “10 Years vs The Spread” by Wing-Tipped Sloat from the LP Chewyfoot. Obviously the song has nothing to do with smoothed kernel density estimates of journal publication lag times, but the title was incredibly apt.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)