[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.

So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”

Let’s get it out of the way then.

1. Find a list of bioinformatics journals

Here’s one, at the Bioinformatics.org wiki. It includes a metric named Article Influence that we can use to sort the journals. Let’s be completely arbitrary and take the top 20.

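getJournalTitles <- function() {
  require(XML)
  # readHTMLTable() returns one data frame per HTML table on the page
  journals <- readHTMLTable("http://www.bioinformatics.org/wiki/Journals", stringsAsFactors = FALSE)
  # the second table lists the journals; column 2 is used as the Article Influence score
  journals <- journals[[2]]
  journals[, 2] <- as.numeric(journals[, 2])
  journals <- journals[order(journals[, 2], decreasing = TRUE), ]
  # keep the top 20 titles (assumes the journal title is in column 1)
  titles <- journals[1:20, 1]
  return(titles)
}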

titles <- getJournalTitles()


2. Search PubMed for those journals

Next, we search PubMed for those journal titles and download records in PubMed XML format. During this process I learned that (1) the ampersand in Molecular & Cellular Proteomics should be replaced by “and”, (2) Proteins: Structure, Function, and Bioinformatics should be renamed “Proteins” and (3) IEEE Transactions on Evolutionary Computation is apparently not indexed by PubMed.

getJournalXML <- function(title) {
  require(rentrez)
  require(XML)
  # written against the rentrez API of the time (usehistory / WebEnv / QueryKey);
  # argument names may differ in newer rentrez releases
  term <- paste(title, "[JOUR]", sep = "")
  e <- entrez_search("pubmed", term, usehistory = "y")
  f <- entrez_fetch("pubmed", WebEnv = e$WebEnv, query_key = e$QueryKey,
                    rettype = "xml", retmax = e$count)
  d <- xmlTreeParse(f, useInternalNodes = TRUE)
  outfile <- paste(gsub(" ", "_", title), "xml", sep = ".")
  saveXML(xmlRoot(d), outfile)
}

titles[6] <- gsub("&", "and", titles[6])
titles[11] <- "Proteins"

# saves XML files in current working directory
sapply(titles, function(x) getJournalXML(x))

3. Parse for publication dates

Yes, submission to publication time includes time for revision(s). However, submission to initial decision times are not readily available (certainly not from PubMed), and acceptance to publication times mean nothing in the age of “ahead of print”, so the first of these, measured here as received to accepted, is what we use.

I haven’t figured out how to make the R/XML xpathSApply() function return empty values where nodes don’t exist, so I went for Ruby/Nokogiri, which does that by default. Cue extraordinarily ugly code:

#!/usr/bin/ruby
require 'nokogiri'

f = File.open(ARGV.first)
doc = Nokogiri::XML(f)
f.close

# one CSV row per article: journal, PMID, received y/m/d, accepted y/m/d
doc.xpath("//PubmedArticle").each do |a|
  r = ["", "", "", "", "", "", "", ""]
  r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
  r[1] = a.xpath("MedlineCitation/PMID").text
  r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
  r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
  r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
  r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
  r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
  r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
  puts r.join(",")
end

If you save that as pubmedXML2CSV.rb, you can run it on all the XML files in the current directory using:

find . -name "*.xml" -exec ruby pubmedXML2CSV.rb {} \; > bioinfjournals.csv

4. Analyse data

Figure: Time from received to accepted for selected journals according to PubMed

It’s reasonably plain sailing from here onwards.
We read the CSV file into R. Not all records have received or accepted dates and, for those that do, the formatting is messy. Months, for example, are variously represented as January, Jan or 1. It would be nice to have a function that could make a Date object from anything resembling “year-month-day” and happily, the R/lubridate package provides ymd() to do just that. You’ll also find articles submitted from the future (the year 2919, for example), so best to remove records where submission apparently happened after acceptance. Perhaps the most exciting thing about the following code is that I’m not alone in wanting to sort boxplots by median and found a solution to do so.

plotJournalTimes <- function(csvfile) {
  require(lubridate)
  require(ggplot2)
  journals <- read.csv(csvfile, header = FALSE, stringsAsFactors = FALSE)
  colnames(journals) <- c("title", "pmid", "rec.year", "rec.month", "rec.day", "acc.year", "acc.month", "acc.day")
  journals$received <- ymd(paste(journals$rec.year, journals$rec.month, journals$rec.day, sep = "-"))
  journals$accepted <- ymd(paste(journals$acc.year, journals$acc.month, journals$acc.day, sep = "-"))
  # request days explicitly instead of relying on whatever unit the subtraction picks
  journals$diff <- as.numeric(difftime(journals$accepted, journals$received, units = "days"))
  # diff > 0 drops records with missing dates or acceptance before submission;
  # reorder() sorts the journals by median time
  ggplot(subset(journals, diff > 0), aes(reorder(title, diff, median), diff)) +
    geom_boxplot(fill = "wheat2") + theme_bw() + coord_flip() +
    ylab("accepted - received (days)") + xlab("journal")
}

plotJournalTimes("bioinfjournals.csv")
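To see the date handling in isolation, here is a minimal sketch on made-up values (none of these dates come from the PubMed data). It assumes lubridate is installed and an English locale for month names, and shows ymd() coping with the three month styles, unparseable values turning into NA, and the "submitted from the future" case being filtered out.

```r
library(lubridate)

# Toy dates: one day written three ways, plus two values that cannot be parsed
raw <- c("2012-January-5", "2012-Jan-5", "2012-1-5", "2012-13-45", "--")
parsed <- ymd(raw, quiet = TRUE)  # quiet = TRUE: failures become NA without a warning

# Toy received/accepted pairs, including a submission "from the future"
received <- ymd(c("2012-Jan-5", "2919-3-1", "--"), quiet = TRUE)
accepted <- ymd(c("2012-March-20", "2013-6-1", "2013-1-1"), quiet = TRUE)

# Difference in days, with the unit stated explicitly
days <- as.numeric(difftime(accepted, received, units = "days"))

# Keep only records where both dates parsed and acceptance follows submission
keep <- !is.na(days) & days > 0
print(data.frame(received, accepted, days, keep))
```

Only the first toy record survives the filter; the other two are dropped for the same reasons real records are dropped in the plot.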

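As an aside on step 3: one possible way to stay in R is to evaluate the XPath once per article node and substitute NA when nothing matches. This is only a sketch, not what was used for the analysis above. The helper name firstOrNA and the two-article XML snippet are invented for illustration; the XML package calls (xmlParse, getNodeSet, xpathSApply, xmlValue) are the standard ones.

```r
library(XML)

# Tiny stand-in for a PubMed XML download: the second article has no "received" date
xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>1</PMID></MedlineCitation>
    <PubmedData><History>
      <PubMedPubDate PubStatus="received"><Year>2012</Year><Month>1</Month><Day>5</Day></PubMedPubDate>
    </History></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>2</PMID></MedlineCitation>
    <PubmedData><History/></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml, asText = TRUE)

# Hypothetical helper: first match of a relative XPath under one node, or NA if none
firstOrNA <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One entry per article, so rows stay aligned even when a node is missing
articles <- getNodeSet(doc, "//PubmedArticle")
pmid     <- sapply(articles, firstOrNA, path = "MedlineCitation/PMID")
rec.year <- sapply(articles, firstOrNA,
                   path = "PubmedData/History/PubMedPubDate[@PubStatus='received']/Year")
print(data.frame(pmid, rec.year))
```

The point of looping over article nodes is alignment: a single document-wide xpathSApply() call simply skips articles without the node, so the resulting vectors no longer line up with the PMIDs.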

So there you have it. No data for one of the top 20 journals. No accepted and/or received date for 9 of the others. Of the 10 remaining, only about 48% of the 64 759 records include dates that can be parsed. Of those, at least one and probably more are rather dubious. Very short times are as likely to be outliers (erroneous) as very long times.
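Coverage figures like these can be computed along the following lines. This is a sketch on an invented six-row data frame (the journal names and dates are placeholders, not the real data), assuming the received and accepted columns hold NA wherever a date was missing or could not be parsed.

```r
# Toy stand-in for the parsed data; NA marks a missing or unparseable date
journals <- data.frame(
  title    = c("A", "A", "A", "B", "B", "C"),
  received = as.Date(c("2012-01-05", NA, "2012-02-01", NA, NA, "2012-03-01")),
  accepted = as.Date(c("2012-03-20", "2012-05-01", NA, NA, "2012-06-01", "2012-04-15"))
)

# A record is usable only if both dates are present
usable <- !is.na(journals$received) & !is.na(journals$accepted)

# Overall fraction of usable records, then the same per journal
overall    <- mean(usable)
perJournal <- tapply(usable, journals$title, mean)
print(overall)
print(perJournal)
```

A journal whose fraction is zero is one of those with no usable received and/or accepted dates at all.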

If you still care by this point: Mammalian Genome is the winner with a median time to acceptance of 80 days, going up to 175.5 days for Journal of Computational Neuroscience. 11 weeks still seems like a long time to me, even if you believe the numbers. Which you probably should not.
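The ranking is nothing more than per-journal medians, and the same statistic drives the boxplot ordering via reorder(). A small base R sketch with invented journal names and times (not the real results) makes the mechanics explicit.

```r
# Toy times to acceptance in days; names and values are made up
journals <- data.frame(
  title = c("A slow journal", "A slow journal", "A slow journal",
            "B fast journal", "B fast journal", "B fast journal"),
  diff  = c(150, 175, 400, 60, 80, 100)
)

# Median time per journal, fastest first
med <- sort(tapply(journals$diff, journals$title, median))
print(med)

# reorder() returns a factor whose levels follow the same medians,
# which is what puts the boxplots in median order
ordered.titles <- reorder(journals$title, journals$diff, FUN = median)
print(levels(ordered.titles))
```

Using the median rather than the mean keeps a single 400-day outlier from moving a journal down the ranking, which matters given how dubious some of the dates are.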
