Prehistoric: when do authors preprint their papers?

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers].

Previously, I took advantage of a dataset that linked preprints to their published counterparts to look at the fraction of papers in a journal that are preprinted. This linkage can be used to answer other interesting questions, such as: when do authors preprint their papers relative to submission? And does this differ by journal?

There’s a bit of preamble. If you just want to know the answer, skip to “The answer”. If you want to see the code, skip to “The code”.

For each paper, we can extract from PubMed the “received” date and the “accepted” date. Because we have linked published papers to preprints, we also know the date when the preprint of the paper was first posted. Subtracting this date from the received date, we get something we’ll call “pretime”. A positive pretime means the preprint was posted before the journal received the manuscript; a negative pretime means it was posted afterwards.
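As a minimal sketch of the arithmetic, here is the calculation on three invented papers. The column names (`recdate`, `accdate`, `date`) mirror those used in the script further down, but the values are made up for illustration. A preprint was posted after acceptance whenever its pretime is less than minus the received-to-accepted time.

```r
# Toy data: three papers received and accepted on the same dates,
# with preprints posted at different times (values are invented)
papers <- data.frame(
  recdate = as.Date(c("2022-03-01", "2022-03-01", "2022-03-01")),
  accdate = as.Date(c("2022-07-01", "2022-07-01", "2022-07-01")),
  date    = as.Date(c("2021-12-01", "2022-03-03", "2022-05-15"))
)

# pretime: received date minus preprint date, in days
papers$pretime <- as.numeric(papers$recdate - papers$date)
# recacc: days from received to accepted
papers$recacc <- as.numeric(papers$accdate - papers$recdate)
# posted after acceptance if pretime is below -recacc
papers$after_acceptance <- papers$pretime < -papers$recacc

print(papers)
```

Here the first paper was preprinted well before submission, the second two days after the received date, and the third during peer review.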

Now let’s plot the Pretime versus the Received to Accepted time.

In the plots above, we see ~3 years of a paper’s journey to acceptance. Let’s zoom in a bit to look at the first year.

What does this mean? To help interpret the plot, here’s a key:

There are four categories; manuscripts are posted to bioRxiv:

  • Prior to submission
  • Approximately at submission time
  • After submission
  • After acceptance

Note that we are looking at the final journal destination for each paper, which might not be the first place a paper is submitted. It’s likely that papers posted prior to submission, especially those with long pretimes, were submitted elsewhere first, rather than the authors posting their work early to gather feedback before a first submission. All journals have such papers, not just sibling journals like Nature Communications and Cell Reports, which were created to capture papers following rejection from other titles.

The plots indicate that many papers are preprinted at the same time as submission. There are also a surprising number preprinted after submission. Very few preprints are posted after acceptance, for obvious reasons.

To simplify things, we can classify preprints with pretimes of -7 to 30 days as those preprinted at submission. Papers with pretimes below -7 days are post-submission; those with pretimes above 30 days are pre-submission.
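The thresholds can be sketched with base R's `cut()`. The helper name `classify_pretime` is hypothetical, and the script below uses nested `ifelse()` calls instead; this is just one way to express the same three bins for whole-day pretimes.

```r
# Hypothetical helper: bin whole-day pretimes into the three categories.
# cut() uses right-closed intervals by default, so the bins are
# (-Inf, -8] post, (-8, 30] on, (30, Inf] pre, i.e. -7 to 30 is "on"
classify_pretime <- function(pretime) {
  cut(pretime,
      breaks = c(-Inf, -8, 30, Inf),
      labels = c("Post-submission", "On-submission", "Pre-submission"))
}

# invented pretimes, including the boundary values and a missing value
pretime <- c(-75, -8, -7, 0, 30, 31, 90, NA)
status <- classify_pretime(pretime)

print(data.frame(pretime, status))
```

Papers with no preprint date keep a missing status, so they can be filtered out before counting.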


The answer

The analysis shows that generally, most authors preprint their work around the moment of submission.

Let’s look at how these fractions break down at each journal.

The fraction of papers preprinted upon submission is largest at several journals including Biochem J, Development, EMBO J etc. If we consider that many of the pre-submission preprints were posted around the submission time to a preceding journal, then preprinting upon submission is the most likely behaviour.
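For readers who want to see how such per-journal fractions are computed, here is a small base R sketch on invented data. The journal names and counts are placeholders, and the script below does the equivalent with dplyr's `group_by()` and `summarise()`.

```r
# Invented example: six papers across two placeholder journals
toy <- data.frame(
  journal = c("Journal A", "Journal A", "Journal A", "Journal A",
              "Journal B", "Journal B"),
  preprint_status = c("On-submission", "On-submission", "Pre-submission",
                      "Post-submission", "On-submission", "Pre-submission")
)

# count papers per journal and category
counts <- table(toy$journal, toy$preprint_status)
# divide each row by its total so that every journal sums to 1
fractions <- prop.table(counts, margin = 1)

print(round(fractions, 2))
```

Normalising within each journal is what makes journals with very different numbers of preprinted papers comparable.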

Posting after submission is a minority activity, but it is sizeable at some journals, notably Nature Cell Biol and Dev Cell. Possible reasons why authors post only after submission (in some cases many months later) include: a belief that preprinting may cause desk rejection, a preference to preprint only once the paper has gone out to review, or authors getting twitchy about priority during a lengthy peer review process.

We can break down the data by year of publication to see that the patterns are fairly consistent over time.


Any analysis like this is limited by the available data. First, the “received” date on PubMed may not be accurate. A journal may “reset the clock” on a submission and thereby make it appear that the preprint had been posted prior to submission when it may have actually been submitted to the publishing journal at the time of posting.

This analysis is also limited to:

  • papers that were preprinted on bioRxiv
  • papers for which we had complete data (the PubMed data is missing for some journals)
  • a subset of journals – other journal data can be retrieved by tweaking the code

To reiterate, the analysis is limited to papers where the authors actually posted a preprint. At many of the journals analysed here, over half of the authors still choose not to preprint their work!

The code

This R script is quite long and has a few dependencies from my earlier post. Crunching through the XML files and through the bioRxiv DOIs to get the submission dates is sped up using parallel processing (on Mac/Linux).
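As a rough illustration of the parallel step, the sketch below uses `mclapply()` from base R's parallel package rather than the foreach approach in the script. `parse_one` is a hypothetical stand-in for the real XML parser, and the file names are invented.

```r
library(parallel)

# Hypothetical stand-in for the XML parser: one row per input file
parse_one <- function(path) {
  data.frame(file = path, n_chars = nchar(path))
}

files <- c("a.xml", "bb.xml", "ccc.xml")

# forking is not available on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# run parse_one() over the inputs in forked worker processes;
# results come back as a list in the same order as the inputs
rows <- mclapply(files, parse_one, mc.cores = n_cores)

# bind the per-file data frames into one
result <- do.call(rbind, rows)
print(result)
```

Each file is processed independently, which is why this kind of job splits cleanly across cores.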


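## some pre-requisite files required for this script
# packages used below; rbiorxiv is an assumption for the bioRxiv download step
library(dplyr)
library(ggplot2)
library(doParallel) # also attaches foreach and parallel
library(rbiorxiv)
# preprint - paper relationships
df_all <- read.csv("Data/crossref-preprint-article-relationships-Aug-2023.csv")
# code to extract data from Pubmed XML files
# extract_xml() comes from the earlier post and must be loaded first
# previously downloaded Pubmed XML files in the Data directory
xml_files <- list.files("Data", pattern = "\\.xml$", full.names = TRUE)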

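# setup parallel backend
cores <- detectCores()
cl <- makeCluster(max(1, cores - 1)) # leave one core free so as not to overload your computer
registerDoParallel(cl)

# parse each XML file on a worker and row-bind the results
pprs <- foreach(i = seq_along(xml_files), .combine = rbind) %dopar% {
  extract_xml(xml_files[i])
}

# stop cluster
stopCluster(cl)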
# remove duplicates
pprs <- pprs[!duplicated(pprs$pmid), ]

# remove unwanted publication types by using a vector of strings
unwanted <- c("Review", "Comment", "Retracted Publication",
              "Retraction of Publication", "Editorial", "Autobiography",
              "Biography", "Historical", "Published Erratum",
              "Expression of Concern")
# subset pprs to remove unwanted publication types using grepl, call this "pure"
pure <- pprs[!grepl(paste(unwanted, collapse = "|"), pprs$ptype), ]
# ensure that ptype contains "Journal Article"
pure <- pure[grepl("Journal Article", pure$ptype), ]
# remove papers with "NA NA" as the sole author
pure <- pure[!grepl("NA NA", pure$authors), ]

# add factor column to pure that indicates if a row in pprs has a doi that is
# also found in df_all$article_doi
pure$in_crossref <- ifelse(tolower(pure$doi) %in%
                             tolower(df_all$article_doi), "yes", "no")

# lag times
pure$recacc <- pure$accdate - pure$recdate
pure$recpub <- pure$pubdate - pure$recdate

# subset data for only in_crossref == "yes"
pure_yes <- pure[pure$in_crossref == "yes", ]
# add column that has the preprint_doi from df_all where article_doi matches doi
pure_yes$preprint_doi <- df_all$preprint_doi[match(tolower(pure_yes$doi),
                                                   tolower(df_all$article_doi))]
# subset for biorxiv doi, i.e. starts "10.1101"
pure_yes <- pure_yes[grepl("^10\\.1101", pure_yes$preprint_doi), ]

# if the preprint_doi is longer than 15 characters, parse the date from the doi
# and if it is 15 characters or fewer, set to NA
pure_yes$date <- as.Date(ifelse(nchar(pure_yes$preprint_doi) < 16,
                                NA,
                                substr(pure_yes$preprint_doi, 9, 18)),
                         format = "%Y.%m.%d")

# subset pure_yes for date is NA
pure_yes_na <- pure_yes[is.na(pure_yes$date), ]

# get the content of each preprint and assemble into large data frame
# assumption: the download uses biorxiv_content() from the rbiorxiv package
preprints <- foreach(i = seq_len(nrow(pure_yes_na)),
                     .errorhandling = "pass", .multicombine = TRUE) %do% {
                       temp <- NULL
                       temp <- biorxiv_content(doi = pure_yes_na$preprint_doi[i],
                                               format = "df")
                       # subset to only include the doi, authors, title, and date; and first row only
                       if (!is.null(temp)) {
                         temp <- temp[1, c("doi", "authors", "title", "date")]
                       }
                       temp
                     }

# the above code results in a large list of data frames, so we need to combine
# them into one data frame. We didn't use .combine because one or more of the
# preprints may have failed to download and we want to remove them. The failed
# items do not have 4 columns, so we can use ncol to check for this

ncol_preprints <- sapply(preprints, ncol)
# loop from the end of the list and remove the failed items
list_preprints <- preprints
for (i in rev(seq_along(list_preprints))) {
  if (is.null(ncol_preprints[[i]])) {
    list_preprints <- list_preprints[-i]
  }
}

df_preprints <- do.call(rbind, list_preprints)

# add a column to pure_yes_na that has the date from df_preprints
pure_yes_na$date <- df_preprints$date[match(tolower(pure_yes_na$preprint_doi),
                                            tolower(df_preprints$doi))]
# if pure_yes$date is NA, set to pure_yes_na$date
pure_yes_all <- pure_yes
pure_yes_all$date <- ifelse(is.na(pure_yes$date),
                            as.character(pure_yes_na$date[match(pure_yes$preprint_doi,
                                                                pure_yes_na$preprint_doi)]),
                            as.character(pure_yes$date))
# ensure date is as.Date
pure_yes_all$date <- as.Date(pure_yes_all$date, format = "%Y-%m-%d")
# find pretime by subtracting the date from the recdate
pure_yes_all$pretime <- pure_yes_all$recdate - pure_yes_all$date

# pretime versus received-to-accepted time; points below the dashed line
# (slope -1) were preprinted after acceptance
pure_yes_all %>%
  filter(!is.na(pretime)) %>%
  ggplot(aes(x = as.numeric(recacc), y = as.numeric(pretime))) +
  geom_abline(intercept = 0,
              slope = -1, linetype = "dashed", colour = "#a3a3a3") +
  geom_point(colour = "#ae363b", shape = 16, size = 0.5, alpha = 0.2) +
  theme_minimal(9) +
  lims(x = c(0, 1000), y = c(-1000, 1000)) +
  facet_wrap( ~ journal) +
  labs(x = "Received to Accepted (days)", y = "Pretime (days)") +
  theme(legend.position = "none")
# the filename is a placeholder
ggsave("pretime_vs_recacc.png",
       width = 3000, height = 1500, dpi = 300, units = "px", bg = "white")

# same plot, zoomed in to the first year
pure_yes_all %>%
  filter(!is.na(pretime)) %>%
  ggplot(aes(x = as.numeric(recacc), y = as.numeric(pretime))) +
  geom_abline(intercept = 0,
              slope = -1, linetype = "dashed", colour = "#a3a3a3") +
  geom_point(colour = "#ae363b", shape = 16, size = 0.7, alpha = 0.2) +
  theme_minimal(9) +
  lims(x = c(0, 365), y = c(-365, 365)) +
  facet_wrap( ~ journal) +
  labs(x = "Received to Accepted (days)", y = "Pretime (days)") +
  theme(legend.position = "none")
# the filename is a placeholder
ggsave("pretime_vs_recacc_zoom.png",
       width = 3000, height = 1500, dpi = 300, units = "px", bg = "white")

# pure_yes_all contains the data of interest. Let's classify the papers
# into three categories: 1) preprinted on submission, 2) preprinted prior to
# submission, and 3) preprinted after submission
# To classify them, group 1 is pretime of -7 to 30 days, group 2 is pretime
# of 31 days or more, and group 3 is pretime of less than -7 days
# make a factor column to classify the papers
pure_yes_all$preprint_status <- ifelse(pure_yes_all$pretime >= 31, "Pre-submission",
                                       ifelse(pure_yes_all$pretime < -7, "Post-submission",
                                              "On-submission"))
# now summarise the fraction of papers at each journal that are in each category
summary_status <- pure_yes_all %>%
  filter(!is.na(pretime)) %>%
  group_by(journal, preprint_status) %>%
  summarise(papers = n()) %>%
  group_by(journal) %>%
  mutate(fraction = papers / sum(papers))
# order fraction so that post, on, pre submission are in the correct order
summary_status$preprint_status <- factor(summary_status$preprint_status,
                                         levels = c("Pre-submission", "On-submission",
                                                    "Post-submission"))

# make a stacked bar chart to show the fraction of papers in each category
# for each journal
# Pre submission at the top, on submission middle and post submission at the bottom
summary_status %>%
  ggplot(aes(x = journal, y = fraction, fill = preprint_status)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_minimal(9) +
  labs(x = "Journal", y = "Fraction of papers") +
  theme(legend.position = "right",
        legend.title = element_blank()) +
  scale_fill_manual(values = c("#534666", "#138086", "#cd7672"))
# the filename is a placeholder
ggsave("preprint_status_by_journal.png",
       width = 3000, height = 1500, dpi = 300, units = "px", bg = "white")

# let's do the same again but only look at each journal and facet by year
pure_yes_all %>%
  filter(!is.na(pretime)) %>%
  filter(year != "2024") %>%
  group_by(journal, year, preprint_status) %>%
  summarise(papers = n()) %>%
  group_by(journal, year) %>%
  mutate(fraction = papers / sum(papers)) %>%
  ggplot(aes(x = year,
             y = fraction,
             fill = factor(preprint_status,
                           levels = c("Pre-submission", "On-submission",
                                      "Post-submission")))) +
  geom_bar(stat = "identity", position = "stack") +
  theme_minimal(9) +
  labs(x = "Year", y = "Fraction of papers") +
  theme(legend.position = "right",
        legend.title = element_blank()) +
  scale_fill_manual(values = c("#534666", "#138086", "#cd7672")) +
  facet_wrap( ~ journal)
# the filename is a placeholder
ggsave("preprint_status_by_year.png",
       width = 3000, height = 1500, dpi = 300, units = "px", bg = "white")

# generate summary stats for table (all papers with linked preprint)
summary_all <- pure_yes_all %>%
  filter(!is.na(pretime)) %>%
  group_by(preprint_status) %>%
  summarise(papers = n()) %>%
  mutate(fraction = papers / sum(papers))

The post title comes from “Prehistoric” by Circulatory System from their “Circulatory System” LP.
