Over and Over: Preprint revisions on bioRxiv


The aim of this post is to look at revisions of bioRxiv preprints. I’m interested in how long preprint versions exist on bioRxiv. In other words: how long do revisions to preprints take?

The data from bioRxiv is a complex dataset with many caveats as I’ll explain further down, but some interesting details do emerge. Consider this a sketch of the dataset rather than an in-depth analysis. I’ll walk you through the code.

I used Nicholas Fraser’s rbiorxiv package to get a data frame containing all preprints on bioRxiv to date.

# Install package
install.packages("devtools")
devtools::install_github("nicholasmfraser/rbiorxiv")
# Load packages
library(rbiorxiv)
library(tidyverse)
library(cowplot)

# make directories for output if they don't exist
if (!dir.exists("Output")) dir.create("Output")
if (!dir.exists("Output/Plots")) dir.create("Output/Plots")
if (!dir.exists("Output/Data")) dir.create("Output/Data")

# get data
df_2021 <- biorxiv_content(from = "2021-01-01", to = "2021-06-30", limit = "*", format = "df")
df_2020a <- biorxiv_content(from = "2020-07-01", to = "2020-12-31", limit = "*", format = "df")
df_2020b <- biorxiv_content(from = "2020-01-01", to = "2020-06-30", limit = "*", format = "df")
df_2019a <- biorxiv_content(from = "2019-07-01", to = "2019-12-31", limit = "*", format = "df")
df_2019b <- biorxiv_content(from = "2019-01-01", to = "2019-06-30", limit = "*", format = "df")
df_2018 <- biorxiv_content(from = "2018-01-01", to = "2018-12-31", limit = "*", format = "df")
df_2017 <- biorxiv_content(from = "2017-01-01", to = "2017-12-31", limit = "*", format = "df")
df_2016 <- biorxiv_content(from = "2016-01-01", to = "2016-12-31", limit = "*", format = "df")
df_2013_15 <- biorxiv_content(from = "2013-01-01", to = "2015-12-31", limit = "*", format = "df")
# load into one dataframe
df_all <- rbind(df_2021,df_2020a,df_2020b,df_2019a,df_2019b,df_2018,df_2017,df_2016,df_2013_15)

I split the retrieval for the later years into half-year chunks for stability. The retrieval takes some time, so I saved the retrieved data frames with saveRDS so that I could load a local copy later if required.
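A minimal caching sketch (the file name is just a placeholder):

# cache the combined data frame so the slow API retrieval doesn't need repeating
saveRDS(df_all, "Output/Data/df_all.rds")
# in a later session, reload the local copy instead of querying the API again
df_all <- readRDS("Output/Data/df_all.rds")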

We now have all the data. How many records do we have?

# this code needs to be run after all the code blocks below
# how many preprint versions in total?
nrow(df_all)
[1] 162102
# how many unique preprints?
length(unique(df_all$doi))
[1] 117525

A preprint that has been revised twice will have three records: versions 1, 2 and 3. So how many preprints have only a single version on bioRxiv, with no revisions?

# this code needs to be run after all the code blocks below
# single version preprints
length(unique(df_all$doi)) - nrow(udois)
[1] 86192
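If you want these counts without running the later blocks first, a self-contained sketch that counts records per DOI (rather than checking version numbers) gives essentially the same split:

# how many records (versions) each DOI has
version_count <- table(df_all$doi)
sum(version_count == 1) # preprints with a single version
sum(version_count > 1)  # preprints with more than one version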

So there are 86192 single-version preprints on bioRxiv and 31333 preprints with more than one version. How many versions are there for each of these?

# this code needs to be run after all the code blocks below
with(df_maxtime, tapply(df_maxtime$maxlag, list("Version"=maxver, "Published"=pnp), length))
       Published
Version published unpublished
    2       12289       10098
    3        3742        2561
    4        1048         725
    5         333         198
    6         112          75
    > 6        81          71

Of the 31K preprints with more than one version, most (22K) are revised once. There are ~150 that have been revised six times or more.

What is the most-revised preprint on bioRxiv?

# what is the maximum version number?
max(df_all$version)
[1] 25

The most-revised preprint is currently on version 25! It is this one, currently unpublished, so there may still be more revisions.
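To see which preprint holds the record, a one-line sketch (reusing df_all from above) pulls out its DOI:

# DOI of the record with the highest version number
df_all$doi[which.max(df_all$version)]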

How long do preprint revisions take?

To look at this question, we need to focus on preprints that have 2 or more versions.

# look at preprints with 2 or more versions
df_sub <- subset(df_all, version > 1)
# unique DOIs of these multi-version preprints, each with a numeric id
udois <- as.data.frame(unique(df_sub$doi))
udois$id <- seq_len(nrow(udois))
names(udois) <- c("doi","id")
# keep every version (including version 1) of the multi-version preprints
df_multi <- merge(df_all, udois, by = "doi")
df_multi$date <- as.Date(df_multi$date)
df_multi$revision <- as.factor(df_multi$version)

Here we subset for preprints with two or more versions. Because we now have multiple records for these preprints, we assign a unique id to each one. Next we can calculate the intervals (in days) between the first version and any revised versions of the same preprint. I did this with an old-school for loop:

# calculate difference in days between each version and initial version
df_multi$diff <- 0
for(i in 1:nrow(df_multi)) {
  thedoi <- df_multi$doi[i]
  version1 <- subset(df_multi, doi == thedoi & version == 1)
  if(nrow(version1) == 0) {
    # version 1 is missing from the retrieved data, so the interval is unknown
    df_multi$diff[i] <- NA
  } else {
    date1 <- version1$date
    difference <- df_multi$date[i] - date1
    df_multi$diff[i] <- difference
  }
}
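For comparison, a vectorized sketch using dplyr (loaded via tidyverse) should produce the same diff column without the per-row subsetting; it assumes the df_multi columns created above:

# days from each preprint's version 1 to every other version (NA if version 1 is missing)
df_multi <- df_multi %>%
  group_by(doi) %>%
  mutate(diff = as.numeric(date - date[version == 1][1])) %>%
  ungroup()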

We now have the data we need, but it would be good to focus on the longest revision times first.

# order by longest revision
udois$maxrev <- 0
udois$maxlag <- 0
for(i in 1:nrow(udois)) {
  version1 <- subset(df_multi, id == i)
  revver <- max(version1$version)
  longest <- max(version1$diff)
  udois$maxrev[i] <- revver
  udois$maxlag[i] <- longest
}
udois <- udois[order(udois$maxlag, decreasing = TRUE),]
udois$rankrev <- seq_len(nrow(udois))
df_multi <- merge(df_multi,udois, by = "doi")
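Again for comparison, a dplyr sketch of the same per-preprint summary, written to a separate object (udois_alt) so it doesn't overwrite anything; it assumes the diff column computed above:

# per-DOI summary: highest version number and longest interval since version 1
udois_alt <- df_multi %>%
  group_by(doi) %>%
  summarise(maxrev = max(version), maxlag = max(diff), .groups = "drop") %>%
  arrange(desc(maxlag)) %>%
  mutate(rankrev = row_number())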

Now let’s look at the data:

p1 <- ggplot(df_multi, aes(x = version, y = diff, colour = doi)) +
  geom_line(alpha = 0.2) +
  labs(x = "Version", y = "Cumulative time (days)") +
  theme_cowplot() +
  theme(legend.position = "none")
p1
ggsave("Output/Plots/preprintVersions.png", p1, height = 5, width = 6, dpi = 300)

cbp1 <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
cbp1_rotate <- c(cbp1,cbp1,cbp1,cbp1)

p2 <- ggplot(df_multi, aes(x = diff, y = rankrev, group = doi, color = revision)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = cbp1_rotate) +
  ylim(1,100) +
  labs(y = "Rank", x = "Cumulative time (days)") +
  theme_cowplot() +
  theme(legend.position = "none")
p2
ggsave("Output/Plots/preprintRevisions.png", p2, height = 8, width = 6, dpi = 300)

The first graph is very messy but it shows the cumulative time that preprint revisions take. It’s messy because there are a lot of preprints with short revisions and many preprints with low version numbers. Shallow lines indicate many revisions in a short space of time; steep lines indicate a few long revisions.

The second graph shows the “Top 100” longest revisions on bioRxiv. The preprint with the longest interval between first deposit and the latest version is over 5 years! The colour coding shows each revision and how long it stayed current on bioRxiv.

So far we’ve looked at outliers. What about the whole dataset?

Published vs unpublished?

It’s useful at this point to segregate the data into published and unpublished preprints. Published preprints cannot be revised further (there are exceptions, but this is generally true), whereas unpublished preprints may be revised again.

# look at published vs unpublished, unpublished is character "NA"
df_pubstat <- data.frame(doi = df_multi$doi,
                         published = df_multi$published)
df_maxtime <- merge(udois, df_pubstat, by = "doi", all.x = TRUE)
df_maxtime <- df_maxtime[!duplicated(df_maxtime),]
df_maxtime$pnp <- as.factor(ifelse(df_maxtime$published == "NA", "unpublished", "published"))
df_maxtime$maxver <- as.factor(ifelse(df_maxtime$maxrev < 7, df_maxtime$maxrev, "> 6"))
df_maxtime$maxver <- factor(df_maxtime$maxver, levels = c(2,3,4,5,6, "> 6"))

With this bit of code we do the segregation and also make a factor with ordered levels for preprints with a maximum version number of 2 to 6 or greater than 6. Now let’s look at the revision time between the deposit and the final (or latest) version.

p3 <- ggplot(df_maxtime, aes(x = maxlag)) +
  geom_histogram(binwidth = 31) +
  facet_wrap(~pnp) +
  labs(x = "Total time (days)", y = "Count") +
  theme_cowplot() +
  theme(legend.position = "none")
p3
ggsave("Output/Plots/preprintPubUnpub.png", p3, height = 6, width = 8, dpi = 300)

p4 <- ggplot(df_maxtime, aes(x = maxlag)) +
  geom_histogram(binwidth = 31) +
  facet_wrap(maxver ~ pnp, scales = "free_y") +
  labs(x = "Total time (days)", y = "Count") +
  theme_cowplot() +
  theme(legend.position = "none")
p4
ggsave("Output/Plots/preprintPubUnpubByRev.png", p4, height = 6, width = 8, dpi = 300)

The histograms show the total time from initial deposit until the final or latest version, broken down by latest version number and/or publication status. Bin widths are approximately 1 month.

So what are the times for revision?

# a few pesky NAs in the dataframe
df_maxtime$maxlag[is.na(df_maxtime$maxlag)] <- 0
# summary of median times
with(df_maxtime, tapply(df_maxtime$maxlag, list("Version"=maxver, "Published"=pnp), median))
       Published
Version published unpublished
    2          77        41.0
    3         160       140.0
    4         218       209.0
    5         255       254.5
    6         308       285.0
    > 6       300       388.0

with(df_maxtime, tapply(df_maxtime$maxlag, list("Published"=pnp), median))
Published
  published unpublished 
        104          68 

The unpublished preprints have shorter times to get to the same version number, which sort of makes sense, but there is an issue here that we need to deal with…

There is a large spike in the first month. These are preprints being revised very rapidly. I would suggest these are trivial revisions: missing acknowledgements, an incorrectly formatted figure, and so on. Let’s filter out revisions that take 10 days or less and assume that meaningful revisions must take longer.

df_maxtime_filt <- subset(df_maxtime, maxlag > 10)
with(df_maxtime_filt, tapply(df_maxtime_filt$maxlag, list("Version"=maxver, "Published"=pnp), median))
       Published
Version published unpublished
    2       111.0       106.0
    3       170.0       170.0
    4       222.5       226.0
    5       255.0       257.5
    6       308.0       285.0
    > 6     300.0       388.0

with(df_maxtime_filt, tapply(df_maxtime_filt$maxlag, list("Published"=pnp), median))
Published
  published unpublished 
        133         132 

Now the picture looks similar for published and unpublished preprints at each version number. The median total revision time is just over four months. This is obviously shorter than publication lag times at journals, since it doesn’t include the reassessment of a paper and subsequent publication.

Conclusion

For bioRxiv preprints where revisions have been posted more than 10 days after the initial version, the median total revision time is over four months. For preprints with only one revision, it is around three months; with more revisions, this time stretches out to eight months or more.

This is a complex dataset and I’ve noted some caveats as I went along. This analysis did not take into account any differences between subject categories, which may be significant due to field-specific differences in preprinting and publishing behaviour. The most frequent revision time is short: within one month. I side-stepped this issue by assuming these revisions were trivial, but in some fields meaningful revisions may take less time.

The biggest limitation of this dataset is that it is unclear when the initial deposit and final version occur in the life cycle of a paper. It is unlikely that all papers are deposited on bioRxiv before, or concomitantly with, initial submission to a journal. It is even more difficult to know what the final/latest version represents. It could be the version just before resubmission to the journal that publishes the paper, or it may be a revised version that starts the process off again at a new journal.

The post title is from “Over and Over” by The Beat from their album Wha’ppen? I currently have 11 different tracks called “Over and Over” in my library; they could come in handy in the future!
