Findings increasingly novel, scientists say…

Posted on October 29, 2010 by nsaunders in R bloggers, Uncategorized

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

…was the tongue-in-cheek title of an image that I posted to Twitpic this week. It shows the usage of the word “novel” in PubMed article titles over time. As someone correctly pointed out at FriendFeed, it needs to be corrected for total publications per year.

It was inspired by a couple of items that caught my attention. First, a question at BioStar with the self-explanatory title Locations of plots of quantities of publicly available biological data. Second, an item at FriendFeed musing on the (over?) use of the word “insight” in scientific publications.

I’m sure that quite recently, I’ve read a letter to a journal which analysed the use of phrases such as “novel insights” in articles over time, but it’s currently eluding my search skills. So here’s my simple roll-your-own approach, using a little Ruby and R.

Initially, I entered “novel[Title]” at the PubMed website, downloaded all 143 031 results in Medline format and parsed the “DP” (publication date) field. Useful, in that I learned the year of the earliest title (1845); inefficient, in that the resulting download is ~ 397 MB.
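
For the curious, parsing the “DP” field can look something like the sketch below. It is not the code I actually ran: the method name count_by_year and the sample records are made up for illustration, and it assumes that each Medline record carries a line starting with "DP  - " followed by a four-digit year.

```ruby
# Count records per publication year from Medline-format text.
# Assumption: each record has a line such as "DP  - 2009 Oct 12".
def count_by_year(medline_text)
  counts = Hash.new(0)
  medline_text.each_line do |line|
    # The year is the first four digits after the "DP  - " tag.
    if (m = line.match(/\ADP\s+-\s+(\d{4})/))
      counts[m[1].to_i] += 1
    end
  end
  counts
end

# Toy input, invented for illustration only (not real PubMed records).
sample = <<~MEDLINE
  PMID- 1
  DP  - 2009 Oct 12
  TI  - A novel approach to something

  PMID- 2
  DP  - 2009
  TI  - Novel insights into something else

  PMID- 3
  DP  - 1998 Jan-Feb
  TI  - A novel gene
MEDLINE

# Print one "year<TAB>count" line per year, oldest first.
count_by_year(sample).sort.each { |year, n| puts "#{year}\t#{n}" }
```

For a file of several hundred MB you would probably read line by line with File.foreach rather than loading the whole text into memory.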

Fortunately, BioRuby comes with a nice set of methods for search and retrieval from the NCBI Entrez databases, including esearch_count() – as the name suggests, it simply counts returned results for a query.

So searching PubMed for (1) all articles published from 1845 – 2009 and (2) those articles with the word “novel” in the title or abstract is as simple as this:

#!/usr/bin/ruby

require "rubygems"
require "bio"

Bio::NCBI.default_email = "[email protected]"
ncbi = Bio::NCBI::REST.new

1845.upto(2009) do |year|
  all   = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
  novel = ncbi.esearch_count("novel[tiab] #{year}[dp]", {"db" => "pubmed"})
  puts "#{year}\t#{all}\t#{novel}"
end

Save and run that as pmnovel.rb > pmdata.txt. Obviously, we’re having a bit of fun here. You could search for any terms that you like and in a real script, you’d probably want to specify the terms and date range as command-line options.
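
A minimal sketch of what those command-line options might look like, using OptionParser from the Ruby standard library. The flag names, defaults and the helper methods parse_options and queries_for are all hypothetical choices of mine, not part of the script above. To keep the example self-contained it only builds and prints the query strings; in a real script you would pass ARGV and hand the queries to ncbi.esearch_count.

```ruby
require "optparse"

# Parse command-line arguments into a settings hash.
# Flag names and defaults are hypothetical choices for this sketch.
def parse_options(argv)
  options = { term: "novel", field: "tiab", from: 1845, to: 2009 }
  OptionParser.new do |opts|
    opts.banner = "Usage: pmnovel.rb [options]"
    opts.on("-t", "--term TERM", "Search term") { |v| options[:term] = v }
    opts.on("-f", "--field FIELD", "PubMed field tag") { |v| options[:field] = v }
    opts.on("--from YEAR", Integer, "First year") { |v| options[:from] = v }
    opts.on("--to YEAR", Integer, "Last year") { |v| options[:to] = v }
  end.parse(argv)
  options
end

# Build the two query strings for one year: all articles, and matching articles.
def queries_for(year, options)
  ["#{year}[dp]", "#{options[:term]}[#{options[:field]}] #{year}[dp]"]
end

# Example arguments; a real script would use ARGV here instead.
options = parse_options(%w[--term insight --from 2008 --to 2009])
options[:from].upto(options[:to]) do |year|
  all_query, term_query = queries_for(year, options)
  puts "#{year}\t#{all_query}\t#{term_query}"
end
```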

Next, load the tab-delimited output file into R for some simple plotting.

library(ggplot2)
pmdata <- read.table("pmdata.txt", sep = "\t")
colnames(pmdata) <- c("year", "total", "novel")
pmdata$freq <- pmdata$novel/pmdata$total
# make year = end of year; then make it a real date
pmdata$year <- paste(pmdata$year, "12", "31", sep = "-")
pmdata$year <- as.Date(pmdata$year)
# reshape the data and plot each variable
pm <- melt(pmdata, id = "year")
png(file = "pmdata.png", width = 800, height = 600)
print(ggplot(pm, aes(year, value)) + geom_line(aes(color = factor(variable))) +
  scale_x_date(format = "%Y", major = "15 years") + opts(title = "Novelty 1845 - 2009") +
  facet_grid(variable ~ ., scale = "free_y") + scale_colour_discrete(legend = FALSE))
dev.off()

And here’s the result (pmdata.png).

There you have it. We see a steady post-WWII increase in total publications (top panel), increasing more sharply around 1995. The exponential increase in “novel” findings (middle panel) looks like it begins in the early 1980s. And the fraction of total publications that are “novel” (bottom panel) also begins to increase in the 1980s and is now at an all-time high. Last year (2009), ~ 6.1% of findings were “novel”, compared with an all-time proportion, sum(pmdata$novel)/sum(pmdata$total), of ~ 2.3%.
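
If you would rather stay in Ruby for the arithmetic, the per-year fraction and the pooled all-time proportion can be sketched as below. The method names and the toy numbers are invented for illustration; they are not PubMed counts. The sketch also guards against years with zero indexed articles, where the equivalent division in R gives NaN.

```ruby
# Parse "year<TAB>total<TAB>novel" lines into an array of hashes.
def parse_counts(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    year, total, novel = line.split("\t").map(&:to_i)
    { year: year, total: total, novel: novel }
  end
end

# Fraction for a single year; nil when no articles were indexed that year.
def yearly_fraction(row)
  row[:total].zero? ? nil : row[:novel].to_f / row[:total]
end

# Pooled proportion over all years: sum(novel) / sum(total).
def pooled_fraction(rows)
  rows.sum { |r| r[:novel] }.to_f / rows.sum { |r| r[:total] }
end

# Toy numbers, invented for illustration only (not real PubMed counts).
toy = "2007\t1000\t20\n2008\t2000\t60\n2009\t0\t0\n"
rows = parse_counts(toy)
rows.each { |r| puts "#{r[:year]}\t#{yearly_fraction(r).inspect}" }
puts "pooled\t#{pooled_fraction(rows)}"
```

Note that the pooled proportion weights each year by its publication count, so it is not the same thing as the mean of the yearly fractions.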

Exciting times 😉