Findings increasingly novel, scientists say…

October 29, 2010

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

…was the tongue-in-cheek title of an image that I posted to Twitpic this week. It shows the usage of the word “novel” in PubMed article titles over time. As someone correctly pointed out at FriendFeed, it needs to be corrected for total publications per year.

It was inspired by a couple of items that caught my attention. First, a question at BioStar with the self-explanatory title Locations of plots of quantities of publicly available biological data. Second, an item at FriendFeed musing on the (over?) use of the word “insight” in scientific publications.

I’m sure that quite recently, I’ve read a letter to a journal which analysed the use of phrases such as “novel insights” in articles over time, but it’s currently eluding my search skills. So here’s my simple roll-your-own approach, using a little Ruby and R.

Initially, I entered “novel[Title]” at the PubMed website, downloaded all 143 031 results in Medline format and parsed the “DP” (publication date) field. Useful, in that I learned the year of the earliest title (1845); inefficient, in that the resulting download is ~ 397 MB.
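The parsing step for that first approach could look something like the sketch below. It is not the script I originally used: it assumes each Medline record carries a line of the form "DP  - 2009 Dec", and the method name count_years and the three sample records are made up for illustration.

```ruby
# Count records per publication year from Medline-format text.
# Assumption: the date line looks like "DP  - 2009 Dec" and the year
# is the first four digits after the tag.
def count_years(source)
  counts = Hash.new(0)
  source.each_line do |line|
    counts[$1.to_i] += 1 if line =~ /\ADP\s+-\s+(\d{4})/
  end
  counts
end

# Hypothetical records, not real PubMed entries.
sample = <<~MEDLINE
  PMID- 1
  DP  - 2009 Dec
  TI  - A novel approach.

  PMID- 2
  DP  - 2009
  TI  - Novel insights.

  PMID- 3
  DP  - 1845 Jan 1
  TI  - A novel case.
MEDLINE

count_years(sample).sort.each { |year, n| puts "#{year}\t#{n}" }
```

Anything that responds to each_line will do as input, so for the real download you could pass File.open("results.txt") (a hypothetical file name) rather than a string.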

Fortunately, BioRuby comes with a nice set of methods for search and retrieval from the NCBI Entrez databases, including esearch_count() – as the name suggests, it simply counts returned results for a query.

So, searching PubMed for (1) all articles published from 1845 to 2009 and (2) those articles with the word “novel” in the title or abstract is as simple as this:


require "rubygems"
require "bio"

Bio::NCBI.default_email = "[email protected]"
ncbi = Bio::NCBI::REST.new

1845.upto(2009) do |year|
  # all PubMed records with this publication year
  all   = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
  # records that also have "novel" in the title or abstract
  novel = ncbi.esearch_count("novel[tiab] #{year}[dp]", {"db" => "pubmed"})
  puts "#{year}\t#{all}\t#{novel}"
end

Save that as pmnovel.rb and run it as ruby pmnovel.rb > pmdata.txt. Obviously, we’re having a bit of fun here. You could search for any terms that you like and in a real script, you’d probably want to specify the terms and date range as command-line options.
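As a rough sketch of what the command-line version might look like, here is one way to do it with Ruby's standard OptionParser. The option names (--term, --from, --to) and the helper names parse_options and queries_for are my own inventions for illustration, and the sketch only builds the query strings rather than contacting NCBI.

```ruby
require "optparse"

# Parse hypothetical options; defaults reproduce the search in this post.
def parse_options(argv)
  opts = { term: "novel", from: 1845, to: 2009 }
  OptionParser.new do |o|
    o.banner = "Usage: pmnovel.rb [options]"
    o.on("-t", "--term TERM", "Search term") { |v| opts[:term] = v }
    o.on("-f", "--from YEAR", Integer, "First year") { |v| opts[:from] = v }
    o.on("-e", "--to YEAR", Integer, "Last year") { |v| opts[:to] = v }
  end.parse!(argv)
  opts
end

# Build the two PubMed queries for one year: all records, then records
# with the term in the title or abstract.
def queries_for(term, year)
  ["#{year}[dp]", "#{term}[tiab] #{year}[dp]"]
end

# Example arguments given inline; a real script would pass ARGV instead.
opts = parse_options(%w[--term insight --from 2007 --to 2009])
opts[:from].upto(opts[:to]) do |year|
  all_query, term_query = queries_for(opts[:term], year)
  puts "#{year}\t#{all_query}\t#{term_query}"
end
```

In the real script, each query string would be handed to esearch_count() exactly as in the loop above.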

Next, load the tab-delimited output file into R for some simple plotting.

library(reshape2)
library(ggplot2)

pmdata <- read.table("pmdata.txt", sep = "\t")
colnames(pmdata) <- c("year", "total", "novel")
pmdata$freq <- pmdata$novel / pmdata$total
# make year = end of year; then make it a real date
pmdata$year <- paste(pmdata$year, "12", "31", sep = "-")
pmdata$year <- as.Date(pmdata$year)
# reshape the data and plot each variable
pm <- melt(pmdata, id = "year")
png(file = "pmdata.png", width = 800, height = 600)
print(ggplot(pm, aes(year, value)) +
  geom_line(aes(color = variable)) +
  scale_x_date(date_labels = "%Y", date_breaks = "15 years") +
  ggtitle("Novelty 1845 - 2009") +
  facet_grid(variable ~ ., scales = "free_y") +
  theme(legend.position = "none"))
dev.off()

And here’s the result.

There you have it. We see a steady post-WWII increase in total publications (top panel), increasing more sharply around 1995. The exponential increase in “novel” findings (middle panel) looks like it begins in the early 1980s. And the fraction of total publications that are “novel” (bottom panel) also begins to increase in the 1980s and is now at an all-time high. Last year (2009), ~ 6.1% of publications were “novel”, compared with an all-time proportion, sum(pmdata$novel)/sum(pmdata$total), of ~ 2.3%.
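Note that the all-time figure is a pooled ratio (all “novel” records over all records), not the average of the yearly fractions. A small Ruby sketch of the same arithmetic, assuming the year, total, novel column layout of pmdata.txt; the two rows are made-up numbers, not real PubMed counts, and the method names are hypothetical:

```ruby
# Parse tab-delimited "year<TAB>total<TAB>novel" rows into integer triples.
def parse_rows(text)
  text.each_line.map(&:chomp).reject(&:empty?).map do |line|
    line.split("\t").map(&:to_i)
  end
end

# Pooled proportion: sum of novel counts over sum of total counts,
# the Ruby equivalent of sum(pmdata$novel)/sum(pmdata$total).
def overall_fraction(rows)
  total = rows.sum { |_year, all, _novel| all }
  novel = rows.sum { |_year, _all, n| n }
  novel.to_f / total
end

# Invented counts for illustration only.
sample = "2000\t1000\t20\n2001\t3000\t120\n"
rows = parse_rows(sample)

rows.each { |year, all, novel| puts "#{year}\t#{novel.to_f / all}" }
puts "overall\t#{overall_fraction(rows)}"
```

Years with many publications dominate the pooled figure, so it generally differs from a plain mean of the per-year fractions.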

Exciting times 😉


PubMed novelty, 1845 – 2009

