Analysis of retractions in PubMed

Posted on November 30, 2010 by nsaunders in R bloggers, Uncategorized

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

As so often happens these days, a brief post at FriendFeed got me thinking about data analysis. Entitled “So how many retractions are there every year, anyway?”, the post links to this article at Retraction Watch. It discusses ways to estimate the number of retractions and in particular, a recent article in the Journal of Medical Ethics (subscription only, sorry) which addresses the issue.

As Christina pointed out in a comment at Retraction Watch, there are thousands of scientific journals of which PubMed indexes only a fraction. However, PubMed is relatively easy to analyse using a little Ruby and R. So, here we go…

Code and raw data used for this post are available at Github.

1. Searching for retractions
In the Journal of Medical Ethics article, the authors state: “Every research paper noted as retracted in the PubMed database from 2000 to 2010 was evaluated. PubMed was searched on 22 January 2010 with the limits of ‘items with abstracts, retracted publication, English.’ A total of 788 retracted papers were identified…”

Not a bad approach. There’s another way: at the PubMed website, find a retraction and examine the record in XML format. You’ll see this:

<PublicationTypeList>
  <PublicationType>Retraction of Publication</PublicationType>
</PublicationTypeList>

The equivalent in Medline format is:

PT  - Retraction of Publication

This means that retractions have a particular type: Publication Type, or PTYP for short. If you search at the PubMed website using the term “Retraction of Publication[Publication Type]”, you will retrieve (at the time of writing) ~1621 records.
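
The same search can also be run programmatically. What follows is a minimal sketch, not the code used for this post: it assumes the NCBI E-utilities ESearch endpoint and its `rettype=count` option, and the helper names `count_url` and `parse_count` are made up for illustration. It only builds the URL and shows how the count could be read from the XML response; the network call itself is left as a comment.

```ruby
require "uri"

# Base URL of the NCBI E-utilities ESearch service (assumed endpoint).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an ESearch URL that asks only for the number of matching records.
def count_url(term)
  query = URI.encode_www_form(db: "pubmed", term: term, rettype: "count")
  "#{ESEARCH}?#{query}"
end

# Pull the integer out of the <Count> element of an ESearch XML response.
# Returns nil if no <Count> element is found.
def parse_count(xml)
  match = xml.match(%r{<Count>(\d+)</Count>})
  match && match[1].to_i
end

url = count_url("Retraction of Publication[Publication Type]")
puts url
# To run the query for real:
#   require "net/http"
#   parse_count(Net::HTTP.get(URI(url)))
```

A regular expression is used instead of an XML parser only to keep the sketch dependency-free; a real script would probably parse the response properly.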

2. Retrieving retraction counts by year
Armed with this information, we can modify the Ruby code that I’ve posted previously to retrieve total and retracted publications between 1900 and 2010. This generates a tab-delimited file with 3 columns: year, total publications and retracted publications.
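
For readers who want the general shape of that step, here is a rough sketch; it is not the script from the earlier post. It assumes that the PubMed field tags `[DP]` (date of publication) and `[PTYP]` can be combined with `AND`, and the names `year_term` and `count_rows` are hypothetical. The function that actually queries PubMed is passed in as `counter`, so the example runs offline with made-up numbers.

```ruby
# Build the PubMed search term for one year; pass retracted: true to
# restrict the search to retraction notices.
def year_term(year, retracted: false)
  term = "#{year}[DP]"
  retracted ? "Retraction of Publication[PTYP] AND #{term}" : term
end

# Produce one tab-delimited row per year: year, total, retracted.
# `counter` is any callable that maps a search term to a record count,
# for example a lambda wrapping an E-utilities request.
def count_rows(years, counter)
  years.map do |year|
    total = counter.call(year_term(year))
    retracted = counter.call(year_term(year, retracted: true))
    [year, total, retracted].join("\t")
  end
end

# Stand-in counter with made-up numbers, so the sketch runs without network.
fake = ->(term) { term.include?("Retraction") ? 3 : 1000 }
puts count_rows(1977..1979, fake)
```

Swapping `fake` for a lambda that performs the real query would give the three-column file described above.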

3. Retraction count analysis
Here’s the R code to analyse the retraction counts. There are no recorded retractions until 1977, so we’ll start from that year.

First, a simple plot of retractions for each year. So, retractions are increasing rapidly. No surprise there, since the total number of publications per year is also increasing rapidly. We need some kind of normalization.

Figure: PubMed retractions 1977 – 2010

Chris got there first with this graphic, showing retractions each year per 100 000 publications. Here’s my version. Indeed, it seems that with each year, retractions constitute a greater proportion of publications for that year.

Figure: PubMed retractions 1977 – 2010 (per 100K by year)

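
The normalization is simply retractions divided by total publications, scaled to 100 000. As a small illustration (in Ruby rather than the R used for the plots, and with invented numbers instead of the real counts), assuming input lines in the year / total / retracted layout described earlier; `per_100k` and `parse_counts` are names made up for this sketch:

```ruby
# Retractions per 100 000 publications for a single year.
def per_100k(retracted, total)
  return 0.0 if total.zero?
  retracted * 100_000.0 / total
end

# Parse "year<TAB>total<TAB>retracted" lines into hashes.
def parse_counts(tsv)
  tsv.each_line.map do |line|
    year, total, retracted = line.split("\t").map(&:to_i)
    { year: year, total: total, retracted: retracted }
  end
end

# Illustrative numbers only, not real PubMed counts.
sample = "2000\t500000\t25\n2001\t520000\t39\n"
parse_counts(sample).each do |row|
  puts format("%d\t%.1f", row[:year], per_100k(row[:retracted], row[:total]))
end
```
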
Another way to examine the trend is to use the cumulative sum of both total publications and retractions over time. In other words for each year, instead of looking at the numbers for just that year, we look at the total records accumulated in PubMed to date. Here’s that plot. This shows a smoother upwards trend, with a rapid increase from 2005 onwards.

Figure: PubMed retractions 1977 – 2010 (per 100K, cumulative)

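
To make the cumulative version concrete, here is a sketch of the calculation under the same assumptions as before (Ruby instead of R, invented numbers, hypothetical function names `cumulative` and `cumulative_rate`): both series are turned into running totals first, and the per-100K rate is then computed from those totals.

```ruby
# Running totals of an array: [1, 2, 3] becomes [1, 3, 6].
def cumulative(values)
  sum = 0
  values.map { |v| sum += v }
end

# Cumulative retractions per 100 000 cumulative publications, one value per year.
def cumulative_rate(totals, retractions)
  cumulative(totals).zip(cumulative(retractions)).map do |total, retracted|
    total.zero? ? 0.0 : retracted * 100_000.0 / total
  end
end

# Illustrative numbers only, not real PubMed counts.
p cumulative_rate([100_000, 100_000], [1, 3])
```

Because each point averages over every earlier year, a single unusual year moves the curve much less than it does in the per-year plot, which is one reason the trend looks smoother.
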
Finally, we can compare the growth rate of total and retracted publications. One way to do this is to choose 1977 as the baseline and for each year, calculate the percentage increase in both publication types relative to 1977. Here’s the result. This is somewhat alarming. Whilst there are about 4x as many total publications in PubMed now as there were in 1977, the total number of retractions has risen almost 550x.

Figure: Percent increase relative to 1977, cumulative
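
The baseline comparison can be sketched as follows. This is an illustration in Ruby with invented numbers, not the R code behind the plot, and `percent_increase` is a hypothetical name. It assumes the first element of each series is the baseline year and that the baseline is non-zero, which is why the analysis starts in 1977, the first year with recorded retractions.

```ruby
# Percentage increase of each value relative to the first (baseline) value.
def percent_increase(values)
  baseline = values.first.to_f
  raise ArgumentError, "baseline must be non-zero" if baseline.zero?
  values.map { |v| (v - baseline) / baseline * 100 }
end

# Illustrative cumulative series only, not real PubMed counts.
publications = [1_000, 2_000, 4_000]
retractions  = [2, 40, 1_100]

p percent_increase(publications)
p percent_increase(retractions)
```

Note that a multiple and a percentage increase differ by one baseline: a series that ends at 4x its starting value has increased by 300%.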

4. Analysis of Medline data
Using the search term described earlier in the post to retrieve retractions, we can download a file in Medline format. Medline records contain various fields of interest, including the ROF (retraction of) line, describing the publication that was retracted.

Or – as it turns out in some cases – publications. One retraction record may include the retraction of several publications, as we can see with a simple grep:

grep -c "^PMID" retractions.medline && grep -c "^ROF" retractions.medline
1621
1705

We won’t worry about that too much, since the majority of retraction records reference one publication.
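
To check that claim rather than take it on trust, one could count ROF lines per record. A minimal sketch, assuming every record starts with a line beginning `PMID` and every retracted publication has its own line beginning `ROF`, as the grep above implies; the function name `rof_per_record` and the sample data are made up:

```ruby
# Count ROF lines per record in Medline-format text. A new record is
# assumed to start at each line beginning with "PMID".
def rof_per_record(medline)
  counts = []
  medline.each_line do |line|
    if line.start_with?("PMID")
      counts << 0
    elsif line.start_with?("ROF") && !counts.empty?
      counts[-1] += 1
    end
  end
  counts
end

# Tiny made-up sample: the first record retracts two papers, the second one.
sample = <<~MEDLINE
  PMID- 1
  ROF - First retracted paper
  ROF - Second retracted paper

  PMID- 2
  ROF - Third retracted paper
MEDLINE

# Distribution of ROF lines per record (Array#tally needs Ruby 2.7 or later).
p rof_per_record(sample).tally
```

Running `rof_per_record(File.read("retractions.medline")).tally` on the downloaded file would show how many records reference one, two or more publications.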

Here is some R code that performs two simple, similar analyses of the Medline file. First, the top 10 journals for retractions:

                            so Freq
667   Proc Natl Acad Sci U S A   54
707                    Science   52
590                     Nature   42
388                J Biol Chem   32
450                  J Immunol   28
157                       Cell   20
92  Biochem Biophys Res Commun   16
116                      Blood   16
413              J Clin Invest   15
566              Mol Cell Biol   15

A brief glance at that list suggests that higher impact factor = more retractions. We would want to know the total number of publications for those journals to make more sense of that.

Second, the top 10 countries:

              pl Freq
45 united states  856
12       england  373
28   netherlands   83
15       germany   47
23         japan   42
6          china   25
2      australia   19
24 korea (south)   19
10       denmark   17
42   switzerland   14

Not especially surprising; the ones with the most researchers/scientific output. Again, we’d want more data before drawing any conclusions.
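
Both tallies above boil down to counting the values of one Medline field and sorting. The sketch below shows that idea in Ruby; it is not the R code used for the tables. It assumes the usual Medline layout of a tag, padding, `- ` and then the value. The tag names `TA` (journal title abbreviation) and `PL` (place of publication) are my assumption; the `so` and `pl` column names in the output above suggest the original used the SO and PL fields. The function name `top_values` and the sample records are invented.

```ruby
# Tally the values of one Medline tag (e.g. "TA" or "PL") and return the
# n most frequent as [value, count] pairs, most frequent first.
def top_values(medline, tag, n = 10)
  counts = Hash.new(0)
  medline.each_line do |line|
    field, value = line.chomp.split("- ", 2)
    next unless value && field.strip == tag
    counts[value.strip] += 1
  end
  # Sort by descending count, breaking ties alphabetically.
  counts.sort_by { |value, count| [-count, value] }.first(n)
end

# Made-up records for illustration only.
sample = <<~MEDLINE
  PMID- 1
  TA  - Science
  PL  - United States

  PMID- 2
  TA  - Nature
  PL  - England

  PMID- 3
  TA  - Science
  PL  - United States
MEDLINE

p top_values(sample, "TA", 2)
p top_values(sample, "PL", 2)
```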

Final thoughts

Analysis of all kinds of data from PubMed is relatively straightforward. As to the factors underlying the recent rise in retractions: the JME article focuses on fraud. Your thoughts are welcome.
It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.
The post at Retraction Watch lists a variety of estimates for numbers of retractions: 328 from 1995-2004, 529 from 1988-2008 and, most amusingly, 95 in 2008 – for the entire Thomson Reuters Science Citation Index. Given that there are 237 records in PubMed alone for 2008, you have to wonder what the Times Higher Education Supplement paid for the latter study. And people wonder why we don’t trust impact factors.