Just how many retracted articles are there in PubMed anyway?

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.

Take the seemingly-simple question “how many retracted articles are there in PubMed?”

Well, one way is to search for records with the publication type “Retracted Publication”. As of right now, that returns a count of 3550.

library(rentrez)

# single quotes outside, so the double-quoted phrase reaches PubMed intact
retracted <- entrez_search("pubmed", '"Retracted Publication"[PTYP]')
retracted$count
# [1] "3550"

Another starting point is retraction notices – the publications which announce retractions. We search for those using the type “Retraction of Publication”.

retractions <- entrez_search("pubmed", '"Retraction of Publication"[PTYP]')
retractions$count
# [1] "3769"

So there are more retraction notices than retracted articles. Furthermore, a single retraction notice can refer to more than one retracted article. If we download all retraction notices as PubMed XML (file retractionOf.xml), we see that the retracted articles referred to by a retraction notice are stored under the node named CommentsCorrectionsList:

        <CommentsCorrectionsList>
            <CommentsCorrections RefType="RetractionOf">
                <RefSource>Ochalski ME, Shuttleworth JJ, Chu T, Orwig KE. Fertil Steril. 2011 Feb;95(2):819-22</RefSource>
                <PMID Version="1">20889152</PMID>
            </CommentsCorrections>
        </CommentsCorrectionsList>

Some retraction notices have no CommentsCorrectionsList at all. Where it is present, some CommentsCorrections entries lack a PMID but always (I think) have a RefSource. So we can count up the retracted articles referred to by retraction notices like this:

library(XML)

doc.retOf <- xmlTreeParse("retractionOf.xml", useInternalNodes = TRUE)
ns.retOf <- getNodeSet(doc.retOf, "//MedlineCitation")
sources.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/RefSource", xmlValue))

# count RefSource per retraction notice - first 10
head(sapply(sources.retOf, length), 10)
# [1] 0 1 1 1 1 1 1 1 1 1

# total RefSource
sum(sapply(sources.retOf, length))
# [1] 3898
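To make the three cases concrete (no CommentsCorrectionsList, an entry with a PMID, an entry without one), here is a minimal sketch that runs the same XPath against a tiny hand-written document. The XML content, the PMIDs and the helper count_children are all made up for illustration; only the element names come from the PubMed XML shown earlier.

```r
library(XML)

# Toy input, invented for illustration: three retraction notices
toy <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">1</PMID>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">2</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author A. Journal X. 2001;1(1):1-2</RefSource>
        <PMID Version="1">101</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">3</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author B. Journal Y. 2002;2(2):3-4</RefSource>
      </CommentsCorrections>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author C. Journal Z. 2003;3(3):5-6</RefSource>
        <PMID Version="1">102</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc.toy <- xmlParse(toy, asText = TRUE)
ns.toy  <- getNodeSet(doc.toy, "//MedlineCitation")

# hypothetical helper: count a given child element of RetractionOf
# entries, separately for each notice
count_children <- function(nodes, child) {
  xp <- paste0(".//CommentsCorrections[@RefType='RetractionOf']/", child)
  sapply(nodes, function(x) length(xpathSApply(x, xp, xmlValue)))
}

n.source <- count_children(ns.toy, "RefSource")  # 0 1 2
n.pmid   <- count_children(ns.toy, "PMID")       # 0 1 1
```

The notice's own PMID sits directly under MedlineCitation, so it is not picked up by the CommentsCorrections path; the gap between the two vectors is exactly the entries that carry a RefSource but no PMID.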

It appears then that retraction notices refer to 3 898 articles, but only 3 550 of type “Retracted Publication” are currently indexed in PubMed. Next question: of the PMIDs for retracted articles linked to from retraction notices, how many match the PMIDs in the PubMed XML file downloaded for all records of type “Retracted Publication” (retracted.xml)?

# "retracted publication"
doc.retd <- xmlTreeParse("retracted.xml", useInternalNodes = TRUE)
pmid.retd <- xpathSApply(doc.retd, "//MedlineCitation/PMID", xmlValue)
# "retraction of publication"
pmid.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/PMID", xmlValue))

# count PMIDs linked to from retraction notice
sum(sapply(pmid.retOf, length))
# [1] 3524

# and how many correspond with "retracted publication"
length(which(unlist(pmid.retOf) %in% pmid.retd))
# [1] 3524

So there are, apparently, 26 (3550 – 3524) retracted articles that have a PMID which is not referred to by any retraction notice.
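The subtraction gives a count but not the articles themselves, and it quietly assumes that no PMID is cited by more than one notice. A small sketch of how the unlinked PMIDs could be listed with setdiff, using invented PMIDs and toy variable names rather than the real data:

```r
# Hypothetical PMIDs, for illustration only
pmid.retd.toy  <- c("101", "102", "103", "104")   # "Retracted Publication" records
pmid.retOf.toy <- list(character(0),              # notice with no linked PMID
                       "101",                     # notice linking one article
                       c("102", "103"))           # notice linking two articles

# de-duplicate in case two notices cite the same article
linked <- unique(unlist(pmid.retOf.toy))

# retracted articles whose PMID no notice refers to
unlinked <- setdiff(pmid.retd.toy, linked)        # "104"

# PMIDs cited by notices but absent from the retracted set
orphan <- setdiff(linked, pmid.retd.toy)          # character(0)
```

On the real objects the same two calls would take pmid.retd and unlist(pmid.retOf); comparing length(linked) with the raw total would also show whether any PMID is cited more than once.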

In summary
It’s like the old “how long is a piece of string”, isn’t it? To summarise, as of this moment:

  • PubMed contains 3 769 retraction notices
  • Those notices reference 3 898 sources, of which 3 524 have PMIDs
  • A further 26 retracted articles have a PMID not referenced by a retraction notice

What do we make of the (3898 – 3550) = 348 articles referenced by a retraction notice, but not indexed by PubMed? Could they be in journals that were not indexed when the article was published, but indexing began prior to publication of the retraction notice?
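For anyone checking the sums, the figures above fit together as follows. This is only arithmetic on the counts already reported; the reading in the last comment (that each of the 26 unlinked retracted articles is one of the sources cited without a PMID) is an assumption, not something the data here confirms.

```r
n.sources   <- 3898  # RefSource entries across all retraction notices
n.with.pmid <- 3524  # of those, entries that also carry a PMID
n.retracted <- 3550  # records of type "Retracted Publication"

n.no.pmid  <- n.sources - n.with.pmid    # 374 sources cited without a PMID
n.unlinked <- n.retracted - n.with.pmid  # 26 retracted articles not linked by PMID

n.sources - n.retracted                  # 348, the figure quoted above
# same number, if (assumption) the 26 unlinked articles are all among the 374
n.no.pmid - n.unlinked                   # 348
```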

You can see from all this that linking retraction notices with the associated retracted articles is not easy. And if you want to do interesting analyses such as time to retraction – well, don’t even get me started on PubMed dates…

