[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.

library(rentrez)

es <- entrez_search("pubmed", '"Retracted Publication"[PTYP] 2013[PDAT]', usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, rettype = "xml", retmax = es$count)


Next question: which XML element specifies the “Date of publication” (PDAT)?

To make a long story short: there are several nodes in PubMed XML that contain the word “Date”, but the one which looks most promising is named PubDate. Given that our search used the year (2013), you might think that years can be extracted using the XPath expression //PubDate/Year. You would be mostly, but not entirely right.

library(XML)

doc <- xmlTreeParse(xml, useInternalNodes = TRUE)
table(xpathSApply(doc, "//PubDate/Year", xmlValue))
# 2013 2014
#  111    2


Well, that’s confusing. Not only do we not get the expected total number of years (117), but two of them have the value 2014. Time to delve deeper into the nodes under PubDate.

children <- xpathSApply(doc, "//PubDate", xmlChildren)
table(names(unlist(children)))

#         Day MedlineDate       Month        Year
#          25           4          87         113

table(xpathSApply(doc, "//PubDate/MedlineDate", xmlValue))

# 2013 Jan-Mar 2013 May-Jun 2013 Nov-Dec 2013 Oct-Dec
#            1            1            1            1


Interesting. So 4 records have a node named //PubDate/MedlineDate instead of //PubDate/Year, which accounts for the missing years: 113 + 4 = 117.
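Since a PubDate holds either a Year or a MedlineDate, one way to get a year for every record is to select both with an XPath union and keep the first four characters. What follows is only a minimal sketch on an invented toy document (the PMIDs and dates are made up), and it assumes that MedlineDate always starts with a four-digit year, which holds for the four values above but is not guaranteed.

```r
library(XML)

# Invented toy document: PMIDs and dates are made up for illustration.
toy <- '<PubmedArticleSet>
  <PubmedArticle><PMID>1</PMID><PubDate><Year>2013</Year><Month>Apr</Month></PubDate></PubmedArticle>
  <PubmedArticle><PMID>2</PMID><PubDate><MedlineDate>2013 Jan-Mar</MedlineDate></PubDate></PubmedArticle>
  <PubmedArticle><PMID>3</PMID><PubDate><Year>2014</Year><Month>Jan</Month></PubDate></PubmedArticle>
</PubmedArticleSet>'
toy.doc <- xmlParse(toy, asText = TRUE)

# Select whichever of Year or MedlineDate is present under each PubDate.
raw <- xpathSApply(toy.doc, "//PubDate/Year | //PubDate/MedlineDate", xmlValue)

# Assumption: the value begins with a four-digit year.
years <- substr(raw, 1, 4)
table(years)
```

Applied to the real doc object, the same XPath expression should return one value per record, so the total can be compared with es$count.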

It’s also possible to retrieve records in docsum format, which is also XML but with a different structure. Here, PubDate is the value of the Name attribute of an Item node.

ds <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                   rettype = "docsum", retmax = es$count)
ds.doc <- xmlTreeParse(ds, useInternalNodes = TRUE)
table(xpathSApply(ds.doc, "//Item[@Name='PubDate']", xmlValue))

#         2013     2013 Apr   2013 Apr 1   2013 Apr 2     2013 Aug  2013 Aug 15  2013 Aug 29
#           23            7            1            1            2            2            1
#     2013 Dec   2013 Dec 1     2013 Feb  2013 Feb 26   2013 Feb 7     2013 Jan  2013 Jan 24
#            3            1            6            1            1           10            2
#   2013 Jan 3  2013 Jan 30   2013 Jan 7 2013 Jan-Mar     2013 Jul  2013 Jul 25     2013 Jun
#            1            1            1            1            4            1            3
#  2013 Jun 18   2013 Jun 5   2013 Jun 7     2013 Mar   2013 Mar 1  2013 Mar 12  2013 Mar 28
#            1            1            1            5            1            1            1
#   2013 Mar 9     2013 May   2013 May 1  2013 May 29   2013 May 6   2013 May 8   2013 May 9
#            1            4            3            1            1            2            1
#     2013 Nov 2013 Nov-Dec     2013 Oct 2013 Oct-Dec     2013 Sep  2013 Sep 30     2014 Feb
#            8            1            2            1            5            1            1
#     2014 Jan
#            1


A fair old mix of formats in there then, and still the issue of the 2014 years when we searched for PDAT = 2013. We can split on space to get years:

yr <- xpathSApply(ds.doc, "//Item[@Name='PubDate']", function(x) strsplit(xmlValue(x), " ")[[1]][1])
which(yr == "2014")
# [1] 16 26
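If more than the year is wanted, the same split can be pushed a little further. The helper below is a hypothetical sketch (parse_pubdate is not part of rentrez or XML) that handles only the four shapes seen in the table above: year, year month, year month day, and year with a month range, where it keeps the first month of the range.

```r
# Hypothetical helper: split docsum PubDate strings into year, month, day.
# Missing parts become NA; a range such as "Jan-Mar" is reduced to "Jan".
parse_pubdate <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)
  year  <- vapply(parts, function(p) p[1], character(1))
  month <- vapply(parts, function(p) {
    if (length(p) >= 2) sub("-.*$", "", p[2]) else NA_character_
  }, character(1))
  day   <- vapply(parts, function(p) {
    if (length(p) >= 3) p[3] else NA_character_
  }, character(1))
  data.frame(year  = as.integer(year),
             month = match(month, month.abb),  # "Apr" becomes 4
             day   = as.integer(day),
             stringsAsFactors = FALSE)
}

parsed <- parse_pubdate(c("2013", "2013 Apr 1", "2013 Jan-Mar", "2014 Feb"))
parsed
```

Anything outside those four shapes (a season name, say) would come back with month NA, so it is worth checking for NA values before relying on the result.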


And examine records 16 and 26:

xmlRoot(ds.doc)[[16]] # complete output not shown
# <DocSum>
#   <Id>24156249</Id>
#   <Item Name="PubDate" Type="Date">2014 Jan</Item>
#   <Item Name="EPubDate" Type="Date">2013 Oct 25</Item>

xmlRoot(ds.doc)[[26]] # complete output not shown
# <DocSum>
#   <Id>24001238</Id>
#   <Item Name="PubDate" Type="Date">2014 Feb</Item>
#   <Item Name="EPubDate" Type="Date">2013 Sep 4</Item>


Not every record has EPubDate. Is it simply the case that where it exists and is earlier than PubDate, then EPubDate == PDAT?
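To look at that question across all records rather than just two, PubDate and EPubDate can be pulled out side by side. This is a rough sketch on a toy docsum built from the two records shown above plus one invented record (Id 00000000) with no EPubDate; item_value is a hypothetical helper, and the eSummaryResult root element is an assumption about the docsum layout. It treats a missing Item and an empty Item the same way, since it is not clear from the output above which of the two occurs.

```r
library(XML)

# Toy docsum: first two records copied from the output above,
# third record (Id 00000000) is invented and lacks an EPubDate.
toy <- '<eSummaryResult>
  <DocSum><Id>24156249</Id>
    <Item Name="PubDate" Type="Date">2014 Jan</Item>
    <Item Name="EPubDate" Type="Date">2013 Oct 25</Item></DocSum>
  <DocSum><Id>24001238</Id>
    <Item Name="PubDate" Type="Date">2014 Feb</Item>
    <Item Name="EPubDate" Type="Date">2013 Sep 4</Item></DocSum>
  <DocSum><Id>00000000</Id>
    <Item Name="PubDate" Type="Date">2013 Apr</Item></DocSum>
</eSummaryResult>'
ds.toy <- xmlParse(toy, asText = TRUE)

# Hypothetical helper: value of the named Item under one DocSum,
# or NA when the Item is absent or empty.
item_value <- function(node, name) {
  hit <- getNodeSet(node, sprintf("./Item[@Name='%s']", name))
  if (length(hit) == 0) return(NA_character_)
  v <- xmlValue(hit[[1]])
  if (length(v) == 0 || !nzchar(v)) NA_character_ else v
}

dates <- do.call(rbind, lapply(getNodeSet(ds.toy, "//DocSum"), function(d) {
  data.frame(id       = xmlValue(getNodeSet(d, "./Id")[[1]]),
             pubdate  = item_value(d, "PubDate"),
             epubdate = item_value(d, "EPubDate"),
             stringsAsFactors = FALSE)
}))
dates
```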

So we haven’t really resolved very much, have we?

• we started with the Entrez search term PDAT (Date of publication)
• both PubMed XML and DocSum contain something called PubDate
• in the former case, most child node names = Year, but some = MedlineDate
• we retrieve some records where PubDate year = 2014, even when searching for 2013[PDAT]

It appears that PDAT does not map consistently to any single node in either the PubMed XML or the DocSum format. It might be derived from (1) EPubDate, where that exists and is earlier than PubDate, or (2) PubDate, where EPubDate does not exist.
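That guess can be written down as a small rule, if only to test it against more records. The sketch below is just that: a guess, not a documented NCBI behaviour. Both function names are hypothetical, and missing months or days are filled in with 1 purely so that partial dates can be compared.

```r
# Hypothetical helper: turn "2013", "2013 Apr", "2013 Apr 1" or
# "2013 Jan-Mar" into a Date. Missing or unrecognised month and
# missing day default to 1 (an assumption made only for comparison).
as_partial_date <- function(x) {
  p     <- strsplit(x, " ", fixed = TRUE)[[1]]
  year  <- as.integer(p[1])
  month <- if (length(p) >= 2) match(sub("-.*$", "", p[2]), month.abb) else 1L
  if (is.na(month)) month <- 1L
  day   <- if (length(p) >= 3) as.integer(p[3]) else 1L
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

# Guessed rule: use EPubDate when it exists and is earlier than PubDate,
# otherwise use PubDate.
guess_pdat_year <- function(pubdate, epubdate = NA_character_) {
  pub <- as_partial_date(pubdate)
  if (!is.na(epubdate) && nzchar(epubdate)) {
    epub <- as_partial_date(epubdate)
    if (epub < pub) return(as.integer(format(epub, "%Y")))
  }
  as.integer(format(pub, "%Y"))
}

# The two 2014 records shown earlier, plus a record with no EPubDate.
guess_pdat_year("2014 Jan", "2013 Oct 25")
guess_pdat_year("2014 Feb", "2013 Sep 4")
guess_pdat_year("2013 Apr")
```

If the rule is right, every record returned by the 2013[PDAT] search should come out as 2013; any that do not would show where the guess breaks down.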
