Searching for duplicate resource names in PMC article titles

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software.

I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:

eDuS: Segmental Duplication Simulator
Reveel: large-scale population genotyping using low-coverage sequencing data
RNF: a general framework to evaluate NGS read mappers
Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets

You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”
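As a rough illustration, the pattern can be written as a regular expression. This is only a sketch: the method name split_announcement is made up for this example, and requiring whitespace after the colon is my own assumption rather than something the pipeline below relies on.

```ruby
# Sketch: split a title at its first "announcement colon".
# split_announcement is an illustrative name, not part of the project code.
def split_announcement(title)
  # text before the first colon, then whitespace, then the rest of the title
  m = /\A([^:]+?):\s+(.+)\z/m.match(title)
  m && [m[1].strip, m[2].strip]
end

titles = [
  "eDuS: Segmental Duplication Simulator",
  "RNF: a general framework to evaluate NGS read mappers",
  "A title without any colon"
]

titles.each do |t|
  name, rest = split_announcement(t)
  puts name ? "#{name}\t#{rest}" : "(no colon)\t#{t}"
end
```

Titles without a colon return nil, so they are easy to skip.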

Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project on GitHub.

1. Download PMC data
I use wget. The compressed archives are still quite large (~ 3-5 GB), so this may take some time.

wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz

find ./ -name "*.tar.gz" -exec tar zxvf {} \;

2. Parse the titles
Now, of course there will be many article titles that contain a colon and have nothing to do with software names. We’ll worry about that later when we start counting things.

Quick and dirty Ruby code to 1. open and parse a PMC XML file; 2. extract PMC ID and title; 3. print out those titles starting with “anything followed by a colon”. It’s not the best way to generate tab-delimited output, but it works. Note that titles in PMC XML can contain line breaks, which we need to remove (by replacing with a space). The output file has 3 columns: PMC uid, the part of the title preceding the colon (we’ll call that the “pretitle”), and the full title.

#!/usr/bin/ruby

require "nokogiri"

f   = File.open(ARGV[0])
doc = Nokogiri::XML(f)
f.close

ameta  = doc.xpath("//article/front/article-meta")
pmc    = ameta.xpath("//article-id[@pub-id-type='pmc']").text.chomp
title  = ameta.xpath("//title-group/article-title").text.chomp

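# print only titles that contain a colon; $1 is the text before the first colon
if title =~ /^(.*?):/
  # replace line breaks in the title with spaces, then emit one tab-delimited row
  r = [pmc, $1, title.gsub("\n", " ")]
  puts r.join("\t")
end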

We can make that much quicker using GNU parallel. Assuming that the XML files were extracted into directory pmc under the current working directory:

find ./pmc -name "*.nxml" | parallel ./pmc2title.rb {} > pmctitles.tsv
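The Ruby script above joins fields with a tab, which would break if a field itself contained a tab or a stray line break. Here is a minimal sketch of a slightly more defensive approach, using only core Ruby. The helper name tsv_row and the ID PMC0000000 are hypothetical, and none of this is part of pmc2title.rb.

```ruby
# Sketch: build one tab-delimited row, collapsing any run of whitespace
# (tabs, line breaks, repeated spaces) inside a field to a single space.
# tsv_row is a hypothetical helper, not part of the original script.
def tsv_row(fields)
  fields.map { |f| f.to_s.gsub(/\s+/, " ").strip }.join("\t")
end

# an invented title containing a line break and a tab
title    = "Hammock:\n  a hidden Markov model-based\tpeptide clustering algorithm"
pretitle = title[/\A(.*?):/m, 1]   # text before the first colon

# PMC0000000 is a placeholder ID
puts tsv_row(["PMC0000000", pretitle, title])
```

Cleaning every field, not just the title, keeps the output at exactly 3 columns for read.delim.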

3. Count the duplicate terms
Now we have something that R can read easily. As ever, some cleaning is necessary.

  1. the pretitle is converted to lower case, for counting
  2. the PMC dataset contains duplicate records, which can be removed using the UID
  3. after summing the pretitles, we select only those that occur 2 or more times and order by frequency

ti <- read.delim("pmctitles.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(ti)    <- c("uid", "pretitle", "title")
ti$pretitle.low <- tolower(ti$pretitle)

ti.uniq <- ti[!duplicated(ti[, "uid"]), ]
ti.cnt  <- as.data.frame(table(ti.uniq$pretitle.low), stringsAsFactors = FALSE)
ti.cnt  <- subset(ti.cnt, Freq > 1)
ti.cnt  <- ti.cnt[order(ti.cnt$Freq, decreasing = TRUE), ]
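For anyone who would rather stay in Ruby, the same cleaning steps can be sketched on a few made-up rows. The rows, IDs and variable names here are invented for illustration; this is not the code that produced the numbers in this post.

```ruby
# Sketch of the cleaning steps on invented data: [uid, pretitle, title].
rows = [
  ["PMC1", "SNAP",   "SNAP: a made-up aligner"],
  ["PMC1", "SNAP",   "SNAP: a made-up aligner"],        # duplicate record
  ["PMC2", "Snap",   "Snap: a made-up annotation tool"],
  ["PMC3", "Medusa", "Medusa: a made-up visualiser"]
]

# drop duplicate records by UID (first occurrence wins)
uniq = rows.uniq { |uid, _pretitle, _title| uid }

# count lower-cased pretitles (Enumerable#tally needs Ruby 2.7 or later)
counts = uniq.map { |_uid, pretitle, _title| pretitle.downcase }.tally

# keep pretitles seen 2 or more times, most frequent first
dups = counts.select { |_name, n| n > 1 }.sort_by { |_name, n| -n }

dups.each { |name, n| puts "#{name}\t#{n}" }
```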

There are quite a few duplicated pretitles – too many to inspect quickly.

nrow(ti.cnt)
[1] 3318

So let’s assume, as is often the case, that software articles usually have one word before the colon and that word is the software name. Of course, there will be many instances where the word is not a software name. Let’s also assume that duplicate software names are unlikely to occur very many times: certainly no more than 10 and perhaps fewer than 5.

# keep single-word pretitles; grepl() stays safe even if nothing contains a space
ti.one  <- ti.cnt[!grepl(" ", ti.cnt$Var1), ]

nrow(ti.one)
[1] 740

ti.one10 <- subset(ti.one, Freq < 11)

# most duplicates occur 2-3 times
table(ti.one10$Freq)

  2   3   4   5   6   7   8   9  10 
476 120  43  24  19   6   6   3   1

All that remains is to match the pretitles in ti.one10 with those in ti.uniq, write out the results and stare at them.

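# keep the de-duplicated records whose pretitle is one of the candidates
ti.in <- ti.uniq[ti.uniq$pretitle.low %in% ti.one10$Var1, ]
ti.in <- ti.in[order(ti.in$pretitle.low), ]
write.table(ti.in, file = "pmctitles_matched.tsv", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)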
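The filtering and matching logic can also be sketched in Ruby, again on invented data. The counts, records and variable names below are all hypothetical; only the cut-off of 10 comes from the R code above.

```ruby
# Sketch: keep one-word pretitles with a modest duplicate count, then pull
# the matching records back out. All data and names here are invented.
counts = { "snap" => 2, "a review" => 40, "background" => 25, "medusa" => 3 }

records = [
  { uid: "PMC1", pretitle: "SNAP",       title: "SNAP: a made-up aligner" },
  { uid: "PMC2", pretitle: "Background", title: "Background: not software" },
  { uid: "PMC3", pretitle: "Medusa",     title: "Medusa: a made-up viewer" }
]

max_freq = 10  # same cut-off as Freq < 11 in the R code

# one word only, seen at least twice but no more than max_freq times
candidates = counts.select do |name, n|
  !name.include?(" ") && n.between?(2, max_freq)
end

# match candidates back to the records and sort by lower-cased pretitle
matched = records
  .select  { |r| candidates.key?(r[:pretitle].downcase) }
  .sort_by { |r| r[:pretitle].downcase }

matched.each { |r| puts r.values_at(:uid, :pretitle, :title).join("\t") }
```

Common section-style words such as "background" drop out here only because they occur too often, which is why the frequency cap matters as much as the one-word rule.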

4. Did we find anything?
Sure did. There comes a point where manual curation is unavoidable so – here is the file of candidate duplicate names for software or computational resources. Note: there may be cases where the name is something else, such as a clinical trial or protocol.

Some were identified previously by Keith: comet, muscle, snap, medusa. Many from his list are missing, meaning either that the duplicated names are not in the PMC data or the procedure to extract them failed.

Plenty of new entries. Tempting to use “SNiPer” or “SNaP” for your SNPs, but think twice. Likewise, VIPR (3 entries). Who’d have thought that there’d be two unrelated COMBREX? And even venerable workflow framework Taverna has a competitor.

As Keith said, the take-home message is simply: do your research before you name things.


