Interestingly: the sentence adverbs of PubMed Central
Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres.
I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the start of a sentence, as in these examples:
Surprisingly, we find that the execution of karyokinesis and cytokinesis is timely…
Grossly, the tumor is well circumscribed with fibrous capsule…
Correspondingly, the short-term Smad7 gene expression is graded…
The example that always makes me smile is interestingly. “This is interesting. You may not have realised that. So I said interestingly, just to make it clear.”
With that in mind, let’s go looking for sentence adverbs in article abstracts.
1. Download PubMed Central
Code and data for this post can be found at GitHub.
We need abstracts. One source of these is the PubMed Central (PMC) archive at the NCBI FTP site. Create a directory on your system to hold the files (e.g. db/pmc), move to it and:
    wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
    wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
    wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
    wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
    find ./ -name "*.tar.gz" -exec tar zxvf {} \;
Caution: the compressed archives are 1.5 – 2.5 GB and will take some time to download. They uncompress to about 47 GB of storage (at the time of writing).
2. Extract the sentence adverbs
The PMC files are in XML format. I’m not especially expert in XML parsing but when required, I reach for Ruby’s nokogiri. Here’s the complete code:
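    #!/usr/bin/ruby
    require "nokogiri"

    # Parse the NXML file given as the first argument
    f = File.open(ARGV[0])
    doc = Nokogiri::XML(f)
    f.close

    # Pull out the PMC uid, publication years, abstract, title and journal.
    # Note: XPath expressions that begin with // search the whole document,
    # even when called on the ameta node set.
    ameta = doc.xpath("//article/front/article-meta")
    pmc   = ameta.xpath("//article-id[@pub-id-type='pmc']").text.chomp
    epub  = ameta.xpath("//pub-date[@pub-type='epub']/year").text.chomp
    ppub  = ameta.xpath("//pub-date[@pub-type='ppub']/year").text.chomp
    abs   = ameta.xpath("//abstract").text.chomp
    titl  = ameta.xpath("//title-group/article-title").text.chomp
    jour  = doc.xpath("//article/front/journal-meta/journal-id[@journal-id-type='nlm-ta']").text.chomp

    # Find capitalised words ending in "ly," and print one CSV record per match
    abs.scan(/[A-Z][a-z]+ly,/m) {
      b = $~.begin(0)          # start position of the match in the abstract
      e = $~.end(0)            # end position (just past the comma)
      a = $&.gsub(",", "")     # the adverb without its comma
      o = [pmc, jour.downcase, epub, ppub, a, b, e]
      puts o.join(",")
    }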
The code uses XPath expressions to pull out abstracts and some other metadata from the PMC NXML files. It then searches the abstract for words that end with “ly,”. For reasons to be explained later, journal abbreviations are converted to lower-case and the position of the match is also recorded.
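To see the matching step on its own, here is a minimal sketch that applies the same regular expression to a plain string, with no XML involved. The helper name find_sentence_adverbs and the sample text are made up for illustration; only the pattern comes from the script above.

```ruby
# Hypothetical helper (not part of pmcly.rb): return [adverb, start, end]
# for every capitalised word ending in "ly" that is followed by a comma.
def find_sentence_adverbs(text)
  matches = []
  text.scan(/[A-Z][a-z]+ly,/) do
    m = Regexp.last_match
    # Drop the comma from the word, keep the offsets of the full match.
    matches << [m[0].delete(","), m.begin(0), m.end(0)]
  end
  matches
end

# Made-up abstract text, including one false positive (July).
sample = "Recently, we studied X. Interestingly, Y was found in July, 2010."
find_sentence_adverbs(sample).each do |adverb, b, e|
  puts [adverb, b, e].join(",")
end
```

The end position is the offset just past the comma, so a 13-letter adverb spans 14 characters.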
Save it as pmcly.rb, make it executable (chmod +x pmcly.rb) and, from the directory that contains it, run it on the directory of PMC files like so:
    find db/pmc -name "*.nxml" -exec ./pmcly.rb {} \; > adverbs.csv
This takes some time – overnight on my home machine. The result (first 10 lines of 92 940):
    2751461,aaps j,2008,NA,Alternatively,350,364
    2751463,aaps j,2008,NA,Unfortunately,315,329
    2844510,aaps j,2010,NA,Generally,125,135
    2751449,aaps j,2008,NA,Ideally,581,589
    2751387,aaps j,2008,NA,Finally,1376,1384
    2976997,aaps j,2010,NA,Finally,851,859
    2976990,aaps j,2010,NA,Significantly,460,474
    2751391,aaps j,2008,NA,Clearly,712,720
    2751391,aaps j,2008,NA,Importantly,1175,1187
    3385822,aaps j,2012,NA,Additionally,798,811
There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially”. These examples are uncommon and I just ignore them where they occur. There will also be sentence adverbs that do not include the comma but for now, I’m restricting the analysis to the “more dramatic” form, with commas.
3. Clean up the mess
Skip to the last sentence in this section if you want to avoid the tedious details of data cleaning.
You’ll note that in the Ruby code, extracted text was passed through chomp to remove end of line characters. This was not the case originally, since I was not expecting to find line breaks in the tag contents. However, it’s always good to check that the lines in your CSV file contain the expected number of fields (7):
    which(count.fields("data/adverbs.csv", sep = ",") != 7)
     [1] 55294 55295 55309 55310 55311 55312 55313 55314 55332 55333 55410 55411
    [13] 55463 55464 55523 55524 55525 55526 55665 55666 55707 55708 56183 56184
    [25] 56200 56201 56263 56264 57923 57924 57925 57926 57927 57928 57929 57930
    [37] 57931 57932 57933 57934 57935 57936 57937 57938 57939 57940 57941 57942
    [49] 57943 57944 57945 57946 57947 57948 57949 57950
Oh dear. Closer inspection indicates that we are looking at 28 pairs of adjacent lines. A quick look at the first 2 such lines in the CSV file:
    tail -n+55294 adverbs.csv | head -2
    540057,Nucleic Acids Res,2004
    ,2005,Alternatively,946,960
Seems like our 7 fields are being broken by an unexpected new line. Surely, PubMed Central, you would not allow a line break in the pub-date/epub/year field? Let’s look at PMC 540057:
    <pub-date pub-type="epub">
      <day>17</day>
      <month>12</month>
      <year>2004
    </year>
You would. OK then.
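In case it helps, this is roughly what that field looks like to Ruby once the parser hands it over, and how chomp and strip treat it. The clean_field helper is hypothetical, and the second string (a line break followed by indentation) is an assumed variant, not something I saw in the data.

```ruby
# Hypothetical helper: remove all leading and trailing whitespace from a field.
def clean_field(text)
  text.strip
end

# Text of the <year> element in PMC 540057: the digits plus the line break
# that precedes </year>.
year = "2004\n"
p year.chomp          # chomp drops one trailing line ending
p clean_field(year)   # strip gives the same result here

# Assumed variant: line break followed by indentation. chomp leaves this
# untouched because the string no longer ends in a line ending.
indented = "2004\n    "
p indented.chomp
p clean_field(indented)
```

For fields like these, strip is arguably the safer choice, since it does not depend on the line break being the very last character.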
Rather than run the code again, I wrote an ugly fix in R. It identifies the offending lines and pastes them back together. When that’s done, it figures out unique occurrences of sentence adverbs by assuming that additional records with the same PMC uid, adverb and start/end position in an abstract are duplicates. That last step is required because there are files with duplicate contents in the PMC dataset; something that I only discovered when summing adverbs per PMC uid and realising that some occur twice.
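This is not the R fix itself. Purely as an illustration of the two steps (rejoin, then de-duplicate), a Ruby version might look like the sketch below. It assumes every broken record spans exactly two adjacent lines, as seemed to be the case here, and the method names are mine.

```ruby
EXPECTED_FIELDS = 7

# Rejoin records that were split in two by a stray line break.
# Assumption: a broken record always occupies exactly two adjacent lines.
# A dangling first half at the very end of the input is dropped.
def rejoin_broken_lines(lines)
  fixed = []
  pending = nil
  lines.each do |line|
    line = line.chomp
    if pending
      fixed << pending + line   # second half: glue it onto the first
      pending = nil
    elsif line.split(",", -1).length == EXPECTED_FIELDS
      fixed << line             # complete record
    else
      pending = line            # first half of a broken record
    end
  end
  fixed
end

# Keep the first record for each (PMC uid, adverb, start, end) combination.
def unique_records(lines)
  lines.uniq do |line|
    pmc, _jour, _epub, _ppub, adverb, b, e = line.split(",", -1)
    [pmc, adverb, b, e]
  end
end

# The broken pair from PMC 540057, plus a duplicated record.
raw = [
  "2751387,aaps j,2008,NA,Finally,1376,1384\n",
  "540057,Nucleic Acids Res,2004\n",
  ",2005,Alternatively,946,960\n",
  "2751387,aaps j,2008,NA,Finally,1376,1384\n"
]
puts unique_records(rejoin_broken_lines(raw))
```

To run it on the real file, something like File.readlines("adverbs.csv") could be passed in place of raw.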
Ultimately we end up with a CSV file, adverbs.uniq.csv, containing 91 618 unique and cleaned records.
4. Analysis
R code for analysing the adverbs is in the file adverbs.R. It contains 4 functions: for counting adverbs, plotting the top 20 as a bar plot, visualizing the top 100 as a word cloud and examining adverb occurrence by journal.
4.1 Top 20 adverbs
Let’s start with a basic “top N” list. Plot on the left, word cloud on the right.
It seems that the most popular use of the sentence adverb is to draw a close to the proceedings, with finally. The next most common uses: to indicate further points of interest (additionally), label results as interesting (interestingly) or important (importantly) and show that the authors are up to date with their reading (recently).
Findings are frequently surprising, but more often unfortunate than remarkable.
Altogether, there are 710 unique words in the adverb column from 91 618 records. Note that this includes the false positives mentioned previously, including a number of typographical errors. For example, we find both phylogenitically and phylogentically, but not the correct term, phylogenetically. Similarly, (see what I did there?) both intriguinly and intringuingly are present, but not the correct intriguingly.
4.2 Top 20 opening adverbs
If you’re anything like me, you have sat in front of a blank screen searching for that brilliant opening sentence for an article…and then typed:
“Recently,”
What inspiration did the authors of PMC find?
Same code, but applied to the subset of adverbs where match start = 0, i.e. the opening sentence of the abstract.
Looks like I’m not the only one who can’t do any better than recently. Authors also open by referring to previous work or assessing the current state of play.
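The counting behind both top-20 lists is simple enough to sketch. The real analysis is done in R (adverbs.R); the Ruby below is only an illustration, with a hypothetical method name and made-up records in the same 7-field format as adverbs.csv.

```ruby
# Tally adverbs from 7-field CSV lines (pmc, journal, epub, ppub, adverb, start, end).
# With opening_only: true, keep only matches that start at position 0,
# i.e. adverbs that open the abstract.
def top_adverbs(lines, n = 20, opening_only: false)
  counts = Hash.new(0)
  lines.each do |line|
    fields = line.chomp.split(",", -1)
    next unless fields.length == 7          # skip malformed records
    next if opening_only && fields[5] != "0"
    counts[fields[4]] += 1
  end
  # Most frequent first; ties broken alphabetically for a stable order.
  counts.sort_by { |adverb, count| [-count, adverb] }.first(n)
end

# Made-up records for illustration.
sample = [
  "1,aaps j,2008,NA,Recently,0,9",
  "2,aaps j,2008,NA,Finally,851,859",
  "3,aaps j,2010,NA,Finally,1376,1384",
  "4,plos one,2011,NA,Recently,0,9",
  "5,plos one,2012,NA,Traditionally,0,14"
]

puts "All adverbs:"
top_adverbs(sample, 3).each { |adverb, count| puts "#{adverb}\t#{count}" }
puts "Opening adverbs:"
top_adverbs(sample, 3, opening_only: true).each { |adverb, count| puts "#{adverb}\t#{count}" }
```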
4.3 The bad and the ugly
It’s possible to create all manner of adverbs by sticking -ly on the ends of things. This is rarely a good idea.
You might imagine that very long adverbs stand a good chance of being ugly. Let’s find the longest:
    longest <- subset(adverbs.uniq, nchar(adverbs.uniq$adv) == max(nchar(adverbs.uniq$adv)))
    longest$adv
    # [1] "Electronmicroscopically"
You’d be correct.
You might also imagine that rarely-used adverbs stand a good chance of being ugly:
rarest <- subset(adverbs.freq, Freq == 1)
270 sentence adverbs appear only once in the dataset, so we’ll leave that as an “exercise for the reader”. Suffice to say that some of them include: endosonographically, ethnopharmacologically, ophthalmoscopically and tinctorially.
4.4 Adverbs by journal
It would be interesting to know whether adverbs are over-represented in particular journals. Simply counting the adverb by journal is no good, since some journals have many more articles in PMC than others. PLoS ONE, for example, accounts for almost 20% of all sentence adverb records:
    head(jour[ order(jour$Freq, decreasing = T),])
    #                   Var1  Freq
    # 2036          plos one 18285
    # 1359       j cell biol  3053
    # 1428         j exp med  2892
    # 1897 nucleic acids res  2531
    # 2033        plos genet  2076
    # 2037       plos pathog  2043
Incidentally, this calculation is why we converted journal abbreviations to lower-case. PLoS ONE is also listed as PLoS One in PMC, but we want to count those names as one journal, hence plos one.
So, we need some sort of adjustment for total articles. I’ve gone with “occurrences per 100 adverbs.” That is: for every 100 sentence adverbs found in abstracts from a journal, how many occurrences of a particular adverb do we see? Furthermore, I’ve applied this metric to only those journals that have 100 or more sentence adverbs in the PMC dataset.
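Put as a formula, the index for an adverb in a journal is 100 × (occurrences of that adverb in the journal) / (all sentence adverbs in the journal), computed only for journals with at least 100 sentence adverbs. The Ruby below is a sketch of that calculation, not the advIndex function from adverbs.R; the method name, argument layout and sample data are assumptions for illustration.

```ruby
MIN_ADVERBS = 100

# records: array of [journal, adverb] pairs, one per sentence adverb found.
# Returns [journal, count, index] rows, highest index first, where
# index = 100 * count / total sentence adverbs in that journal.
# Journals with fewer than min_total sentence adverbs are left out.
def adverb_index(records, adverb, min_total: MIN_ADVERBS)
  totals = Hash.new(0)
  hits = Hash.new(0)
  records.each do |journal, adv|
    totals[journal] += 1
    hits[journal] += 1 if adv == adverb
  end
  rows = totals.select { |_journal, total| total >= min_total }.map do |journal, total|
    [journal, hits[journal], 100.0 * hits[journal] / total]
  end
  rows.sort_by { |_journal, _count, index| -index }
end

# Tiny made-up data set, so the threshold is lowered to 2 for the demo.
records = [
  ["plos biol", "Surprisingly"], ["plos biol", "Finally"],
  ["plos biol", "Finally"], ["plos biol", "Recently"],
  ["nature", "Surprisingly"], ["nature", "Finally"],
  ["tiny j", "Surprisingly"]
]
adverb_index(records, "Surprisingly", min_total: 2).each do |journal, count, index|
  puts format("%-10s %3d %6.2f", journal, count, index)
end
```

Dividing by the journal's own adverb total, rather than by its article count, means the index measures which adverbs a journal's authors favour, not how often they use sentence adverbs at all.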
Let’s start with the surprising index.
    surprising <- advIndex("Surprisingly", adverbs.jour, jour)
    head(surprising[, 2:4], 10)
    #                Var2 Freq         a
    # 1440517   plos biol   94 11.032864
    # 1443357  plos genet  191  9.200385
    # 1521457     sci rep   26  8.965517
    # 1312007      nature   24  8.695652
    # 1302777  nat commun    9  8.411215
    # 964817  j cell biol  250  8.188667
    # 208667     bmc biol   12  7.500000
    # 1446197 plos pathog  151  7.391092
    # 1013807   j exp med  194  6.708160
    # 940677  j biol chem   29  6.487696
The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.
How about interesting?
    head(interesting[, 2:4], 10)
    #                     Var2 Freq        a
    # 225387       bmc immunol   32 20.91503
    # 945327      j biomed sci   19 19.00000
    # 1010647        j exp bot   40 18.26484
    # 1278317 mol neurodegener   22 18.18182
    # 1616277          virol j   82 18.14159
    # 1262697       mol cancer   81 17.96009
    # 806167    int j biol sci   23 17.69231
    # 1435937    plant physiol   19 17.27273
    # 233907     bmc microbiol   83 16.50099
    # 1280447         mol pain   26 16.04938
No clear winner there. Anyone for remarkable results?
    head(remarkable[, 2:4], 10)
    #                     Var2 Freq        a
    # 1311933           nature   24 8.695652
    # 1440443        plos biol   36 4.225352
    # 686423       genome biol   15 4.000000
    # 687133  genome biol evol    5 3.937008
    # 1284243    mol syst biol    9 3.422053
    # 1502923    retrovirology   10 3.246753
    # 1443283       plos genet   67 3.227360
    # 1521383          sci rep    8 2.758621
    # 740383     hum mol genet    3 2.702703
    # 545133      embo mol med    3 2.631579
Save your most remarkable results for submission to Nature.
Finally, what if it’s all a bit unfortunate?
    #                           Var2 Freq        a
    # 1444113               plos med  106 8.870293
    # 168953          bioinformatics   14 7.329843
    # 1600313                 trials    7 6.930693
    # 210133          bmc biotechnol   12 6.451613
    # 1072783     j med case reports   12 6.030151
    # 208003      bmc bioinformatics   74 5.501859
    # 252023           bmc syst biol   17 5.396825
    # 1444823     plos negl trop dis   34 5.082212
    # 1610963 vasc health risk manag    5 4.901961
    # 186703       biomed eng online    5 4.761905
I’m a little surprised that bioinformatics features so prominently. What could be so unfortunate? The results of applying your method to real data, as opposed to simulated or training data?
4.5 Most sentence adverbs in an abstract
Who got a bit carried away with the sentence adverbs? Easy to find out by counting up the PMC uids:
    pmc <- as.data.frame(table(adverbs.uniq$pmc))
    pmc <- pmc[ order(pmc$Freq, decreasing = T),]
    head(pmc)
    #          Var1 Freq
    # 11769 2214797    7
    # 27434 2873921    7
    # 39930 3130555    7
    # 42598 3173291    7
    # 49235 3280967    7
    # 13      17814    6
There are five abstracts that contain 7 sentence adverbs. Going back to the PMC website, we see that these are:
- IFN-γ Mediates the Rejection of Haematopoietic Stem Cells in IFN-γR1-Deficient Hosts (the editors’ summary is to blame for that one)
- Leishmania donovani Isolates with Antimony-Resistant but Not -Sensitive Phenotype Inhibit Sodium Antimony Gluconate-Induced Dendritic Cell Activation
- Quantitative analysis of transient and sustained transforming growth factor-β signaling dynamics (oddly, far fewer occurrences in the current online version)
- Review of juxtaglomerular cell tumor with focus on pathobiological aspect
- Nondisjunction of a Single Chromosome Leads to Breakage and Activation of DNA Damage Checkpoint in G2
Summary
As usual, this analysis is intended to be a bit of fun. Next time you’re writing that article though, ask yourself: is that sentence enhanced by the sentence adverb? Or are you simply following convention?
Filed under: R, research diary, ruby, statistics Tagged: adverbs, pubmed central, text-mining