Interestingly: the sentence adverbs of PubMed Central

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres.

I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the start of a sentence, as in these examples:

Surprisingly, we find that the execution of karyokinesis and cytokinesis is timely…
Grossly, the tumor is well circumscribed with fibrous capsule…
Correspondingly, the short-term Smad7 gene expression is graded…

The example that always makes me smile is interestingly. “This is interesting. You may not have realised that. So I said interestingly, just to make it clear.”

With that in mind, let’s go looking for sentence adverbs in article abstracts.

1. Download PubMed Central
Code and data for this post can be found at GitHub.

We need abstracts. One source of these is the PubMed Central (PMC) archive at the NCBI FTP site. Create a directory on your system to hold the files (e.g. db/pmc), move to it and:

find ./ -name "*.tar.gz" -exec tar zxvf {} \;

Caution: the compressed archives are 1.5 – 2.5 GB and will take some time to download. They uncompress to about 47 GB of storage (at the time of writing).

2. Extract the sentence adverbs
The PMC files are in XML format. I’m not especially expert in XML parsing but when required, I reach for Ruby’s nokogiri. Here’s the complete code:


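require "nokogiri"

# Parse the NXML file named as the first command-line argument
f   = File.open(ARGV[0])
doc = Nokogiri::XML(f)

# Pull the abstract and some metadata out of the article front matter
ameta = doc.xpath("//article/front/article-meta")
pmc   = ameta.xpath("//article-id[@pub-id-type='pmc']").text.chomp
epub  = ameta.xpath("//pub-date[@pub-type='epub']/year").text.chomp
ppub  = ameta.xpath("//pub-date[@pub-type='ppub']/year").text.chomp
abs   = ameta.xpath("//abstract").text.chomp
titl  = ameta.xpath("//title-group/article-title").text.chomp
jour  = doc.xpath("//article/front/journal-meta/journal-id[@journal-id-type='nlm-ta']").text.chomp

# Find capitalised words ending in "ly," and print one CSV record per match
abs.scan(/[A-Z][a-z]+ly,/m) {
  b = $~.begin(0)          # start position of the match in the abstract
  e = $~.end(0)            # end position of the match
  a = $&.gsub(",", "")     # the matched word without its comma
  o = [pmc, jour.downcase, epub, ppub, a, b, e]
  puts o.join(",")
}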

The code uses XPath expressions to pull out abstracts and some other metadata from the PMC NXML files. It then searches the abstract for words that end with “ly,”. For reasons to be explained later, journal abbreviations are converted to lower-case and the position of the match is also recorded.
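To see what the matching step does without any XML, here is a minimal sketch that runs the same regular expression over a plain string. The helper name sentence_adverbs and the sample sentence are made up for illustration; they are not part of pmcly.rb.

```ruby
# Hypothetical helper (not in pmcly.rb): returns [adverb, start, end]
# for every capitalised word ending in "ly," found in the text.
def sentence_adverbs(text)
  hits = []
  text.scan(/[A-Z][a-z]+ly,/) do
    # $~ holds the MatchData for the current match inside the scan block
    hits << [$~[0].delete(","), $~.begin(0), $~.end(0)]
  end
  hits
end

# Invented sample text, standing in for an abstract
sample = "Recently, we studied X. Surprisingly, it worked."
sentence_adverbs(sample).each { |adv, b, e| puts [adv, b, e].join(",") }
```

The end position includes the comma, which is why end minus start is always one more than the length of the adverb in the output that follows.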

Save it as pmcly.rb and run it on the directory of PMC files like so:

find db/pmc -name "*.nxml" -exec ruby pmcly.rb {} \; > adverbs.csv

This takes some time – overnight on my home machine. The result (first 10 lines of 92 940):

2751461,aaps j,2008,NA,Alternatively,350,364
2751463,aaps j,2008,NA,Unfortunately,315,329
2844510,aaps j,2010,NA,Generally,125,135
2751449,aaps j,2008,NA,Ideally,581,589
2751387,aaps j,2008,NA,Finally,1376,1384
2976997,aaps j,2010,NA,Finally,851,859
2976990,aaps j,2010,NA,Significantly,460,474
2751391,aaps j,2008,NA,Clearly,712,720
2751391,aaps j,2008,NA,Importantly,1175,1187
3385822,aaps j,2012,NA,Additionally,798,811

There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially”. These examples are uncommon and I just ignore them where they occur. There will also be sentence adverbs that do not include the comma but for now, I’m restricting the analysis to the “more dramatic” form, with commas.

3. Clean up the mess
Skip to the last sentence in this section if you want to avoid the tedious details of data cleaning.

You’ll note that in the Ruby code, extracted text was passed through chomp to remove end of line characters. This was not the case originally, since I was not expecting to find line breaks in the tag contents. However, it’s always good to check that the lines in your CSV file contain the expected number of fields (7):

which(count.fields("data/adverbs.csv", sep = ",") != 7)
 [1] 55294 55295 55309 55310 55311 55312 55313 55314 55332 55333 55410 55411
[13] 55463 55464 55523 55524 55525 55526 55665 55666 55707 55708 56183 56184
[25] 56200 56201 56263 56264 57923 57924 57925 57926 57927 57928 57929 57930
[37] 57931 57932 57933 57934 57935 57936 57937 57938 57939 57940 57941 57942
[49] 57943 57944 57945 57946 57947 57948 57949 57950
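The same sanity check could be done in Ruby rather than R. This is only a sketch: bad_lines is a hypothetical helper, the rows are invented toy data, and the naive split on commas assumes no field contains a quoted comma.

```ruby
# Hypothetical helper: 1-based numbers of the lines whose field count
# differs from the expected number (7 for adverbs.csv).
def bad_lines(lines, expected = 7)
  lines.each_with_index
       .reject { |line, _| line.chomp.split(",", -1).size == expected }
       .map { |_, i| i + 1 }
end

# Invented toy rows: one complete record, then one record split in two
rows = [
  "2751461,aaps j,2008,NA,Alternatively,350,364",
  "540057,nucleic acids res,2004",
  ",2005,Finally,100,108"
]
p bad_lines(rows)
```

On the real file, something like bad_lines(File.readlines("adverbs.csv")) should give the same kind of list as the R call.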

Oh dear. Closer inspection indicates that we are looking at 28 pairs of adjacent lines. A quick look at the first 2 such lines in the CSV file:

tail -n+55294 adverbs.csv | head -2
540057,Nucleic Acids Res,2004

Seems like our 7 fields are being broken by an unexpected new line. Surely, PubMed Central, you would not allow a line break in the pub-date/epub/year field? Let’s look at PMC 540057:

<pub-date pub-type="epub">

You would. OK then.

Rather than run the code again, I wrote an ugly fix in R. It identifies the offending lines and pastes them back together. When that’s done, it figures out unique occurrences of sentence adverbs by assuming that additional records with the same PMC uid, adverb and start/end position in an abstract are duplicates. That last step is required because there are files with duplicate contents in the PMC dataset; something that I only discovered when summing adverbs per PMC uid and realising that some occur twice.
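The actual fix was written in R and is not reproduced here. As a rough sketch of the same two ideas in Ruby, assuming a broken record is always split across consecutive lines and that no field contains a comma (repair and dedupe are hypothetical names, and the rows are invented):

```ruby
# Hypothetical helper: glue short lines onto the lines that follow them
# until the expected number of fields is reached.
def repair(lines, expected = 7)
  fixed = []
  pending = nil
  lines.each do |line|
    candidate = pending ? pending + line.chomp : line.chomp
    if candidate.split(",", -1).size >= expected
      fixed << candidate
      pending = nil
    else
      pending = candidate   # still incomplete, wait for the next line
    end
  end
  fixed                     # a trailing incomplete fragment is dropped
end

# Records with the same PMC uid, adverb, start and end count as duplicates.
def dedupe(records)
  records.uniq do |r|
    f = r.split(",", -1)
    [f[0], f[4], f[5], f[6]]
  end
end

# Invented toy rows: a split record, followed by an exact duplicate record
rows = [
  "540057,nucleic acids res,2004",
  ",2005,Finally,100,108",
  "540057,nucleic acids res,2004,2005,Finally,100,108"
]
puts dedupe(repair(rows))
```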

Ultimately we end up with a CSV file, adverbs.uniq.csv, containing 91 618 unique and cleaned records.

4. Analysis
R code for analysing the adverbs is in the file adverbs.R. It contains 4 functions: for counting adverbs, plotting the top 20 as a bar plot, visualizing the top 100 as a word cloud and examining adverb occurrence by journal.

4.1 Top 20 adverbs
Let’s start with a basic “top N” list. Plot on the left, word cloud on the right.


Top 20 sentence adverbs in PubMed Central abstracts


Top 100 sentence adverbs in PubMed Central abstracts

It seems that the most popular use of the sentence adverb is to draw a close to the proceedings, with finally. The next most common uses: to indicate further points of interest (additionally), label results as interesting (interestingly) or important (importantly) and show that the authors are up to date with their reading (recently).

Findings are frequently surprising, but more often unfortunate than remarkable.

Altogether, there are 710 unique words in the adverb column from 91 618 records. Note that this includes the false positives mentioned previously, including a number of typographical errors. For example, we find both phylogenitically and phylogentically, but not the correct term, phylogenetically. Similarly, (see what I did there?) both intriguinly and intringuingly are present, but not the correct intriguingly.

4.2 Top 20 opening adverbs
If you’re anything like me, you have sat in front of a blank screen searching for that brilliant opening sentence for an article…and then typed:


What inspiration did the authors of PMC find?

Same code, but applied to the subset of adverbs where match start = 0, i.e. the opening sentence of the abstract.
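The counting and subsetting live in adverbs.R, which is not shown in this post. A rough Ruby analogue, assuming records laid out like the lines of adverbs.uniq.csv (top_adverbs is a hypothetical helper and the records are invented toy data):

```ruby
# Hypothetical helper: the n most frequent adverbs as [adverb, count] pairs.
# With opening_only: true, only matches that start at position 0 are counted.
def top_adverbs(records, n = 20, opening_only: false)
  rows = records.map { |r| r.split(",", -1) }
  rows = rows.select { |f| f[5] == "0" } if opening_only
  counts = rows.each_with_object(Hash.new(0)) { |f, h| h[f[4]] += 1 }
  # Sort by descending count, breaking ties alphabetically
  counts.sort_by { |adv, count| [-count, adv] }.first(n)
end

# Invented toy records: pmc, journal, epub, ppub, adverb, start, end
records = [
  "1,j a,2008,NA,Recently,0,9",
  "1,j a,2008,NA,Finally,300,308",
  "2,j b,2009,NA,Recently,0,9",
  "3,j b,2010,NA,Finally,120,128",
  "4,j c,2011,NA,Finally,0,8"
]
p top_adverbs(records, 2)
p top_adverbs(records, 2, opening_only: true)
```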


Top 20 sentence adverbs that open abstracts in PubMed Central


Top 100 sentence adverbs that open abstracts in PubMed Central

Looks like I’m not the only one who can’t do any better than recently. Authors also open by referring to previous work or assessing the current state of play.

4.3 The bad and the ugly
It’s possible to create all manner of adverbs by sticking -ly on the ends of things. This is rarely a good idea.

You might imagine that very long adverbs stand a good chance of being ugly. Let’s find the longest:

longest <- subset(adverbs.uniq, nchar(adverbs.uniq$adv) == max(nchar(adverbs.uniq$adv)))
# [1] "Electronmicroscopically"

You’d be correct.

You might also imagine that rarely-used adverbs stand a good chance of being ugly:

rarest  <- subset(adverbs.freq, Freq == 1)

270 sentence adverbs appear only once in the dataset, so we’ll leave that as an “exercise for the reader”. Suffice to say that some of them include: endosonographically, ethnopharmacologically, ophthalmoscopically and tinctorially.

4.4 Adverbs by journal
It would be interesting to know whether adverbs are over-represented in particular journals. Simply counting the adverb by journal is no good, since some journals have many more articles in PMC than others. PLoS ONE, for example, accounts for almost 20% of all sentence adverb records:

head(jour[ order(jour$Freq, decreasing = T),])
#                    Var1  Freq
# 2036           plos one 18285
# 1359        j cell biol  3053
# 1428          j exp med  2892
# 1897  nucleic acids res  2531
# 2033         plos genet  2076
# 2037        plos pathog  2043

Incidentally, this calculation is why we converted journal abbreviations to lower-case. PLoS ONE is also listed as PLoS One in PMC, but we want to count those names as one journal, hence plos one.

So, we need some sort of adjustment for total articles. I’ve gone with “occurrences per 100 adverbs.” That is: for every 100 sentence adverbs found in abstracts from a journal, how many occurrences of a particular adverb do we see? Furthermore, I’ve applied this metric to only those journals that have 100 or more sentence adverbs in the PMC dataset.
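The metric itself is simple arithmetic: 100 × (occurrences of the adverb in a journal) / (all sentence adverbs in that journal). Here is a sketch of that calculation in Ruby; it is not the advIndex function from adverbs.R, whose body is not shown in this post. The name adverb_index and the toy records are invented, and the threshold is lowered to 2 only so that the toy data produces output.

```ruby
# Hypothetical helper: for each journal with at least min_total adverb
# records, return [journal, occurrences, occurrences per 100 adverbs],
# sorted with the highest rate first.
def adverb_index(records, adverb, min_total = 100)
  totals = Hash.new(0)   # all sentence adverbs per journal
  hits   = Hash.new(0)   # occurrences of the chosen adverb per journal
  records.each do |r|
    f = r.split(",", -1)
    totals[f[1]] += 1
    hits[f[1]]   += 1 if f[4] == adverb
  end
  totals.select { |_, total| total >= min_total }
        .map { |journal, total| [journal, hits[journal], 100.0 * hits[journal] / total] }
        .sort_by { |_, _, rate| -rate }
end

# Invented toy records: pmc, journal, epub, ppub, adverb, start, end
records = [
  "1,j a,2008,NA,Surprisingly,10,23",
  "1,j a,2008,NA,Finally,300,308",
  "2,j a,2009,NA,Finally,120,128",
  "3,j a,2010,NA,Finally,90,98",
  "4,j b,2011,NA,Surprisingly,50,63",
  "5,j b,2011,NA,Finally,200,208",
  "6,j c,2012,NA,Finally,40,48"
]
adverb_index(records, "Surprisingly", 2).each { |row| puts row.join(",") }
```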

Let’s start with the surprising index.

surprising  <- advIndex("Surprisingly", adverbs.jour, jour)
head(surprising[, 2:4], 10)
#                Var2 Freq         a
# 1440517   plos biol   94 11.032864
# 1443357  plos genet  191  9.200385
# 1521457     sci rep   26  8.965517
# 1312007      nature   24  8.695652
# 1302777  nat commun    9  8.411215
# 964817  j cell biol  250  8.188667
# 208667     bmc biol   12  7.500000
# 1446197 plos pathog  151  7.391092
# 1013807   j exp med  194  6.708160
# 940677  j biol chem   29  6.487696

The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.

How about interesting?

interesting <- advIndex("Interestingly", adverbs.jour, jour)
head(interesting[, 2:4], 10)
#                     Var2 Freq        a
# 225387       bmc immunol   32 20.91503
# 945327      j biomed sci   19 19.00000
# 1010647        j exp bot   40 18.26484
# 1278317 mol neurodegener   22 18.18182
# 1616277          virol j   82 18.14159
# 1262697       mol cancer   81 17.96009
# 806167    int j biol sci   23 17.69231
# 1435937    plant physiol   19 17.27273
# 233907     bmc microbiol   83 16.50099
# 1280447         mol pain   26 16.04938

No clear winner there. Anyone for remarkable results?

remarkable  <- advIndex("Remarkably", adverbs.jour, jour)
head(remarkable[, 2:4], 10)
#                     Var2 Freq        a
# 1311933           nature   24 8.695652
# 1440443        plos biol   36 4.225352
# 686423       genome biol   15 4.000000
# 687133  genome biol evol    5 3.937008
# 1284243    mol syst biol    9 3.422053
# 1502923    retrovirology   10 3.246753
# 1443283       plos genet   67 3.227360
# 1521383          sci rep    8 2.758621
# 740383     hum mol genet    3 2.702703
# 545133      embo mol med    3 2.631579

Save your most remarkable results for submission to Nature.

Finally, what if it’s all a bit unfortunate?

unfortunate <- advIndex("Unfortunately", adverbs.jour, jour)
head(unfortunate[, 2:4], 10)
#                           Var2 Freq        a
# 1444113               plos med  106 8.870293
# 168953          bioinformatics   14 7.329843
# 1600313                 trials    7 6.930693
# 210133          bmc biotechnol   12 6.451613
# 1072783     j med case reports   12 6.030151
# 208003      bmc bioinformatics   74 5.501859
# 252023           bmc syst biol   17 5.396825
# 1444823     plos negl trop dis   34 5.082212
# 1610963 vasc health risk manag    5 4.901961
# 186703       biomed eng online    5 4.761905

I’m a little surprised that bioinformatics features so prominently. What could be so unfortunate? The results of applying your method to real data, as opposed to simulated or training data?

4.5 Most sentence adverbs in an abstract

Who got a bit carried away with the sentence adverbs? Easy to find out by counting up the PMC uids:

pmc <- as.data.frame(table(adverbs.uniq$pmc))
pmc <- pmc[ order(pmc$Freq, decreasing = T),]
#          Var1 Freq
# 11769 2214797    7
# 27434 2873921    7
# 39930 3130555    7
# 42598 3173291    7
# 49235 3280967    7
# 13      17814    6

There are five abstracts that contain 7 sentence adverbs. Going back to the PMC website, we see that these are:

  1. IFN-γ Mediates the Rejection of Haematopoietic Stem Cells in IFN-γR1-Deficient Hosts (the editors’ summary is to blame for that one)
  2. Leishmania donovani Isolates with Antimony-Resistant but Not -Sensitive Phenotype Inhibit Sodium Antimony Gluconate-Induced Dendritic Cell Activation
  3. Quantitative analysis of transient and sustained transforming growth factor-β signaling dynamics (oddly, far fewer occurrences in the current online version)
  4. Review of juxtaglomerular cell tumor with focus on pathobiological aspect
  5. Nondisjunction of a Single Chromosome Leads to Breakage and Activation of DNA Damage Checkpoint in G2

As usual, this analysis is intended to be a bit of fun. Next time you’re writing that article though, ask yourself: is that sentence enhanced by the sentence adverb? Or are you simply following convention?
