Abstract word clouds using R

Posted on August 23, 2010 by nsaunders in R bloggers

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

A recent question over at BioStar asked whether abstracts returned from a PubMed search could easily be visualised as “word clouds”, using Wordle.

This got me thinking about ways to solve the problem using R. Here’s my first attempt, which demonstrates some functions from the RCurl and XML packages.

Update: corrected a couple of copy/paste errors in the code.

First, install a couple of packages: snippets, which provides the cloud() function for plotting a word cloud, and tm, a text-mining library:

install.packages('snippets', repos = 'http://www.rforge.net/')
install.packages('tm')

Next, the code to search PubMed, fetch abstracts and generate a list of words:

library(RCurl)
library(XML)
library(snippets)
library(tm)

# esearch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
q   <- "db=pubmed&term=saunders+nf[au]&usehistory=y"
esearch <- xmlTreeParse(getURL(paste(url, q, sep="")), useInternalNodes = TRUE)
webenv  <- xmlValue(getNodeSet(esearch, "//WebEnv")[[1]])
key     <- xmlValue(getNodeSet(esearch, "//QueryKey")[[1]])

# efetch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
q   <- "db=pubmed&retmode=xml&rettype=abstract"
efetch <- xmlTreeParse(getURL(paste(url, q, "&WebEnv=", webenv, "&query_key=", key, sep="")), useInternalNodes = TRUE)
abstracts <- getNodeSet(efetch, "//AbstractText")

# words
abstracts <- sapply(abstracts, function(x) { xmlValue(x) } )
words <- tolower(unlist(lapply(abstracts, function(x) strsplit(x, " "))))

Word cloud for abstracts

Let’s run through that. First, load up the libraries (lines 1-4). Next, define an EUtils ESearch URL and query (lines 7-8). Use getURL() (RCurl) to fetch the search result in XML format and xmlTreeParse() (XML) to parse the result into an internal XML document (line 9). Use getNodeSet() and xmlValue() to extract the content of the WebEnv and QueryKey tags, which we need when we fetch the abstracts (lines 10-11).
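The two extraction calls can be tried offline. The sketch below parses a small hand-written XML string, a simplified stand-in for an ESearch response rather than real NCBI output, and pulls out the same two tags. The tag values are made up for illustration.

```r
library(XML)

# Hand-written, simplified stand-in for an ESearch result (not real NCBI output)
xml_text <- "<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
</eSearchResult>"

# Parse the string into an internal XML document
doc <- xmlParse(xml_text, asText = TRUE)

# getNodeSet() returns a list of matching nodes; take the first and read its text
webenv <- xmlValue(getNodeSet(doc, "//WebEnv")[[1]])
key    <- xmlValue(getNodeSet(doc, "//QueryKey")[[1]])

# Build the parameters that efetch needs in order to reuse the stored search
params <- paste("&WebEnv=", webenv, "&query_key=", key, sep = "")
print(params)
```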

To retrieve the abstracts: define an EUtils EFetch URL, fetch the XML and parse as before (lines 14-17). This time, the NodeSet object, abstracts, contains the AbstractText tags and their contents. We can use sapply() to pull out the text between the tags of each abstract (line 20). Finally, we split each abstract into words by looking for spaces (" "), put all of the words in one big character vector and convert them all to lower-case, using the “one-liner” on line 21. Conversion to lower-case ensures that the same word is not counted separately under different capitalisations (e.g. “The” and “the”).
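To see what the word-splitting step does, here is a minimal sketch on two made-up strings standing in for fetched abstracts. It relies on strsplit() being vectorised over its input, so no lapply() wrapper is used here.

```r
# Two made-up strings standing in for fetched abstracts
abstracts <- c("The archaeon grows at low temperature.",
               "We sequenced the genome (2.6 Mb).")

# strsplit() returns one character vector per abstract;
# unlist() flattens them and tolower() normalises case
words <- tolower(unlist(strsplit(abstracts, " ")))
print(words)
```

Tokens such as “temperature.” and “(2.6” keep their punctuation at this stage.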

That’s a good start, but there is still some work to do. For a start, many of the words are not strictly words, because they include punctuation symbols. We can get rid of these using grepl(), which here drops every word that contains one of the symbols:

# drop words containing parentheses, comma, [semi-]colon, period, quotation marks
words <- words[!grepl("[)(,;:.'\"]", words)]
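Filtering in this way discards the whole word whenever it carries punctuation, so a word at the end of a sentence is lost. An alternative, offered as a sketch rather than as the method used for the figure, is to strip the symbols with gsub() and keep the word. The toy vector is invented for illustration.

```r
# Invented tokens, as they might look after splitting on spaces
words <- c("genome", "genome.", "(archaea)", "cold-adapted", "proteins,")

# Delete each listed symbol wherever it occurs, keeping the rest of the word.
# Inside [...] only the double quote needs escaping, and only for the R string.
cleaned <- gsub("[)(,;:.'\"]", "", words)

# Drop anything that became empty after stripping (e.g. a lone "." token)
cleaned <- cleaned[cleaned != ""]
print(cleaned)
```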

We’re probably not interested in “words” composed solely of numerals:

words <- words[!grepl("^[0-9]+$", words)]
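A caveat when filtering by pattern: negative indexing with -grep() returns an empty vector if nothing matches, because grep() then yields integer(0) and x[-integer(0)] selects nothing. A logical filter built with grepl() avoids this. A minimal sketch on invented vectors:

```r
# Invented tokens: two purely numeric, two not
words <- c("16s", "2010", "rna", "42")

# Logical filter: TRUE for tokens made up only of digits
is_number <- grepl("^[0-9]+$", words)
filtered <- words[!is_number]
print(filtered)

# The pitfall: with no numeric tokens, grep() returns integer(0)
no_numbers <- c("rna", "dna")
unsafe <- no_numbers[-grep("^[0-9]+$", no_numbers)]    # empty: everything lost
safe   <- no_numbers[!grepl("^[0-9]+$", no_numbers)]   # both words kept
print(length(unsafe))
print(safe)
```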

We’re definitely not interested in commonly-used words such as: “a, and, the, we, that, which, was, those…” and so on. These are referred to as stopwords – and this is where the tm package is useful. It provides a list of stopwords, to which we can compare our word list and remove matches:

words <- words[!words %in% stopwords()]
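The filter relies on %in%, which returns TRUE for each word found in the stopword list; negating the result keeps the rest. The sketch below uses a short hand-made list, my_stopwords, as a hypothetical stand-in for the much longer vector returned by stopwords(), so that it runs without tm.

```r
# Hypothetical stand-in for tm's stopwords(); the real list is much longer
my_stopwords <- c("a", "and", "the", "we", "that", "which", "was")

# Invented word list
words <- c("we", "sequenced", "the", "genome", "and", "the", "proteome")

# %in% is evaluated before !, so this keeps the words NOT in the list
kept <- words[!words %in% my_stopwords]
print(kept)
```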

OK – we are just about ready to plot the word cloud. Count the words using table(), remove those that occur only once and plot:

wt <- table(words)
wt <- wt[wt > 1]
cloud(wt, col = col.br(wt, fit=TRUE))
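To check which words are most frequent before plotting, the same table can be sorted. A sketch on an invented word vector, using base R only:

```r
# Invented words standing in for the cleaned word list
words <- c("genome", "protein", "genome", "cold", "protein", "genome", "archaea")

wt <- table(words)      # frequency of each distinct word
wt <- wt[wt > 1]        # drop words that occur only once

# Most frequent words first
top <- sort(wt, decreasing = TRUE)
print(top)
```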

Result: see the graphic above.

It’s a start, if not quite so attractive as a Wordle. The tm package looks worthy of further investigation; it contains many more functions than the simple use of stopwords() illustrated here.