Abstract word clouds using R

August 23, 2010

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

A recent question over at BioStar asked whether abstracts returned from a PubMed search could easily be visualised as “word clouds”, using Wordle.

This got me thinking about ways to solve the problem using R. Here’s my first attempt, which demonstrates some functions from the RCurl and XML packages.

Update: corrected a couple of copy/paste errors in the code.

First, install a couple of packages: snippets, which provides the cloud() function for plotting a word cloud, and tm, a text-mining library:
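A minimal sketch of the installation step. The repository URL for snippets is an assumption (the package has been distributed through RForge rather than CRAN), so check where it currently lives before running this:

```r
# tm is available from CRAN.
install.packages("tm")

# Assumption: snippets is served from the RForge repository, not CRAN.
install.packages("snippets", repos = "http://www.rforge.net/")
```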


Next, the code to search PubMed, fetch abstracts and generate a list of words:


library(RCurl)     # getURL()
library(XML)       # xmlTreeParse(), getNodeSet(), xmlValue()
library(snippets)  # cloud()
library(tm)        # stopwords()

# esearch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
q   <- "db=pubmed&term=saunders+nf[au]&usehistory=y"
esearch <- xmlTreeParse(getURL(paste(url, q, sep="")), useInternalNodes = TRUE)
webenv  <- xmlValue(getNodeSet(esearch, "//WebEnv")[[1]])
key     <- xmlValue(getNodeSet(esearch, "//QueryKey")[[1]])

# efetch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
q   <- "db=pubmed&retmode=xml&rettype=abstract"
efetch <- xmlTreeParse(getURL(paste(url, q, "&WebEnv=", webenv, "&query_key=", key, sep="")), useInternalNodes = TRUE)
abstracts <- getNodeSet(efetch, "//AbstractText")

# words
abstracts <- sapply(abstracts, xmlValue)
words <- tolower(unlist(strsplit(abstracts, " ")))


[Figure: Word cloud for abstracts]

Let’s run through that. First, load up the libraries (lines 1-4). Next, define an EUtils ESearch URL (lines 7-8). Use getURL() (RCurl) to fetch the search result in XML format and xmlTreeParse() (XML) to parse the result into an internal XML document (line 9). Extract the content of the WebEnv and QueryKey tags, which identify the stored search result, to use when we fetch the abstracts (lines 10-11).
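The extraction step can be tried without a network call. This is a small sketch on a made-up XML string shaped like an ESearch response; the WebEnv value is a placeholder, not a real history key:

```r
library(XML)

# Made-up response in the shape of an ESearch result (placeholder values).
xml_text <- paste0(
  "<eSearchResult><Count>2</Count>",
  "<QueryKey>1</QueryKey>",
  "<WebEnv>EXAMPLE_WEBENV</WebEnv>",
  "</eSearchResult>"
)

# Parse the string into an internal document, as done for the live result.
doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)

# getNodeSet() returns a list of matching nodes; take the first and read its text.
webenv <- xmlValue(getNodeSet(doc, "//WebEnv")[[1]])
key    <- xmlValue(getNodeSet(doc, "//QueryKey")[[1]])

print(c(webenv = webenv, key = key))
```

Both values come back as character strings, which is why they can be pasted straight into the EFetch URL.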

To retrieve the abstracts: define an EUtils EFetch URL, fetch the XML and parse as before (lines 14-16). This time we call getNodeSet() to build a NodeSet object, abstracts, which contains the AbstractText tags and their contents (line 17). We can run sapply() over the node set to pull out the text between the tags (line 20). Finally, we split each abstract into words by looking for spaces (" "), put all of the words in one big vector and convert them all to lower-case, using the “one-liner” on line 21. Conversion to lower-case ensures that words differing only in case (e.g. “The” and “the”) are counted together.
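The splitting step is easy to check on a couple of made-up sentences. The sketch below also shows one possible refinement, which is an assumption about what you might want rather than part of the original code: splitting on runs of whitespace, so that double spaces do not produce empty "words".

```r
# Made-up text standing in for the abstracts vector.
sample_abstracts <- c("The protein binds DNA", "Binding of  the protein")

# Split on single spaces, flatten to one vector, lower-case everything.
words_simple <- tolower(unlist(strsplit(sample_abstracts, " ")))

# Variation: split on one or more whitespace characters instead.
words_ws <- tolower(unlist(strsplit(sample_abstracts, "[[:space:]]+")))

print(words_simple)  # contains an empty string from the double space
print(words_ws)      # no empty strings
```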

That’s a good start, but there is still some work to do. For a start, many of the words are not strictly words, because they include punctuation symbols. We can drop any “word” that contains such symbols using grepl():

# drop words containing parentheses, comma, [semi-]colon, period, quotation marks
words <- words[!grepl("[)(,;:.'\"]", words)]
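Filtering this way discards the whole word whenever it carries punctuation, so "protein," is lost rather than counted as "protein". If stripping the symbols is preferred, gsub() is one option. A minimal sketch on a made-up vector (the sample words are illustrative, not taken from the abstracts):

```r
# Made-up sample; in the post this would be the `words` vector.
sample_words <- c("protein,", "(binding)", "domain", "cells.")

# Strip the punctuation characters instead of dropping the word.
stripped <- gsub("[)(,;:.'\"]", "", sample_words)

# Remove anything left empty after stripping (e.g. a lone "(").
stripped <- stripped[nzchar(stripped)]

print(stripped)
```

Which behaviour is better depends on the data: stripping keeps more words but can merge tokens such as "e.g." into "eg".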

We’re probably not interested in “words” composed solely of numerals:

words <- words[!grepl("^[0-9]+$", words)]
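One detail worth knowing when filtering by pattern: negative indexing with grep() and a logical mask from grepl() behave differently when nothing matches. A small sketch on a made-up vector:

```r
x <- c("alpha", "beta", "gamma")

# No element is all digits, so grep() returns an empty integer vector...
hits <- grep("^[0-9]+$", x)

# ...and indexing with the negation of an empty vector selects nothing.
by_index <- x[-hits]

# A logical mask keeps every element when nothing matches.
by_mask <- x[!grepl("^[0-9]+$", x)]

print(length(by_index))
print(length(by_mask))
```

The mask form is therefore the safer choice when a pattern may have zero matches.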

We’re definitely not interested in commonly-used words such as: “a, and, the, we, that, which, was, those…” and so on. These are referred to as stopwords – and this is where the tm package is useful. It provides a list of stopwords, to which we can compare our word list and remove matches:

words <- words[!words %in% stopwords()]
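The same %in% idiom can be tried without tm. In this sketch, my_stopwords is a hypothetical, hand-written stand-in for the much longer list that stopwords() returns:

```r
# Hypothetical short stopword list; tm's stopwords() is far more complete.
my_stopwords <- c("a", "and", "the", "we", "that", "which", "was", "those")

# Made-up words standing in for the `words` vector.
sample_words <- c("we", "found", "that", "the", "protein", "was", "stable")

# Keep only the words that are not in the stopword list.
kept <- sample_words[!sample_words %in% my_stopwords]

print(kept)
```

Because the comparison is an exact string match, it relies on the words having been lower-cased earlier.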

OK, we are just about ready to plot the word cloud. Count the words using table(), remove those that occur only once and plot:

wt <- table(words)
wt <- wt[wt > 1]
cloud(wt, col = col.br(wt, fit=TRUE))
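To see what is passed to cloud(), here is the counting step on a made-up vector. The sort() call is an extra, added only to make the result easier to read:

```r
# Made-up words standing in for the cleaned `words` vector.
sample_words <- c("protein", "domain", "protein", "binding",
                  "domain", "protein", "kinase")

# Count occurrences of each distinct word.
wt <- table(sample_words)

# Keep only words seen more than once.
wt <- wt[wt > 1]

# Optional: most frequent first.
wt <- sort(wt, decreasing = TRUE)

print(wt)
```

The result is a named table of counts, with word names as labels and frequencies as values.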

Result: see the word cloud graphic above.

It’s a start, if not quite so attractive as a Wordle. The tm package looks worthy of further investigation; it contains many more functions than the simple use of stopwords() illustrated here.

