Editor's note: This is the first in a series of posts from rOpenSci's recent hackathon.
I recently had the pleasure of participating in rOpenSci's hackathon. To be honest, I was quite nervous to work among such notables, but I immediately felt welcome thanks to a warm and personable group. Alyssa Frazee has a great post summarizing the event, so check that out if you haven't already. Once again, many thanks to rOpenSci for making it possible!
In addition to learning and socializing at the hackathon, I wanted to ensure my time was productive, so I worked on a mini-project related to my research in text mining. rOpenSci has a plethora of R packages for extracting literary content off the web, including elife, which is a lightweight interface to the elife API. This package is not yet available on CRAN, but we can easily install it from GitHub thanks to devtools.
Installing the package
library(devtools)
install_github("ropensci/elife")
library(elife)
Brief Overview of Topic Models
My research in text mining is focused on a particular type of topic model known as Latent Dirichlet Allocation (LDA). In general, a topic model discovers topics (i.e., hidden themes) within a collection of documents. For example, if a given document is generated from a hypothetical "statistics topic", there might be a 10% chance a given word in that document is "model", a 5% chance that word is "probability", a 1% chance that word is "algorithm", etc. In contrast, if a document is generated from a hypothetical "computer science topic", there might be a 4% chance a given word in that document is "model", a 2% chance that word is "probability", a 16% chance that word is "algorithm", etc. In other words, each topic is defined by a probability mass function over all possible words.
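To make this concrete, here is a minimal sketch in R of the two hypothetical topics as probability mass functions over a tiny vocabulary. The probabilities for "model", "probability", and "algorithm" come from the example above; the catch-all "(other)" entry and the variable names are assumptions, added only so that each distribution sums to one.

```r
# Toy vocabulary; "(other)" is a hypothetical catch-all for every remaining word
vocab <- c("model", "probability", "algorithm", "(other)")

# Each topic is a probability mass function over the vocabulary
stat_topic <- c(0.10, 0.05, 0.01, 0.84)
cs_topic   <- c(0.04, 0.02, 0.16, 0.78)
names(stat_topic) <- vocab
names(cs_topic)   <- vocab

# Draw 20 words from each topic and compare the word counts
set.seed(1)
stat_words <- sample(vocab, size = 20, replace = TRUE, prob = stat_topic)
cs_words   <- sample(vocab, size = 20, replace = TRUE, prob = cs_topic)
table(stat_words)
table(cs_words)
```

With a real corpus the vocabulary has thousands of words, but the idea is the same: a topic is just a vector of word probabilities.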
LDA takes this example one step further and allows each document to be generated from a mixture of topics. For example, a particular document could be 60% statistics, 10% computer science, 20% mathematics, etc., while a different document could be 30% statistics, 30% computer science, 15% mathematics, etc. Within the LDA literature, fitting models to abstracts of academic articles is quite common, so I thought it would be neat to do the same with abstracts from elife articles.
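The mixture idea can be sketched as a small simulation. This is only an illustration of how LDA assumes a document is generated, not the code used to fit the model in this post: every word probability, the "theorem" vocabulary entry, the "other" topic, and all object names are made up for the example. The topic proportions follow the 60/10/20 document described above, with the remaining 10% assigned to "other".

```r
set.seed(42)
vocab <- c("model", "probability", "algorithm", "theorem", "(other)")

# Rows are topics, columns are words; every row sums to 1.
# All values are hypothetical.
topics <- rbind(
  statistics       = c(0.10, 0.05, 0.01, 0.02, 0.82),
  computer_science = c(0.04, 0.02, 0.16, 0.01, 0.77),
  mathematics      = c(0.03, 0.04, 0.02, 0.12, 0.79),
  other            = c(0.01, 0.01, 0.01, 0.01, 0.96)
)
colnames(topics) <- vocab

# Topic proportions for a single document
theta <- c(statistics = 0.6, computer_science = 0.1,
           mathematics = 0.2, other = 0.1)

n_words <- 50
# For each word position: first pick a topic according to theta,
# then pick a word according to that topic's distribution
z   <- sample(rownames(topics), n_words, replace = TRUE, prob = theta)
doc <- vapply(z, function(k) sample(vocab, 1, prob = topics[k, ]),
              character(1))

table(z)    # how many words came from each topic
table(doc)  # resulting word counts for the document
```

Fitting LDA runs this story in reverse: given only the word counts, it estimates both the topics and each document's mixture.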
Get all the elife abstracts!
In order to grab all the abstracts, first we'll grab all the DOIs that point to currently available articles. Note that we can do more complicated queries of specific articles with searchelife (the help(searchelife) page has some nice examples).
dois <- searchelife(terms = "*", searchin = "article_title", boolean = "matches")
dois can now be used to obtain all sorts of metadata associated with these articles using elife_doi. In this case, I just want the abstracts.
abs <- sapply(dois, function(x) elife_doi(x, ret = "abstract"))
From here, we have what we need to fit the topic model. I don't want to focus on technical details here, but if you are interested in the statistics involved, I recommend reading my post on xkcd comics. This post also covers the method I use to determine an optimal number of topics. I've provided all the code used to fit the model here, but let's skip to the fun part and jump right into exploring the model output.
The window below is an interactive visualization of the LDA output derived from elife abstracts. The aim of this visualization is to aid interpretation of topics. Topic interpretation tends to be difficult since each topic is defined by a probability distribution with support over many words. With this interactive visualization, one can focus on the most "relevant" words for any topic by hovering over or clicking on the appropriate circle. We will define "relevance" shortly, but for now, go ahead and click on the circle towards the bottom labeled "11".