topic models for synchronic & diachronic corpus exploration


This post outlines a fairly simple workflow from annotated corpus to topic model, with a focus on the exploratory utility of topic models. We first consider some text structures relevant to topic modeling in R, and then demonstrate some approaches to visualizing model results, including variation in topic prevalence over time for a diachronic corpus. Lastly, we consider methods for visualizing relatedness among topics.

For demonstration purposes, we utilize a set of corpora made available via the corpusdatr package.

library(tidyverse)
library(topicmodels)
library(tidytext)
library(spacyr)
library(corpusdatr)#devtools::install_github("jaytimm/corpusdatr")
library(corpuslingr)#devtools::install_github("jaytimm/corpuslingr")

Very quickly, topic modeling is an unsupervised text classification methodology in which

  • documents are modeled as composites of topics, and
  • topics are modeled as composites of features/words.

The approach to topic modeling employed here is latent Dirichlet allocation (LDA); to fit our models, we use the LDA function from the topicmodels package (Hornik and Grün 2011).
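
To make this concrete, here is a minimal, self-contained sketch: a two-topic model fit to a few toy documents (invented here purely for illustration; cast_dtm, which requires the tm package, stands in for the cast_sparse call used later). The two matrices returned by topicmodels::posterior correspond directly to the two bullets above:

# Toy illustration: three tiny documents, two topics
toy <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3"),
  term   = c("tax", "senate", "film", "movie", "tax", "film"),
  n      = c(5, 3, 4, 2, 2, 1)
)

toy_dtm <- tidytext::cast_dtm(toy, document = doc_id, term = term, value = n)
toy_lda <- topicmodels::LDA(toy_dtm, k = 2, control = list(seed = 1))

topicmodels::posterior(toy_lda)$terms   # topics as distributions over terms/features
topicmodels::posterior(toy_lda)$topics  # documents as mixtures of topics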

Synchronic application

In our first example, we investigate topics in the annotated Slate Magazine corpus (ca 1996-2000, 1K texts, 1m words), available as cdr_slate_ann from the corpusdatr package. The corpus has been annotated using the spacyr package, and is functionally a synchronic (or static) corpus by virtue of not containing publication date information.
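
Before summarizing the corpus, it can be useful to peek at the annotation itself; the sketch below assumes (as the filtering step that follows does) that the annotation exposes doc_id, lemma, tag, and pos columns:

# Peek at the spacyr-style annotation; column names assumed
# from the filtering step below (doc_id, lemma, tag, pos)
corpusdatr::cdr_slate_ann %>%
  spacyr::entity_consolidate() %>%
  select(doc_id, lemma, tag, pos) %>%
  head()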

The first step in building the model is to summarize our corpus; here, this means representing each document in the corpus in terms of its constituent features and feature frequencies, ie, as a bag-of-words (BOW). To accomplish this task, we use the clr_get_freq function from the corpuslingr package. We set the lemma as our feature unit, and limit feature composition to nouns and entities.

dtm <- corpusdatr::cdr_slate_ann %>%
  spacyr::entity_consolidate() %>%
  filter(tag %in% c("NN", "NNS") | pos =='ENTITY')%>%
  corpuslingr::clr_get_freq(agg_var=c('doc_id','lemma'),
                            toupper=FALSE)%>%
  arrange(doc_id)

Example portion of the corpus-as-BOW:

##   doc_id   lemma txtf docf
## 1      1   noise   14   25
## 2      1     lip    9   24
## 3      1    talk    7  150
## 4      1    walk    6   24
## 5      1     one    5  874
## 6      1 service    5  177

Like most topic modeling functions, topicmodels::LDA requires a document-term matrix (DTM) as input. A DTM is a sparse matrix and, as a data structure, is awkward to work with. Conveniently, the cast_sparse function from the tidytext package allows us to get from corpus-as-BOW to formal DTM quite easily.

Per the output of clr_get_freq, we filter out features with extreme document frequencies (ie, features with limited utility in classification), and then build the DTM with cast_sparse.

static_DTM <- dtm%>%
  filter(docf < 500 & docf > 5)%>%
  tidytext::cast_sparse(row=doc_id,column=lemma,value=txtf)

Which has the following structure:

## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:161352] 0 18 20 104 191 197 229 254 280 291 ...
##   ..@ p       : int [1:5435] 0 25 49 199 223 400 540 596 1001 1067 ...
##   ..@ Dim     : int [1:2] 1000 5434
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:1000] "1" "10" "100" "1000" ...
##   .. ..$ : chr [1:5434] "noise" "lip" "talk" "walk" ...
##   ..@ x       : num [1:161352] 14 1 1 1 1 1 1 1 1 2 ...
##   ..@ factors : list()

Lastly, we fit the model, specifying an eight-topic solution:

static_topic <- topicmodels::LDA(static_DTM, 
                                 k = 8, 
                                 control=list(seed=12))

We extract the terms object from the LDA output using the topicmodels::posterior function; the output includes the posterior probabilities of the terms for each topic. Focusing on the six highest-probability terms per topic, the plot below summarizes model results for the Slate Magazine corpus (ca 1996-2000).

library(ggthemes)
topicmodels::posterior(static_topic)$terms %>%
  data.frame() %>%
  mutate(topics = row.names(.))%>%
  gather(key="term",value="beta", noise:wool) %>%
  group_by(topics) %>%
  top_n(6, beta) %>%
  ungroup() %>%
  arrange(topics, beta)%>%
  mutate(order = row_number(), 
         term=factor(paste(order,term,sep="_"), 
                     levels = paste(order, term, sep = "_")), 
         topics = as.character(topics))%>%
  ggplot(aes(x=term, 
             y=beta, 
             fill=topics)) + 
    geom_col(show.legend = FALSE) +  
    facet_wrap(~topics, scales = "free_y", ncol = 2) +
    scale_x_discrete(labels = function(x) gsub("^.*_", "", x))+
    theme_fivethirtyeight()+ 
    scale_fill_stata() +
    coord_flip()+
    theme(plot.title = element_text(size=14)) +
    labs(title="Topic composition by feature") #

So, some curious times, the close of the 20th century: A lame-duck president in some hot water, presidential primaries, a war abroad. The model seems to paint a fairly clear picture of the socio-political happenings of the time period, and provides a nice macro-vantage from which to view/explore corpus content.

Diachronic application

Next we explore topics in a diachronic corpus, and demonstrate a straightforward approach to visualizing variation in topic prevalence over time. Here we use the cdr_gnews_historical corpus from the corpusdatr package for demonstration purposes.

Corpus and some descriptives

The corpus is comprised of web-based news articles published during a three-week period (11-27-17 to 12-20-17). Articles were retrieved using my quicknews package, which leverages Google News' RSS feed to direct the search, and annotated using the spacyr package.

For the sake of avoiding copyright issues, each constituent article in the corpus has already been reduced to a BOW. The corpus is comprised of ~1,500 texts, ~1.3 million words, and ~200 unique media sources.
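
These counts are easy to recover from the corpus metadata; a quick sketch (assuming, as the plots below do, that docN records each document's word count):

# Corpus descriptives from the metadata
# (docN assumed to be the per-document word count, as in the plots below)
cdr_gnews_meta %>%
  summarize(texts   = n(),
            words   = sum(docN),
            sources = n_distinct(source))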

Example corpus metadata:

head(cdr_gnews_meta)[1:4]
##   doc_id   pubdates          source
## 1      1 2017-11-27  New York Times
## 2      2 2017-11-27  New York Times
## 3      3 2017-11-27 Washington Post
## 4      4 2017-11-27             CNN
## 5      5 2017-11-27             CNN
## 6      6 2017-11-27 Washington Post
##                                                                      titles
## 1         2 Bosses Show Up to Lead the Consumer Financial Protection Bureau
## 2                                    Meghan Markle Is Going to Make History
## 3 Trump could personally benefit from last-minute change to Senate tax bill
## 4                           Melania Trump unveils White House holiday decor
## 5        Trump's latest conspiracy? The 'Access Hollywood' tape was a fake!
## 6                  Trump attacks media in his first post-Thanksgiving tweet

Some basic corpus descriptives:

cdr_gnews_meta%>%
  group_by(pubdates) %>%
  summarize_at(vars(docN),funs(sum))%>%
  ggplot(aes(x=pubdates, group = 1)) +
  geom_line(aes(y=docN),
            size=1.25, 
            color = 'steelblue') +
  labs(title="Daily corpus size", 
       subtitle = "11-27-17 to 12-20-17")

The corpus is comprised predominantly of articles from the Washington Post, CNN, and the New York Times. It is unclear whether these sources simply generate the most content, whether this reflects a bias of the news aggregator, or whether these outlets are less concerned about their content being scraped.

cdr_gnews_meta %>%
  group_by(source) %>%
  summarize_at(vars(docN),funs(sum))%>%
  top_n(10,docN)%>%
  ggplot(aes(x=reorder(source, docN), y=docN)) + 
  geom_col(width=.65, fill='steelblue') +  
  coord_flip()+
  labs(title="Top ten news sources by text frequency", 
       subtitle = "11-27-17 to 12-20-17")

Topic Model

We follow the same procedure for fitting the topic model as we did previously:

hist_topic <- corpusdatr::cdr_gnews_historical %>%
  filter(tag %in% c("NN", "NNS") | pos =='ENTITY')%>%
  group_by(doc_id,lemma) %>%
  summarize_at(vars(freq),funs(sum))%>%
  tidytext::cast_sparse(row=doc_id,column=lemma,value=freq)%>%
  topicmodels::LDA(., 
                   k = 12, 
                   control = list(verbose = 0, seed=999))

Extract the topic summary of the model:

topic_summary <- data.frame(topicmodels::terms(hist_topic,7)) %>%
  gather(key='topic',value='val',Topic.1:Topic.12) %>%
  group_by(topic)%>%
  summarize (dims = paste(val,collapse=', '))%>%
  mutate(topic = as.numeric(gsub('Topic.','',topic)))%>%
  arrange(topic)

Topics over the three-week period:

## # A tibble: 12 x 1
##    dims                                                          
##    <chr>                                                         
##  1 tax, bill, Senate, Republicans, House, vote, rate             
##  2 Moore, Alabama, woman, voter, election, Republican, Trump     
##  3 woman, allegation, harassment, story, time, statement, people 
##  4 film, time, first, one, people, movie, way                    
##  5 North Korea, missile, snow, U.S., weapon, North Korean, report
##  6 company, deal, Fox, market, price, time, Disney               
##  7 EU, deal, government, trade, Yankees, time, Britain           
##  8 fire, wildfire, home, man, photo, Ventura, people             
##  9 Iran, Yemen, CNN, people, coalition, missile, time            
## 10 game, team, time, coach, player, first, NFL                   
## 11 Trump, president, Russia, Trump_'s, official, Mueller, Flynn  
## 12 police, Jerusalem, Israel, officer, death, city, people

So, a busy three weeks: the special Senate election in Alabama (and the surrounding controversy), wildfires in California, North Korea, the Mueller investigation, tax reform, the #MeToo movement.

Topic prevalence historically

In order to quantify the prevalence of these topics over time, we shift focus from topic composition in terms of words/features to document composition in terms of topics. So, we first extract the posterior probabilities of the topics for each document; then we join corpus metadata and topic summary details.

hist_beta <- topicmodels::posterior(hist_topic)$topics %>%
  data.frame() %>%
  mutate(doc_id = row.names(.))%>%
  arrange(as.numeric(doc_id))%>%
  left_join(cdr_gnews_meta) %>%
  gather(key="topic",value="val",X1:X12) %>%
  mutate(topic = as.numeric(gsub('X','',topic)))%>%
  left_join(topic_summary)

Based on this set of model results, each document in our corpus can be represented as a composite of the twelve topics summarized above; topic composites for an example set of texts are illustrated in the figure below. Per the figure, text 183 is comprised (in varying degrees) of topics 2, 3, 4, and 9.

hist_beta %>%
  filter(doc_id %in% c('183', '631','896')) %>%
  ggplot(aes(x=reorder(paste(topic,dims, sep=" - "), -topic), 
             y=val,  
             fill = dims)) +
  geom_col(width=.85) +  
  coord_flip()+
  scale_fill_stata() +
  theme(axis.text.x = element_text(angle = 90))+
  xlab ("topic") + ylab("beta") +
  facet_wrap(~doc_id)+
  labs(title="Text by topic") +
  theme(legend.position="none")

Topic prevalence over time, then, is simply the aggregate of these topic probabilities for each document by publication date.

agg_hist_beta <- hist_beta %>%
  group_by(pubdates,topic,dims) %>% 
  summarize_at(vars(val),funs(sum))%>%
  ungroup()

Finally, we plot the results. The size of each plot point represents the aggregate posterior probability, which can be interpreted, roughly, as the expected number of articles published on day d that are about topic z.

The top seven words associated with each topic are displayed as well. For a relatively small corpus (comprised of a wide range of content), the plot provides a nice overview of variation in topic prevalence over time.

p <- ggplot(agg_hist_beta) +
     geom_point(aes(x = pubdates, 
                    y = reorder(topic,-topic), 
                    size = val, 
                    color=dims)) +
  theme_fivethirtyeight() + 
  scale_color_stata() 

p + geom_text(data = agg_hist_beta[agg_hist_beta$pubdates == "2017-11-27",], 
     aes(x = pubdates, y = reorder(topic,-topic), label = dims), 
     vjust=-1,
     hjust=0) +
  labs(title="Topic prevalence over time", 
       subtitle="11-27-2017 to 12-20-2017")+
  theme(legend.position="none", 
        plot.title = element_text(size=14))

Topic clusters

Lastly, we consider the relationship among topics in the cdr_gnews_historical corpus via cluster analysis. The first step in this process is to build a correlation matrix from the per-document topic probabilities, ie, a matrix capturing the degree to which pairs of topics co-occur across documents.

cor_mat <- data.frame(posterior(hist_topic)$topics)%>%
   `colnames<-`(paste(topic_summary$topic,topic_summary$dims, sep=" - ")) %>%
    cor(.)

Next, we compute the distances (ie, dissimilarities) between topic pairs, and perform hierarchical cluster analysis on the resulting matrix. We use the ggdendro package to plot the results. Per the plot below, some relationships are intuitive, some less so.

library(ggdendro)
hclust(dist(cor_mat)) %>%
  ggdendrogram(., rotate=TRUE) + 
  theme_fivethirtyeight()

Quick summary

This has been a brief outline of a quick topic modeling workflow, with some different applications to synchronic and diachronic corpora. For a more thorough discussion of the underlying assumptions and mathematics, see Hornik and Grün (2011).

References

Hornik, Kurt, and Bettina Grün. 2011. “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13): 1–30.
