LDA: 4321

[This article was first published on Code and Stats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Auster4321

Table of Contents

  1. Libraries
  2. Introduction
  3. Data
  4. LDA on Chapters
  5. Mistaken words
  6. References

Libraries

These libraries will be needed to conduct the research.

library(topicmodels)
library(plyr)
library(tidyverse)
library(tidytext)
library(dplyr) 
library(stringr)
library(tidyr)
library(scales)
library(tibble)

Note: The idea (and some of the code) originates from Silge and Robinson’s book on Text Mining with R. Following this blog post, there may emerge some Error’s, most notably with the creation of the LDA model (LDA_VEM). Solutions have been addressed here and here.

Introduction

In this blog post, we review the Latent Dirichlet Allocation (LDA) model for its ability to identify different storylines in a single book. LDA is a Topic Model in the field of Natural Language Processing (NLP), Machine Learning (ML). LDA is a generative process that is often used to find structures (topics) in an unlabelled text corpus.

For example, Blei (2012) identified topics from 17,000 science articles, using an LDA model. Silge and Robinson (2020) used the LDA model to assign unlabelled chapters from four well-known books to their respective storyline. Our experiment is based on the same idea. We test if unlabelled chapters can be assigned to their respective plot lines. However, instead of using different books, we applied the idea to the novel 4 3 2 1 by the American writer Paul Auster.

The defining characteristic of the novel is its distinct narrative style. The protagonist Archie Ferguson is at the center of the plot. However, instead of a linear narrative style, the author created four versions of the main character. Over the course of the book, the four storylines evolve and diverge in parallel. The storylines vary in the protagonist’s choices and the strokes of fate he suffers. Yet, there are consistent elements and recurring characters that exist across all storylines. Thus, the four storylines are fundamentally different, but in many details, they are one.

Data

The data set is the text corpus of the novel, written in the English language. It contains over 350,000 words across four storylines, divided into multiple chapters. In total, we consider 22 chapters, which are unequally distributed among the four storylines. The first and the last storyline both stretch over seven chapters, while the other stories are cut short. The number of chapters per storyline is a result of the main character’s decisions.

The data set is available as a text corpus in a txt-file. First, we import the text.

auster = readLines("4 3 2 1_ A Novel - Paul Auster.txt")

Then we need to find the beginning of a chapter, for instance of the first.

which(auster == "1.1")

## [1]  167 6710

The first number is the chapter number position (the chapter beginning) in the data in R. The second number is the position in the Appendix, therefore not relevant. Adapting the code we can automatically find each chapter’s beginning throughout the book.

part1 = c()
for(i in 1.1:7.1){part1 = c(part1,which(as.character(i) == auster)[1])}

part2 = c()
for(i in 1.2:7.2){part2 = c(part2,which(as.character(i) == auster)[1])}

part3 = c()
for(i in 1.3:7.3){part3 = c(part3,which(as.character(i) == auster)[1])}

part4 = c()
for(i in 1.4:7.4){part4 = c(part4,which(as.character(i) == auster)[1])}

We connect all index positions into one vector.

parts = c(part1, part2, part3, part4)
parts = c(parts[seq(1,length(parts),7)], 
          parts[seq(2,length(parts),7)], 
          parts[seq(3,length(parts),7)], 
          parts[seq(4,length(parts),7)],
          parts[seq(5,length(parts),7)],
          parts[seq(6,length(parts),7)],
          parts[seq(7,length(parts),7)],
          which(auster == "ALSO BY PAUL AUSTER"))

Then we divide it into its chapters and prepare the text for further analysis. The first of several data pre-processing steps is tokenization. Here, each chapter is divided into individual words. The words are converted to lower case and the punctuation is removed. Other common pre-processing steps in NLP are lemmatization and stemming, where words are reduced to their word stem. We decided against these algorithmic processes, as information gets lost and the differences between the topics are subtle.

df = tibble(document = as.character(0), word = as.character(0)) # Initialize
#df1 = tibble(document = as.character("First"))
#df2 = tibble(word = as.character("First"))
for (j in 1:4){
  for(i in 1:length(part1)){
    story1_1 = auster[parts[seq(j, length(parts),4)][i]:(parts[seq(j+1,length(parts),4)][i]-1)]
    story1_1 = sapply(story1_1, paste0, collapse="")
    story1_1 = paste(story1_1, collapse = " ")
    story1_1 = gsub(paste0("[", as.character(i),'.', "1", "2", "3", "4",",", "(", ")", "*", ":", "]" ), "", story1_1)
    story1_1_words = strsplit(story1_1, split = " ")
    df = df %>% add_row(document = paste0("Storyline-", j, "_Chapter-", i), word = unlist(story1_1_words))
    #df1 = df1 %>% add_row(document = paste0("Storyline-", j, "_Chapter-", i))
    #df2 = df2 %>% add_row(word = unlist(story1_1_words))
  }
}
#df = bind_rows(df1, df2)
df = df[-1,]
df$word = tolower(df$word)

Additionally we change “ferguson’s” to “ferguson” and remove empty strings.

df[df$word=="ferguson’s",]$word = "ferguson"
df = df[-which(df$word == ""),]

In the following step, so-called stop words are removed. In the English language exemplary stop words are “and”, “the”, “a”, “in”. They generate noise in statistical models, as they carry hardly any information. Also, they inflate the text corpus and thus reduce the speed with which the model is calculated.

word_counts = df %>% anti_join(stop_words) %>% count(document, word, sort = TRUE)

Eventually we may proof-check whether the chapters are correctly ordered.

for (i in 1:7){print(df$word[head(which(df$document==paste0("Storyline-1_Chapter-",i)))])}
for (i in 1:7){print(df$word[head(which(df$document==paste0("Storyline-2_Chapter-",i)))])}
for (i in 1:7){print(df$word[head(which(df$document==paste0("Storyline-3_Chapter-",i)))])}
for (i in 1:7){print(df$word[head(which(df$document==paste0("Storyline-4_Chapter-",i)))])}

LDA on Chapters

LDA is one of the best-known methods of Topic Modeling. Topic Modeling describes a model that detects structures (“topics”) in a finite set of documents. Typically, documents are a set of words, while topics are a “distribution over a fixed vocabulary” (Blei, 2012). Theoretically, LDA and Topic Models can also be applied in areas outside of Natural Language Processing.

According to Silge and Robinson (2020), there are two basic ideas behind LDA. First, several topics can refer to the same document. Thus, a document can theoretically consist of k topics, where k is the set of all topics. These topics describe the document to a varying degree. Second, a set of specific words form a topic. These words can be unique and belong to only one topic. At the same time, there may be words that describe several topics. Thus, we have two probabilities that our model estimates. A term-topic probability, which indicates the probability that a word can be assigned to a certain topic. As well as a topic-document probability, which indicates which topic best describes a document.

In our example, k is the number of storylines, as well as the number of topics. Applying the two ideas described above to our experiment, we can identify two challenges. First, creating a unique, specific mixture of topics per document, in a way that the document is best described by the prevailing topic. Second, identifying terms in unique topics, that highlight the subtle differences between the storylines.

Using the package topicmodels in R, we create a four topic LDA model. Additionally the data word_counts is in a tidy form, but has to be changed into a DocumentTermMatrix, in order to be used in the topicmodels package.

chapters_dtm = word_counts %>%
    cast_dtm(document, word, n)

Now we create a four-topic model using LDA(). We are using four topics, as there are four different stories told.

chapters_lda = LDA(chapters_dtm, k = 4, control = list(seed = 1234))
print(chapters_lda)

## A LDA_VEM topic model with 4 topics.

We can take a look at the per-topic-per-word probabilities. These are the probabilities that a certain term (word) will occur in each of the four topics.

chapter_topics = tidy(chapters_lda, matrix = "beta")

This code examines the betas for the most prevailing term “ferguson”.

chapter_topics[which(chapter_topics$term == "ferguson"),]

## # A tibble: 4 x 3
##   topic term       beta
##         
## 1     1 ferguson 0.0265
## 2     2 ferguson 0.0276
## 3     3 ferguson 0.0265
## 4     4 ferguson 0.0229

We see that the probability is high and evenly distributed over the topics. This is in alignment with our expectations, since Ferguson is the main character across all plot lines. However, if we look for other characters, such as Vivian, we can see that she cannot be of equal importance.

chapter_topics[which(chapter_topics$term == "vivian"),]

## # A tibble: 4 x 3
##   topic term        beta
##          
## 1     1 vivian 1.76e-152
## 2     2 vivian 3.24e-  3
## 3     3 vivian 7.87e-111
## 4     4 vivian 1.30e- 20

It can be assumed that she is only relevant in one storyline. It should be noted that the second topic does not necessarily refer to the second storyline.

Now we can use dplyr’s top_n() function, to find the top n terms within each topic.

top_terms = chapter_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

We can easily visualize the top n words using ggplot.

library(ggplot2)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() 

The figure shows that Ferguson is the defining character across all topics and the most prominent term. Further coherences can be drawn, such as the parents (“mother” and “father”) are among the most important terms across most topics. At the same time, however, we can also work out differences. For example, it can be seen that “columbia” (may be the university) plays a greater role in topic four. Furthermore, we can identify characters that occur exclusively in individual topics or are overrepresented, such as Vivian, Gil, Celia, and Noah.

We can systematically work out these topical differences, by taking the ratio of the corresponding betas. The following code programms in dplyr, which can be learned more about here.

log2ratio = function(first_topic, second_topic, topics_chapter){
  
  # Create the Beta Spread Table
  beta_spread = topics_chapter %>%
    mutate(topic = paste0("topic", topic)) %>%
    spread(topic, beta) %>%
    filter(!!sym(first_topic) > .001 | !!sym(second_topic) > .001) %>%
    mutate(log_ratio = log2(!!sym(first_topic) / !!sym(second_topic)))
  
  # Modify the Data
  top_n_beta_spreads = head(beta_spread[c('term','log_ratio')][order(-beta_spread[c('term', 'log_ratio')]$log_ratio),],n=5)

  tail_n_beta_spreads = tail(beta_spread[c('term', 'log_ratio')][order(-beta_spread[c('term', 'log_ratio')]$log_ratio),],n=5)
  
  top_beta_spread = top_n_beta_spreads %>% 
    add_row(term = tail_n_beta_spreads$term, log_ratio = tail_n_beta_spreads$log_ratio)
  top_beta_spread = top_beta_spread[order(top_beta_spread$log_ratio),]
  
  # Create the Plot
  par(mar=c(4,8,3,2)+.1)
  barplot(pull(top_beta_spread), horiz = TRUE, names.arg = top_beta_spread$term,las = 1, axes = FALSE,
        main = paste(as.character(first_topic), "/", as.character(second_topic)))
  axis(1,at=seq(round(min(top_beta_spread$log_ratio)),round(max(top_beta_spread$log_ratio)),50))
}

par(mfrow=c(3,2), cex=1.5)
log2ratio("topic1", "topic2", chapter_topics)
log2ratio("topic1", "topic3", chapter_topics)
log2ratio("topic1", "topic4", chapter_topics)
log2ratio("topic2", "topic3", chapter_topics)
log2ratio("topic2", "topic4", chapter_topics)
log2ratio("topic3", "topic4", chapter_topics)

We can see clear differences between the topics. For instance, topic2 appears to have a character aubrey that only plays a role in that particular storyline. Additionally, there are many more characters and distinctions we can make out due to the beta_spread. These differences help us as a reader, but also the model, to distinguish between different storylines.

So far we have considered each topic individually, trying to see the bigger picture by comparing all topics simultaneously. However, we may also take a look at one topic and its difference to all others. We do so by calculating the logarithmic ratio of one topic and the mean of the three others.

beta_spreads = function(first_topic, second_topic, 
                        third_topic, fourth_topic, topics_chapter){
  topics_chapter %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(!!sym(first_topic) > .001 | !!sym(second_topic) > .001 | !!sym(third_topic) > .001 | !!sym(fourth_topic) > .001)
}

beta_spread = beta_spreads("topic1", "topic2", "topic3", "topic4", chapter_topics)
beta_spread$log_avg1 = log2(beta_spread$topic1 / rowMeans(beta_spread[,3:5]))
beta_spread$log_avg2 = log2(beta_spread$topic2 / rowMeans(beta_spread[,c(2,4,5)]))
beta_spread$log_avg3 = log2(beta_spread$topic3 / rowMeans(beta_spread[,c(2,3,5)]))
beta_spread$log_avg4 = log2(beta_spread$topic4 / rowMeans(beta_spread[,2:4]))

plot_avg_spreads = function(df_spreads, topic){
  head_spreads = head(df_spreads[, c(1, 5+topic)][order(-df_spreads[, c(1, 5+topic)][,2]),],n=5)
  tail_spreads = tail(df_spreads[, c(1, 5+topic)][order(-df_spreads[, c(1, 5+topic)][,2]),],n=5)
  top_beta_spread = rbind(head_spreads, tail_spreads)
  top_beta_spread = top_beta_spread %>% map_df(rev)
  
  #par(mar=c(4,8,3,2)+.1)
  barplot(pull(top_beta_spread), horiz = TRUE, names.arg = top_beta_spread$term,las = 1, axes = FALSE,
          main = paste("Topic", topic, "/ Mean of Topic", toString(setdiff(c(1,2,3,4),topic))))
  axis(1,at=seq(round(min(top_beta_spread[,2])),round(max(top_beta_spread[,2])),50))
}

par(mfrow = c(2, 2), oma = c(1, 1, 1, 1), mar = c(5, 5, 5, 1), cex=1.5) 
plot_avg_spreads(beta_spread, topic = 1)
plot_avg_spreads(beta_spread, topic = 2)
plot_avg_spreads(beta_spread, topic = 3)
plot_avg_spreads(beta_spread, topic = 4)

Using this graphic we can clearly see which terms make up a topic. For example, we see that Evie and Celia are characters from topic 1, while Aubrey and Vivian seem to be characters from topic 2. By comparing one topic with all the others simultaneously (and direclty), we can draw conclusions about the uniqueness of individual terms (words) in topics.

After examining the varying terms and their betas in depth, we may take a closer look at the per-document classification. How well does the LDA model assign the chapters to their respective storyline?

Per-document classification

To each chapter there are four per-document-per-topic probabilities (topic 1, 2, 3 and 4).

chapters_gamma = tidy(chapters_lda, matrix = "gamma")

Thus, we can see the probabilities with which the model assigns each chapter to a topic.

chapters_gamma[chapters_gamma$document == "Storyline-1_Chapter-1",]

## # A tibble: 4 x 3
##   document              topic      gamma
##                          
## 1 Storyline-1_Chapter-1     1 0.00000651
## 2 Storyline-1_Chapter-1     2 0.854     
## 3 Storyline-1_Chapter-1     3 0.146     
## 4 Storyline-1_Chapter-1     4 0.00000651

The model believes the first chapter of the first storyline belongs to the second topic. We may seperate the chapters from the storylines.

chapters_gamma = chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

And then illustrate the per-document-per-topic probability for all chapters.

chapters_gamma %>%
  mutate(title = reorder(title, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title)

From the plot we can see that Storyline-2, Storyline-3 and Storyline-4 are classified well, Storyline-1, however, is ambiguous. To get a better understanding why the first storyline is harder to be classified, we may take a closer look at the classified chapters. chapter_classifications gives away the probabilities with which the model assigns each chapter to a topic.

chapter_classifications = chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(gamma)

book_topics = chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

After examining the chapters and their classification, we may also take a look at the terms (words) and their classification. This may be done using augment from the package broom.

assignments = augment(chapters_lda, data = chapters_dtm)

assignments = assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

Adding the consensus (books assigned to) to the assignments tibble lets us examine the relationship between words and their misclassification to other chapters. It may be visualized for clarification using ggplot.

assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")

We can see that all words that came from Storyline-3 and Storyline-2 were correctly classified to their storyline. Words from Storyline-1 and Storyline-4 were mainly correctly classified, but carried some minsinterpretation as well.

So far we have given the model the whole chapter for classification. But we may ask, how many words (or pages) does the model need to accurately classify the chapter?

The following code takes n words (stepSize), creates an LDA model and subsequently classifies the sub-chapter. This is done for all chapters across the book.

Paragraph2Topic = function(lda_obj=chapters_lda , df, doc="Storyline-1_Chapter-1", start=100, end = 1600){
  df2 = subset(df, document == doc)#paste0("Storyline-",storyline,"_Chapter-",chapter)
  df2 = df2[start:end,]
  
  word_counts2 = df2 %>% anti_join(stop_words) %>% count(document, word, sort = TRUE) 
  
  
  chapters_dtm2 = word_counts2 %>%
    cast_dtm(document, word, n) 
  
  lda_inf <- posterior(lda_obj, chapters_dtm2) 
  return(lda_inf)
}

all_chapters = function(seed = 123, stepSize = 1000){
  
  set.seed(seed) # Set Seed
  
  docs = unique(df$document) # List of 22 Chapters
  chapter_lengths = table(df$document) # Lengths of Chapters
  chapter_lengths = round(chapter_lengths, -3)-1000 # Rounds the chapters_lengths down, so the last iteration does not overshoot the length
  nmbrRows = sum(chapter_lengths/stepSize) # To pre-allocate the space, the number of rows must be known
  df_chapters = data.frame(matrix(ncol = 5, nrow = nmbrRows)) # Initialize and pre-allocate the data.frame
  df_chapters = df_chapters %>% rename(Chapter = X1, StepSize = X2, Window = X3, Classified = X4, Correct = X5) # Rename the data.frame
  iteration = 1 # Counter
  
  for (j in 1:length(chapter_lengths)) { # Outer loop: Loops over the Chapters 
    d = docs[j] # Define the doc
    sequ = seq(1,chapter_lengths[j], stepSize) # Creates a sequence, given the length of the respective chapter
    for (i in 1:length(sequ)) { # Inner loop: Loop through the single chapers
      s = sequ[i] # Takes the given starting point
      lda_inf=suppressMessages(Paragraph2Topic(chapters_lda ,df,doc=d,start=s, end = s+stepSize)) # Suppresses the "Join by = " message
  
      df_chapters$Chapter[iteration] = d # Assign the chapter name
      df_chapters$StepSize[iteration] = stepSize # Assign the step size
      df_chapters$Window[iteration] = i # Assign the window number
      df_chapters$Classified[iteration] = which.max(lda_inf$topics) # Classifications 
      df_chapters$Correct[iteration] = book_topics$topic[as.integer(substr(docs[j], 11,11))] # Corrects
      
      iteration = iteration + 1 # Increase the counter
    }
  }
  return(df_chapters)
}

We may call the function using a stepSize of 100. That is, we only provide the model with 100 words and challenge it to correctly classify it. The second line gives us the Accuracy, the correctly classified chapters, divided by all.

df_chapters = all_chapters(stepSize = 100)
sum(df_chapters$Classified == df_chapters$Correct) / length(df_chapters$Classified)

## [1] 0.746988

In order to create the confusion matrix, we need to adjust the data.frame.

# Adapt the data.frame to make it compatible with the Confusion Matrix code
df_chapters_CM = df_chapters %>%
  separate(Chapter, c("Title", "Chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c("Classified" = "topic"))

Now we can take the slightly adjusted code from above, to create a confusion matrix.

# Confusion Matrix
df_chapters_CM %>%
  count(Title, consensus) %>%
  group_by(Title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, Title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Correct Chapters",
       y = "Assigned Chapters",
       fill = "% of assignments")

We can see that all chapters that came from Storyline-2 and Storyline-3 were correctly classified to their storyline. Chapters from Storyline-1 and Storyline-4 were mainly correctly classified, but carried some minsinterpretation as well. Thus, we may ask what these misinterpretations were.

Mistaken words

By comparing the title (true chapter) with the consensus (classified chapter) we can spot differences.

mistakes = assignments %>%
  filter(title != consensus)

mistakes = mistakes %>%
    count(title, consensus, term, wt = count) %>%
    ungroup() %>%
    arrange(desc(n))

mistakes

## # A tibble: 14,714 x 4
##    title       consensus   term         n
##                      
##  1 Storyline-1 Storyline-2 ferguson   278
##  2 Storyline-4 Storyline-1 ferguson   221
##  3 Storyline-1 Storyline-3 ferguson   190
##  4 Storyline-4 Storyline-2 ferguson   164
##  5 Storyline-1 Storyline-2 amy         95
##  6 Storyline-1 Storyline-3 mother      80
##  7 Storyline-1 Storyline-2 mother      71
##  8 Storyline-4 Storyline-2 mother      66
##  9 Storyline-1 Storyline-2 father      64
## 10 Storyline-4 Storyline-1 father      62
## # … with 14,704 more rows

The main character ferguson was the character most often misclassified, which does not come as a surprise, as he spans across all storylines. Other often wrongly classified characters are the mother and father, as well as amy.

Note: You may find this Blog Post on R-Bloggers.com as well.

References

David M. Blei. Probabilistic topic models. Communications of the ACM, Vol. 55 No. 4, Pages 77-84, 2012.

Julia Silge and David Robinson. Text Mining with R, O’Reilly, 2020.

To leave a comment for the author, please follow the link and comment on their blog: Code and Stats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)