Sentiment Analysis of The Lord Of The Rings with tidytext

March 1, 2017

(This article was first published on Jakub Glinka's Blog, and kindly contributed to R-bloggers)

You got me thinking about Watson and its unprecedented flexibility in analyzing different data sources (at least according to IBM). So how difficult would it be to analyze the sentiment of one of my favorite books using R? Pretty easy, actually, all thanks to the new tidytext package by Julia Silge and David Robinson.

The tidy text format

The tidy text format is defined as a table with one term per row. In short, it is just tidy textual data, which lets you use tidyverse tools to clean and transform it. More about the conceptual underpinnings of the tidytext package can be found in the online book:

http://tidytextmining.com/tidytext.html
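
To make the one-term-per-row idea concrete, here is a minimal sketch on two made-up lines of text. The `toy` tibble and its `line` column are assumptions for illustration only and are not part of the LOTR data used later.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of text, one row per line (illustrative data only)
toy <- tibble(line = 1:2,
              text = c("The road goes ever on",
                       "Not all those who wander are lost"))

# unnest_tokens() lowercases the text and returns one word per row,
# keeping the other columns (here: line) alongside each word
tidy_toy <- toy %>%
  unnest_tokens(word, text)

tidy_toy
```

Each row of `tidy_toy` now holds a single word together with the line it came from, which is the shape the joins and counts later in the post rely on.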

Parsing data to the tibble format

I parsed one of the websites that offers the LOTR trilogy online. After some pretty basic operations on the HTML source I extracted the full text along with chapter names, book parts and book titles:

## # A tibble: 6 × 4
##                        book   part               chapter
##                                         
## 1 1. Fellowship of the Ring Book I A Long-expected Party
## 2 1. Fellowship of the Ring Book I A Long-expected Party
## 3 1. Fellowship of the Ring Book I A Long-expected Party
## 4 1. Fellowship of the Ring Book I A Long-expected Party
## 5 1. Fellowship of the Ring Book I A Long-expected Party
## 6 1. Fellowship of the Ring Book I A Long-expected Party
## # ... with 1 more variables: text 

with the text field simply containing lines of text from the books:

## # A tibble: 6 × 1
##                                                    text
##                                                   
## 1 When Mr. Bilbo Baggins of Bag End announced that h...
## 2 celebrating his eleventy-first birthday with a par...
## 3    there was much talk and excitement in Hobbiton....
## 4 Bilbo was very rich and very peculiar, and had bee...
## 5 for sixty years, ever since his remarkable disappe...
## 6 The riches he had brought back from his travels ha...
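
The parsing code itself is not shown in this post, so the following is only a rough sketch of how such a tibble could be built. It assumes the rvest package and a hypothetical page layout (chapter title in an `h2`, one paragraph per `p`); the inline HTML string stands in for a downloaded page and the real site may be structured quite differently.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for one downloaded chapter page
page <- read_html(
  "<html><body>
     <h2>A Long-expected Party</h2>
     <p>First line of the chapter.</p>
     <p>Second line of the chapter.</p>
   </body></html>")

# Pull out the chapter title and the paragraph texts
chapter_title <- page %>% html_element("h2") %>% html_text2()
lines <- page %>% html_elements("p") %>% html_text2()

# One row per line of text, with the book, part and chapter repeated
lotr_toy <- tibble(
  book = "1. Fellowship of the Ring",
  part = "Book I",
  chapter = chapter_title,
  text = lines
)

lotr_toy
```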

Sentiment analysis

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. Since LOTR is naturally divided into chapters we can apply sentiment analysis to them and plot their sentiment scores.
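
As a small illustration of the "sum of the words" idea, here is a sketch with a made-up four-word lexicon and two made-up chapters. The names `toy_lexicon` and `toy_words` and the numeric scores are assumptions for the example, not one of the real lexicons.

```r
library(dplyr)

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones
toy_lexicon <- tibble(word  = c("merry", "bright", "dark", "fear"),
                      score = c(1, 1, -1, -1))

# Made-up tokens, already in one-word-per-row form
toy_words <- tibble(
  chapter = c("A", "A", "A", "B", "B", "B"),
  word    = c("merry", "bright", "road", "dark", "fear", "bright"))

# Words missing from the lexicon (here: "road") drop out of the join;
# the chapter score is the sum of the remaining word scores
chapter_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(chapter) %>%
  summarise(sentiment = sum(score))

chapter_scores
```

Chapter A ends up with a score of 2 and chapter B with -1, since B has two negative words and one positive one.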

The three general-purpose lexicons available in the tidytext package are

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

I will use the Bing lexicon, which is simply a tibble of words, each labeled as positive or negative:

get_sentiments("bing") %>% head
## # A tibble: 6 × 2
##         word sentiment
##             
## 1    2-faced  negative
## 2    2-faces  negative
## 3         a+  positive
## 4   abnormal  negative
## 5    abolish  negative
## 6 abominable  negative

and this is how you run sentiment analysis the tidytext way:

library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)

# lotr: the tibble shown above, with chapter stored as a factor
lotr %>%
  # split text into words
  unnest_tokens(word, text) %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  # keep only words found in the Bing lexicon, with their sentiment label
  inner_join(get_sentiments("bing"), by = "word") %>%
  # count number of negative and positive words
  count(chapter, book, sentiment) %>%
  # one column per sentiment; a missing count becomes 0 instead of NA
  spread(key = sentiment, value = n, fill = 0) %>%
  ungroup() %>%
  # create centered score
  mutate(sentiment = positive - negative - 
           mean(positive - negative)) %>%
  select(book, chapter, sentiment) %>%
  # reverse chapter levels so the first chapter ends up on top after coord_flip()
  mutate(chapter = factor(as.character(chapter), 
                levels = rev(levels(chapter)))) %>%
  # plot
  ggplot(aes(x = chapter, y = sentiment)) + 
  geom_bar(stat = "identity", aes(fill = book)) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  coord_flip() + 
  ylim(-250, 250) +
  ggtitle("Centered sentiment scores", 
          subtitle = "for LOTR chapters")

[Figure: centered sentiment scores for LOTR chapters, one bar per chapter, filled by book]

It’s pretty neat if you ask me.
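
If the centering step in the pipeline is not obvious, here is a toy check of the arithmetic. The three chapters and their word labels are made up; only the `count()`, `spread()` and `mutate()` steps mirror the pipeline, and `fill = 0` is my own addition so that a chapter with no words of one kind gets a zero rather than NA.

```r
library(dplyr)
library(tidyr)

# Made-up word-level labels: A has 4 positive and 1 negative word,
# B has 3 negative words, C has 3 positive words
toy_labels <- tibble(
  chapter   = rep(c("A", "B", "C"), times = c(5, 3, 3)),
  sentiment = c(rep("positive", 4), "negative",
                rep("negative", 3),
                rep("positive", 3)))

centered <- toy_labels %>%
  # number of positive and negative words per chapter
  count(chapter, sentiment) %>%
  # one column per sentiment, zero where a chapter has none
  spread(key = sentiment, value = n, fill = 0) %>%
  # raw score, then subtract the mean raw score across chapters
  mutate(raw = positive - negative,
         centered = raw - mean(raw))

centered
```

The raw scores are 3, -3 and 3, with a mean of 1, so the centered scores are 2, -4 and 2. Centered scores always sum to zero, which means a bar in the plot shows how a chapter compares with the average chapter rather than its absolute balance of positive and negative words.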


Code for this post can be found here:
github
