Sentiment Analysis of The Lord Of The Rings with tidytext

[This article was first published on Jakub Glinka's Blog, and kindly contributed to R-bloggers].

You got me thinking about Watson and its unprecedented flexibility in analyzing different data sources (at least according to IBM). So how difficult would it be to analyse the sentiment of one of my favorite books using R? Pretty easy, actually, all thanks to the new tidytext package by Julia Silge and David Robinson…

The tidy text format

Tidy text format is defined as a table with one-term-per-row. In short, it is just tidy textual data, which lets you use tidyverse tools to clean and transform it. More about the conceptual underpinnings of the tidytext package can be found in the online book:

http://tidytextmining.com/tidytext.html
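
As a small illustration of the one-term-per-row idea, here is a minimal sketch on two short example lines. The `toy` table and its column names are made up for this example; `unnest_tokens()` is the tidytext function that does the splitting.

```r
library(dplyr)
library(tidytext)

# Two short example lines; `toy`, `line` and `text` are illustrative names
toy <- tibble(line = 1:2,
              text = c("The road goes ever on",
                       "Not all those who wander are lost"))

# unnest_tokens() splits `text` into one lowercase word per row,
# keeping the other columns (here `line`) alongside each word
tidy_toy <- toy %>%
  unnest_tokens(word, text)

tidy_toy
```

Once the text is in this shape, ordinary dplyr verbs such as `count()` or `anti_join()` apply to words directly.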

Parsing data to the tibble format

I parsed one of the websites that offer the LOTR trilogy online. After some pretty basic operations on the HTML source, I extracted the full text along with chapter names, book parts, and book titles:

## # A tibble: 6 × 4
##                        book   part               chapter
##                       <chr>  <chr>                <fctr>
## 1 1. Fellowship of the Ring Book I A Long-expected Party
## 2 1. Fellowship of the Ring Book I A Long-expected Party
## 3 1. Fellowship of the Ring Book I A Long-expected Party
## 4 1. Fellowship of the Ring Book I A Long-expected Party
## 5 1. Fellowship of the Ring Book I A Long-expected Party
## 6 1. Fellowship of the Ring Book I A Long-expected Party
## # ... with 1 more variables: text <chr>

with the text field simply containing lines of text from the books:

## # A tibble: 6 × 1
##                                                    text
##                                                   <chr>
## 1 When Mr. Bilbo Baggins of Bag End announced that h...
## 2 celebrating his eleventy-first birthday with a par...
## 3    there was much talk and excitement in Hobbiton....
## 4 Bilbo was very rich and very peculiar, and had bee...
## 5 for sixty years, ever since his remarkable disappe...
## 6 The riches he had brought back from his travels ha...
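
The parsing code itself is not shown in this post. As a rough sketch of how such a table could be assembled, assume the scraped lines already sit in a character vector and that chapter headings carry a hypothetical `CHAPTER: ` prefix (an assumption made for this example, not the actual markup of the site):

```r
library(dplyr)

# Hypothetical input: placeholder text lines with made-up "CHAPTER: " markers
raw_lines <- c("CHAPTER: A Long-expected Party",
               "first line of text",
               "second line of text",
               "CHAPTER: The Shadow of the Past",
               "third line of text")

lines_tbl <- tibble(text = raw_lines) %>%
  # flag heading lines, then carry each chapter name down to the lines below it
  # (this assumes the very first line is a heading)
  mutate(is_heading = grepl("^CHAPTER: ", text),
         chapter = sub("^CHAPTER: ", "", text[is_heading])[cumsum(is_heading)]) %>%
  # drop the heading lines themselves
  filter(!is_heading) %>%
  # keep chapters as a factor in reading order
  mutate(chapter = factor(chapter, levels = unique(chapter))) %>%
  select(chapter, text)

lines_tbl
```

The same trick could be repeated for the book and part columns, given suitable markers.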

Sentiment analysis

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. Since LOTR is naturally divided into chapters we can apply sentiment analysis to them and plot their sentiment scores.
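
To make the "sum of word sentiments" idea concrete, here is a minimal sketch with a tiny hand-made lexicon. Both `toy_lexicon` and `toy_text` are invented for this example and are not part of tidytext or of the LOTR data.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words
toy_lexicon <- tibble(word  = c("happy", "bright", "dark", "fear"),
                      score = c(1, 1, -1, -1))

# Made-up text lines belonging to two made-up chapters
toy_text <- tibble(chapter = c("A", "A", "B"),
                   text = c("A happy and bright morning",
                            "No fear today",
                            "Dark roads and dark fear"))

chapter_scores <- toy_text %>%
  # one word per row
  unnest_tokens(word, text) %>%
  # keep only words that appear in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # chapter sentiment = sum of the scores of its words
  group_by(chapter) %>%
  summarise(score = sum(score))

chapter_scores
```

Note that "No fear today" still contributes a negative word: a purely word-level approach ignores negation.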

The three general-purpose lexicons available in the tidytext package are

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

I will use the Bing lexicon, which is simply a tibble of words, each labelled as positive or negative:

get_sentiments("bing") %>% head()
## # A tibble: 6 × 2
##         word sentiment
##        <chr>     <chr>
## 1    2-faced  negative
## 2    2-faces  negative
## 3         a+  positive
## 4   abnormal  negative
## 5    abolish  negative
## 6 abominable  negative

and this is how you run sentiment analysis the tidytext way:

library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

lotr %>%
  # split text into words
  unnest_tokens(word, text) %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  # keep only words found in the Bing lexicon, with their sentiment label
  inner_join(get_sentiments("bing"), by = "word") %>%
  # count number of negative and positive words
  count(chapter, book, sentiment) %>%
  # one column per sentiment; a missing combination becomes 0 instead of NA
  spread(key = sentiment, value = n, fill = 0) %>%
  ungroup() %>%
  # create centered score
  mutate(sentiment = positive - negative -
           mean(positive - negative)) %>%
  select(book, chapter, sentiment) %>%
  # reverse chapter levels so they read top to bottom after coord_flip()
  mutate(chapter = factor(as.character(chapter),
                          levels = rev(levels(chapter)))) %>%
  # plot
  ggplot(aes(x = chapter, y = sentiment)) +
  geom_bar(stat = "identity", aes(fill = book)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90)) +
  coord_flip() +
  ylim(-250, 250) +
  ggtitle("Centered sentiment scores",
          subtitle = "for LOTR chapters")
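
The centering step in the pipeline above subtracts the mean net score across chapters, so the bars show how each chapter compares with the average one. A minimal sketch on made-up counts (the `toy_counts` table is hypothetical, not taken from the books):

```r
library(dplyr)

# Made-up positive/negative word counts for three hypothetical chapters
toy_counts <- tibble(chapter  = c("A", "B", "C"),
                     positive = c(30, 10, 20),
                     negative = c(10, 25, 10))

centered <- toy_counts %>%
  mutate(net = positive - negative,      # net score per chapter: 20, -15, 10
         sentiment = net - mean(net))    # subtract the mean net score (5)

centered
```

By construction the centered scores sum to zero, so a chapter with a positive bar is more positive than average rather than positive in absolute terms.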

[Figure: centered sentiment scores for LOTR chapters, one horizontal bar per chapter, filled by book]

It’s pretty neat if you ask me.


Code for this post can be found here:
github
