Thrice: Breaking Down The Lyrics Word-by-Word!

[This article was first published on R by R(yo), and kindly contributed to R-bloggers].

In Part 2 we will look at the lyrical content of the band Thrice. By dividing the lyrics of each song into a one-word-per-row format, we can take a much closer look at the lyrical content at various levels!

Let’s get started!

As always let’s load the various packages we are going to be using!

<span class="c1"># Packages:</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">  </span><span class="c1"># for dplyr, tidyr, ggplot2</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">   </span><span class="c1"># for separating text into words with unnest_tokens() function</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">    </span><span class="c1"># for string detection, extraction, manipulation, etc.</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">gplots</span><span class="p">)</span><span class="w">     </span><span class="c1"># for a certain type of plots not in ggplot2</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggrepel</span><span class="p">)</span><span class="w">    </span><span class="c1"># for making sure labels don't overlap</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span><span class="w">     </span><span class="c1"># for fixing and tweaking the scales on graphs</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">  </span><span class="c1"># for arranging multiple plots into a single page</span><span class="w">
</span>

First we have to load in the data set that we finished tidying up in Part 1 (not shown here).

Let’s finally take a look at the actual lyrics of Thrice.

<span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">lyrics</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">substr</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">116</span><span class="p">)</span><span class="w">
</span>
## [1] "Image marred by self-infliction <br>  Private wars on my soul waged <br>  Heart is scarred by dual volitions <br>"

Here we are looking at the first few lines of the first song on the first album (Identity Crisis (Live version)). We can see that the lyrics are separated into lines by the <br> tag. Note that this is how the lines were separated in the source, AZlyrics.com, and may not reflect how they are separated in the album booklets (as you can see, the first two lines shown above are actually one line in the booklet).

Because of this slight discrepancy in the lines, we will first break up the lyrics column into lines to get rid of the <br> tags and then split that line column so that the data is in a one-word-per-row format. This process is called tokenizing, and we use the unnest_tokens() function from the tidytext package to restructure text data sets this way!

Using unnest_tokens() we need to enter:

  • output: the column to be created from tokenizing.
  • input: the column that gets split or tokenized.
  • token: the unit for tokenizing. The default is “words”.
  • Other inputs and options can be found on the help page: ?unnest_tokens.
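To see how the three arguments fit together before running them on the real data, here is a minimal sketch on a made-up data frame. The `toy` data set, its lyrics, and the `toy_words` name are assumptions for illustration only; the two-step split (lines first, then words) mirrors the approach described above.

```r
library(dplyr)     # tibble(), %>%
library(tidytext)  # unnest_tokens()
library(stringr)   # str_split()

# Hypothetical two-song data set; the lyrics are made up for illustration.
toy <- tibble(
  ID     = 1:2,
  lyrics = c("First line here <br> Second line", "Only one line")
)

toy_words <- toy %>%
  # Step 1: split each lyric into lines on the <br> separator
  unnest_tokens(output = line, input = lyrics,
                token = str_split, pattern = " <br> ") %>%
  # Step 2: split each line into lowercase words (the default token)
  unnest_tokens(output = word, input = line)

toy_words
```

Every other column (here just `ID`) is carried along, so each word still knows which song it came from.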
<span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="c1"># use the stringr for str_split() function to split "lyrics" on the <br> tags!</span><span class="w">

</span><span class="n">wordToken</span><span class="w"> </span><span class="o"><-</span><span class="w">  </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">line</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lyrics</span><span class="p">,</span><span class="w"> </span><span class="n">token</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_split</span><span class="p">,</span><span class="w"> </span><span class="n">pattern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">' <br>'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">   
  </span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">line</span><span class="p">)</span><span class="w"> 

</span><span class="n">glimpse</span><span class="p">(</span><span class="n">wordToken</span><span class="p">)</span><span class="w">
</span>
## Observations: 18,757
## Variables: 9
## $ ID       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ album    <fctr> Identity Crisis, Identity Crisis, Identity Crisis, I...
## $ year     <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
## $ tracknum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ title    <chr> "Identity Crisis", "Identity Crisis", "Identity Crisi...
## $ writers  <chr> "Dustin Kensrue", "Dustin Kensrue", "Dustin Kensrue",...
## $ length   <S4: Period> 2M 58S, 2M 58S, 2M 58S, 2M 58S, 2M 58S, 2M 58S...
## $ lengthS  <S4: Period> 178S, 178S, 178S, 178S, 178S, 178S, 178S, 178S...
## $ word     <chr> "image", "marred", "by", "self", "infliction", "priva...

Now we have a data set with all words separated into individual rows.

Therefore, we can count how many times each word appears throughout the lyrics!

<span class="n">countWord</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">countWord</span><span class="w">  </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 2
##     word     n
##    <chr> <int>
##  1   the   864
##  2   and   609
##  3     i   582
##  4    to   400
##  5   you   373
##  6    we   335
##  7     a   333
##  8    of   296
##  9    in   273
## 10    my   251
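If count() looks a bit magical, the sketch below shows what it is shorthand for, using a tiny made-up word column (the `toy_words` data and the `short`/`long` names are assumptions, not part of the Thrice data set).

```r
library(dplyr)

# Hypothetical one-word-per-row data, standing in for wordToken
toy_words <- tibble(word = c("the", "eyes", "the", "love", "the", "eyes"))

# count() with sort = TRUE ...
short <- toy_words %>% count(word, sort = TRUE)

# ... is shorthand for grouping, tallying, and sorting by hand
long <- toy_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

short
long
```

Both versions give one row per distinct word with its frequency in `n`, most frequent first.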

Just from looking at this, it is clear that the raw counts aren’t very informative about the content of the lyrics. Words such as “I”, “you”, “we”, “very”, and “the” tell us little about what the songs mean. This set of very common words is called “stop words”. For example:

<span class="n">data</span><span class="p">(</span><span class="s2">"stop_words"</span><span class="p">)</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">sample_stop</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">sample_n</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">

</span><span class="n">sample_stop</span><span class="w">
</span>
## # A tibble: 10 x 2
##       word  lexicon
##      <chr>    <chr>
##  1   noone    SMART
##  2   thank    SMART
##  3  hadn't snowball
##  4 several     onix
##  5    into    SMART
##  6    said     onix
##  7   think     onix
##  8   alone     onix
##  9    then snowball
## 10    best    SMART
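The lexicon column in the sample above tells us which word list each “stop word” comes from. As a rough sketch, we can check how many words each lexicon contributes and, if the full set feels too aggressive, keep just one of them. Using “snowball” here is only an example choice, and `snowball_only` is a made-up name.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# How many stop words does each lexicon contribute?
stop_words %>% count(lexicon)

# Keep a single lexicon, here "snowball", for a lighter filter
snowball_only <- stop_words %>% filter(lexicon == "snowball")
```

The rest of this post sticks with the full stop_words data set.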

Using the built-in lexicons (“onix”, “SMART”, and “snowball”) in the tidytext package we can create a new data set where we filter out these “stop words” from our word column in wordToken.

This can be done using the anti_join() function, which returns all rows from x (our original wordToken data set) that have no matching values in y (the stop_words data set) on a variable with a common name across both data sets (word).
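Here is a small sketch of that behaviour on made-up data, so the effect is easy to verify by eye. The `words`, `stops`, and `kept` tibbles are hypothetical stand-ins for wordToken, stop_words, and wordToken2.

```r
library(dplyr)

# Hypothetical stand-ins for wordToken (x) and stop_words (y)
words <- tibble(word = c("the", "eyes", "and", "love", "the"))
stops <- tibble(word = c("the", "and"), lexicon = "toy")

# Keep only the rows of `words` whose `word` has no match in `stops`
kept <- words %>% anti_join(stops, by = "word")

kept
```

Only “eyes” and “love” survive. Spelling out `by = "word"` is optional, but it silences the message dplyr prints when it has to guess the joining column.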

<span class="n">wordToken2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">anti_join</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="c1"># Take out rows of `word` in wordToken that appear in stop_words</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">ID</span><span class="p">)</span><span class="w">               </span><span class="c1"># Can also arrange by track_num, basically the same thing</span><span class="w">

</span><span class="n">countWord2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">countWord2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 2
##     word     n
##    <chr> <int>
##  1  eyes    73
##  2  love    64
##  3 light    53
##  4 blood    43
##  5  life    43
##  6  fall    37
##  7 world    35
##  8  time    32
##  9 heart    31
## 10  hold    30

With “stop words” filtered out of our data set, “eyes”, “love”, “light”, “blood”, and “life” are the most common words! We can make many more inferences about the lyrics from those than from “I”, “the”, “and”, and “to”!

Now that we have one data set with “stop words” and one without, we can compare them to really emphasize the importance of filtering out “stop words” from any text data:

<span class="c1"># graph of most common words (including stop words) </span><span class="w">
</span><span class="n">one</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">countWord</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"darkgreen"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.75</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Comparison of 'Most Common Words'"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"With 'stop words'"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Frequency"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pretty_breaks</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">panel.grid.major.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">),</span><span class="w">
        </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">face</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bold"</span><span class="p">),</span><span class="w">
        </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w">

</span><span class="c1"># graph of most common words (no stop words) </span><span class="w">
</span><span class="n">two</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">countWord2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"darkgreen"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.75</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"No 'stop words'"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Frequency"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pretty_breaks</span><span class="p">())</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">panel.grid.major.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">),</span><span class="w">
        </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">face</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bold"</span><span class="p">))</span><span class="w">

</span><span class="n">grid.arrange</span><span class="p">(</span><span class="n">one</span><span class="p">,</span><span class="w"> </span><span class="n">two</span><span class="p">)</span><span class="w">
</span>

You can clearly see the difference between the data sets!

The frequency scales differ greatly between the two plots (the top “stop word”, “the”, appears 864 times, while the top remaining word, “eyes”, appears only 73 times). This shows how individually meaningless “stop words” such as “the”, “and”, “to”, and “a” can really disrupt our analysis. The plot without “stop words” gives us a much clearer idea of the most common and meaningful words in Thrice’s lyrics!

Another way to see this effect is by visualizing our data in a different way, using word clouds!

<span class="n">library</span><span class="p">(</span><span class="n">wordcloud</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">),</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">

</span><span class="n">wordcloud</span><span class="p">(</span><span class="n">words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">countWord</span><span class="o">$</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">freq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">countWord</span><span class="o">$</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">random.order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w"> 
          </span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dark2"</span><span class="p">),</span><span class="w"> </span><span class="n">use.r.layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">wordcloud</span><span class="p">(</span><span class="n">countWord2</span><span class="o">$</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">countWord2</span><span class="o">$</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">random.order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w">
          </span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dark2"</span><span class="p">),</span><span class="w"> </span><span class="n">use.r.layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span>

With the word cloud visualization, we can really see how the “stop words” in the left cloud obscure or “crowd out” all of the other, more meaningful words due to the sheer number of “the”s, “you”s, and “to”s that appear in the lyrics.

Data exploration

Now that we’ve spread out each word into its own row, let’s take a closer look at our new data sets!

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">n_distinct</span><span class="p">()</span><span class="w">
</span>
## [1] 100
<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">n_distinct</span><span class="p">()</span><span class="w">
</span>
## [1] 100

Both wordToken and wordToken2 give the number of songs as 100… but wait! In Part 1 we checked that there were a total of 103 songs. The instrumental songs are not included in these “tokenized” data sets simply because they do not have any words, so no rows are created for them!
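One way to confirm which songs went missing is to anti-join the original song list against the tokenized one. The sketch below uses made-up data: the `songs` and `tokens` tibbles and their titles are hypothetical stand-ins for df and wordToken.

```r
library(dplyr)

# Hypothetical song list: one row per song, as in df
songs <- tibble(title = c("Song A", "Instrumental B", "Song C"))

# Hypothetical tokenized table: only songs with lyrics produce rows
tokens <- tibble(
  title = c("Song A", "Song A", "Song C"),
  word  = c("eyes", "love", "light")
)

# Songs that have no rows at all in the tokenized table
missing_songs <- songs %>% anti_join(tokens, by = "title")

missing_songs
```

On the real data, the same pattern applied to df and wordToken should list the songs that dropped out, assuming title identifies each song uniquely.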

<span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">num_songs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="c1"># 103 songs in total, as each row = 1 song in original data set</span><span class="w">
</span>
##   num_songs
## 1       103

Let’s look at the exact number of unique words in Thrice’s lyrics. As almost 100% of Thrice’s songs are written by Dustin Kensrue, we’ll be able to see just how extensive his vocabulary is!

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">n_distinct</span><span class="p">()</span><span class="w">
</span>
## [1] 2480

2480! Not bad. Let’s take out all the “stop words” though…

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">n_distinct</span><span class="p">()</span><span class="w">
</span>
## [1] 2095

2095 unique non-“stop word” words in Thrice’s lyrics! This also means that filtering removed 385 (2480 - 2095) distinct “stop words” from wordToken to produce wordToken2.

Lyrics exploration

Now let’s create a new data set called WordsPerSong and use it to draw a histogram of the distribution of songs by the number of words (including “stop words”).

<span class="c1"># WordsPerSong</span><span class="w">
</span><span class="n">WordsPerSong</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">wordcounts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">    </span><span class="c1"># each row = 1 word</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">wordcounts</span><span class="p">))</span><span class="w">

</span><span class="n">WordsPerSong</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wordcounts</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"darkgreen"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">WordsPerSong</span><span class="o">$</span><span class="n">wordcounts</span><span class="p">),</span><span class="w"> 
             </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">linetype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pretty_breaks</span><span class="p">(),</span><span class="w"> </span><span class="n">expand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pretty_breaks</span><span class="p">(</span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="n">expand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">410</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Total # of Words'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'# of Songs'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Distribution of Songs by Number of Words \n (Dashed red line: median)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">panel.grid.minor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w">
</span>

Unfortunately, both the wordToken and wordToken2 data sets drop the instrumentals altogether, because unnest_tokens() creates no rows for songs without lyrics. As a result, the median and mean word counts computed from either data set will be slightly off.
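To make the effect on the median concrete, here is a minimal sketch on made-up data. The `songs` and `tokens` tables and their contents are hypothetical stand-ins for df and wordToken, and treating an instrumental as a zero-word song is an assumption, not something the analysis above does.

```r
library(dplyr)

# Hypothetical toy data: the full song list, including one instrumental
songs <- tibble(title = c("Song A", "Song B", "Instrumental C"))

# One-word-per-row token table; the instrumental contributes no rows at all
tokens <- tibble(
  title = c("Song A", "Song A", "Song A", "Song B"),
  word  = c("light", "in", "dark", "fire")
)

# Word counts from the token table alone (the instrumental is missing)
counts_tokens_only <- tokens %>%
  count(title, name = "num_word")

# Join back onto the song list and treat missing counts as zero words
counts_all <- songs %>%
  left_join(counts_tokens_only, by = "title") %>%
  mutate(num_word = coalesce(num_word, 0L))

median(counts_tokens_only$num_word)  # median over songs with lyrics only
median(counts_all$num_word)          # median with the instrumental counted as 0
```

Whether instrumentals belong in the median at all depends on the question being asked; the join simply makes the choice explicit.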

Let’s count the number of songs on each album. We did this in Part 1 with df; this time let’s use the wordToken2 data set that we just created:

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">album</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">num_songs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_distinct</span><span class="p">(</span><span class="n">title</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">num_songs</span><span class="p">))</span><span class="w">
</span>
## # A tibble: 11 x 2
##                                 album num_songs
##                                <fctr>     <int>
##  1             The Illusion Of Safety        13
##  2        The Artist In The Ambulance        12
##  3                            Vheissu        11
##  4                        Major Minor        11
##  5                    Identity Crisis        10
##  6                            Beggars        10
##  7 To Be Everywhere And To Be Nowhere        10
##  8             The Alchemy Index Fire         6
##  9              The Alchemy Index Air         6
## 10            The Alchemy Index Earth         6
## 11            The Alchemy Index Water         5

Let’s dig deeper: what about the number of words per song? We need to use wordToken rather than df or wordToken2, because the stop words should count toward the total word sum.

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">num_word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">num_word</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 3
## # Groups:   title [10]
##                          title                              album num_word
##                          <chr>                             <fctr>    <int>
##  1                  The Weight                            Beggars      383
##  2                 Black Honey To Be Everywhere And To Be Nowhere      365
##  3                     Wake Up To Be Everywhere And To Be Nowhere      297
##  4                   Under Par                    Identity Crisis      293
##  5                Stay With Me To Be Everywhere And To Be Nowhere      292
##  6                The Arsonist             The Alchemy Index Fire      284
##  7          The Sky Is Falling              The Alchemy Index Air      281
##  8           Blood On The Sand To Be Everywhere And To Be Nowhere      280
##  9 The Artist In The Ambulance        The Artist In The Ambulance      279
## 10                    Daedalus              The Alchemy Index Air      273
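As an aside, the group_by() + summarize(n()) + arrange() pattern above can be written more compactly with count(). The sketch below uses a made-up `tokens` table (hypothetical titles and albums) to check that both routes give the same result; the `.groups` and `name` arguments assume a reasonably recent dplyr.

```r
library(dplyr)

# Hypothetical one-word-per-row data
tokens <- tibble(
  title = c("A", "A", "A", "B", "B", "C"),
  album = c("X", "X", "X", "X", "X", "Y"),
  word  = c("we", "are", "here", "so", "far", "gone")
)

# The long way, as in the post
long_way <- tokens %>%
  group_by(title, album) %>%
  summarize(num_word = n(), .groups = "drop") %>%
  arrange(desc(num_word))

# The short way: count() groups, tallies and sorts in one call
short_way <- tokens %>%
  count(title, album, name = "num_word", sort = TRUE)

all.equal(as.data.frame(long_way), as.data.frame(short_way))
```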

How about words per album? Let’s also turn this info into a bar graph!

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">album</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">num_word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">num_word</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">num_word</span><span class="p">),</span><span class="w"> </span><span class="n">num_word</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">num_word</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">expand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_gradient</span><span class="p">(</span><span class="n">low</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#a1d99b"</span><span class="p">,</span><span class="w"> </span><span class="n">high</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#006d2c"</span><span class="p">,</span><span class="w"> </span><span class="n">guide</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">),</span><span class="w"> </span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Number of Words"</span><span class="p">)</span><span class="w">
</span>

Individually, the Alchemy Index albums have the fewest words, as each has only six songs. If they were combined into their actual album sets (Volume 1: Fire & Water, Volume 2: Earth & Air), they would probably have more words than Identity Crisis.
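One way to test that guess would be to recode the four Alchemy Index albums into their two volumes before summing. The sketch below shows the idea with made-up word totals (the numbers are placeholders, not the real counts), and the `album_set` column and volume labels are names introduced here for illustration.

```r
library(dplyr)
library(stringr)

# Placeholder totals for illustration only, not the real word counts
album_words <- tibble(
  album = c("The Alchemy Index Fire", "The Alchemy Index Water",
            "The Alchemy Index Earth", "The Alchemy Index Air",
            "Identity Crisis"),
  num_word = c(100L, 80L, 90L, 110L, 150L)
)

combined <- album_words %>%
  mutate(
    album = as.character(album),  # needed if album is stored as a factor
    album_set = case_when(
      str_detect(album, "Alchemy Index (Fire|Water)") ~ "The Alchemy Index Vol. 1: Fire & Water",
      str_detect(album, "Alchemy Index (Earth|Air)")  ~ "The Alchemy Index Vol. 2: Earth & Air",
      TRUE                                            ~ album
    )
  ) %>%
  group_by(album_set) %>%
  summarize(num_word = sum(num_word)) %>%
  arrange(desc(num_word))

combined
```

With the real per-album totals in place of the placeholders, the same pipeline would show directly where the combined volumes rank.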

Some more exploration with dplyr verbs!

Let’s use filter() to look at a specific album or specific song.

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">album</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Vheissu"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">num_word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w">
</span>
##   num_word
## 1     2155
<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"The Weight"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">num_word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> 
</span>
##   num_word
## 1      383

The Weight was the first Thrice song I listened to, in my friend’s dorm room back in college, so it has a lot of sentimental value for me! So let’s look at the most common words in the lyrics for The Weight!

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"The Weight"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">head</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 5 x 3
## # Groups:   title [1]
##        title    word     n
##        <chr>   <chr> <int>
## 1 The Weight   won’t    15
## 2 The Weight   leave     9
## 3 The Weight    love     6
## 4 The Weight abandon     4
## 5 The Weight burning     4

From the Top 5 most common words, “won’t”, “leave”, “love”, “abandon”, “burning”, it is clear that this song is about love and commitment. Indeed, the “won’t” in this song is only used in a positive sense, such as “I won’t abandon you” and “I won’t leave you high and dry” reinforcing Dustin’s message that love is a huge commitment; the title of the song, The Weight, actually refers to the gravity and seriousness of that commitment.

We can also combine dplyr with other functions, such as various stringr functions, to find specific words! Let’s take a closer look at one of the most common words that we found, “light”, and check the total number of times it appears across all the lyrics.

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
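  </span><span class="n">pull</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">str_count</span><span class="p">(</span><span class="s2">"light"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 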
  </span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span>
## [1] 87

Across all the songs in Thrice’s discography, “light” shows up 87 times! Note that str_count() matches the pattern anywhere inside a word, so this total also includes words such as “daylight”.
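How much the matching rule matters is easy to see on a handful of made-up words. This is only a sketch: the `words` vector is hypothetical, and the anchored pattern is one of several reasonable ways to define a whole-word match.

```r
library(stringr)

# Hypothetical tokens, already split into one word each
words <- c("light", "daylight", "lights", "night", "lightning")

# Substring match: counts every token that contains "light" anywhere
substring_total <- sum(str_count(words, "light"))

# Exact match: only the token "light" itself
exact_total <- sum(words == "light")

# Anchored pattern: "light" or "lights", but not "daylight" or "lightning"
anchored_total <- sum(str_detect(words, "^lights?$"))

c(substring = substring_total, exact = exact_total, anchored = anchored_total)
```

Which of the three is the right count depends on whether “daylight” and friends should be treated as instances of the same theme.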

Now we use mutate() to create a new column that counts “light” in each word, and then summarize() to total those counts for each song.

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">light</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="s2">"light"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">total_light</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">light</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">total_light</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">head</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 5 x 3
## # Groups:   title [5]
##                              title                              album
##                              <chr>                             <fctr>
## 1 Between The End And Where We Lie                            Vheissu
## 2                        Music Box                            Vheissu
## 3      The Artist In The Ambulance        The Artist In The Ambulance
## 4                       The Window To Be Everywhere And To Be Nowhere
## 5      A Song For Milly Michaelson              The Alchemy Index Air
## # ... with 1 more variables: total_light <int>

Let’s look at the proportion of “light” out of all the words in Thrice’s lyrics!

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">light</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="s2">"light"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">sum</span><span class="p">(),</span><span class="w">
            </span><span class="n">num_word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> 
            </span><span class="n">prop_light</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">light</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">num_word</span><span class="p">))</span><span class="w">
</span>
##   light num_word  prop_light
## 1    87    18757 0.004638268

Even one of the most common words, “light”, accounts for only 0.46% of all the words in the lyrics of Thrice’s songs!
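The percentage is just the two numbers from the output above divided and rescaled, which can be checked in a couple of lines:

```r
# Values taken from the summarize() output above
light    <- 87
num_word <- 18757

prop_light <- light / num_word

# Express the proportion as a percentage with two decimals
pct_light <- sprintf("%.2f%%", 100 * prop_light)
pct_light
```

Keep in mind that the numerator counts every word containing “light”, while the denominator includes the stop words.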

What about the most frequent word in a specific song (with and without “stop words”)?

<span class="n">wordToken</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 3
## # Groups:   title [9]
##                     title  word     n
##                     <chr> <chr> <int>
##  1            Black Honey     i    48
##  2 Image Of The Invisible   the    36
##  3        All That's Left    we    35
##  4 Image Of The Invisible    we    27
##  5           Yellow Belly   you    25
##  6                Wake Up    we    24
##  7               Promises    we    23
##  8             The Weight     i    22
##  9                Blinded     i    21
## 10          In Your Hands     i    21

The most common words seem to mainly be personal pronouns, along with “the”.

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 3
## # Groups:   title [8]
##                               title      word     n
##                               <chr>     <chr> <int>
##  1                      Black Honey      i’ll    20
##  2                          Wake Up     gotta    17
##  3                          Wake Up      wake    17
##  4                     Yellow Belly     don’t    16
##  5                       The Weight     won’t    15
##  6           Image Of The Invisible     image    13
##  7           Image Of The Invisible invisible    13
##  8          The Earth Isn't Humming      fall    13
##  9                Blood On The Sand      sick    12
## 10 Between The End And Where We Lie  daylight    11

“I’ll” and “I” top their respective lists, and both come from the song Black Honey, a very political song that is an allegory for the meddling foreign policy of the United States. The constant appearance of “I”, “I’ll”, “I’ve” throughout the song highlights the very selfish, arrogant, and oblivious nature of the protagonist, who is aggressively seeking to obtain the “black honey”, referring to the petroleum of Middle Eastern countries.
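A side note on why contractions such as “i’ll”, “don’t” and “won’t” still appear in a list with stop words removed. One plausible explanation (an assumption, since the code that builds wordToken2 is not shown in this section) is that the lyrics use curly apostrophes while the stop word list spells the same words with straight ones, so the two never match. The sketch below reproduces the effect with a hypothetical mini stop list and made-up tokens.

```r
library(dplyr)
library(stringr)

# Hypothetical mini stop list, written with straight apostrophes
stop_list <- tibble(word = c("i", "i'll", "the", "won't"))

# Made-up tokens; \u2019 is the curly apostrophe
tokens <- tibble(word = c("i\u2019ll", "swarm", "the", "won\u2019t", "bees"))

# Curly-apostrophe contractions do not match the stop list and slip through
naive <- tokens %>%
  anti_join(stop_list, by = "word")

# Normalise curly apostrophes to straight ones first, then filter
cleaned <- tokens %>%
  mutate(word = str_replace_all(word, "\u2019", "'")) %>%
  anti_join(stop_list, by = "word")

naive$word
cleaned$word
```

For this particular song, keeping “i’ll” is arguably a happy accident, since the repetition is part of the point.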

<span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Black Honey"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span>
## # A tibble: 10 x 2
##          word     n
##         <chr> <int>
##  1       i’ll    20
##  2       bees     5
##  3       hand     5
##  4      swarm     5
##  5   swinging     5
##  6       till     5
##  7    they’re     4
##  8       time     4
##  9 understand     4
## 10      don’t     3

In second place for this song is “bees”. In this song the “bees” and “hornets” symbolize the inhabitants of the Middle Eastern countries that are trampled in the protagonist’s pursuit of the “black honey”. It’s a really great song (have a listen here), my second favorite off the album after Hurricane.

Back to the overall word count: the appearance of “image” and “invisible” from the song Image of the Invisible is more straightforward, as the phrase is shouted out repeatedly during the chorus. Most of that song is actually Thrice screaming that title phrase!

One thing to keep in mind when looking at this data is that it can be skewed toward phrases that are repeated within a song, like the chorus! In other lyrics analyses I’ve seen, people have tried to find lyrics that don’t repeat the chorus. However, most lyrics websites aren’t well moderated, or have many different contributors with different input styles posting lyrics for a single artist, so this can be a bit tricky.
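If the lyrics are available one line per row, a rough way to dampen the chorus effect is to drop repeated lines within a song before splitting into words. This is not what the analysis here does; it is a sketch under the assumption of a hypothetical `lyrics` table with `title` and `line` columns and made-up lines, and it uses tidyr's separate_rows() as a simple stand-in for unnest_tokens().

```r
library(dplyr)
library(tidyr)

# Made-up lines for a hypothetical song; the chorus line appears twice
lyrics <- tibble(
  title = "Toy Song",
  line  = c("chorus line one", "verse line", "chorus line one")
)

# Word counts with every line kept
full_counts <- lyrics %>%
  separate_rows(line, sep = " ") %>%
  count(title, word = line, sort = TRUE)

# Word counts after dropping exact duplicate lines within each song
dedup_counts <- lyrics %>%
  distinct(title, line) %>%
  separate_rows(line, sep = " ") %>%
  count(title, word = line, sort = TRUE)

full_counts
dedup_counts
```

Exact matching only catches choruses that were transcribed identically each time, which is exactly the inconsistency described above.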

Creating nested data frames for storing plots for each album.

Now let’s try to create plots for the most frequent words for each album. To do this we need to create a “nested” data set. Basically, the “data” column will contain the specific list of the most common words for each individual album (row).

<span class="c1"># most frequent unigrams per album: ####</span><span class="w">

</span><span class="n">word_count_nested</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wordToken2</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">top_n</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">count</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">album</span><span class="p">,</span><span class="w"> </span><span class="n">desc</span><span class="p">(</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">nest</span><span class="p">()</span><span class="w">  
</span>

Let’s take a look at the individual elements of our new “data” column!

<span class="n">word_count_nested</span><span class="o">$</span><span class="n">data</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span>
## # A tibble: 8 x 3
##    word count  sort
##   <chr> <int> <lgl>
## 1 heart    12  TRUE
## 2  eyes     8  TRUE
## 3  life     6  TRUE
## 4 light     6  TRUE
## 5   cry     5  TRUE
## 6 faith     5  TRUE
## 7  soul     5  TRUE
## 8  true     5  TRUE
<span class="n">word_count_nested</span><span class="o">$</span><span class="n">data</span><span class="p">[[</span><span class="m">5</span><span class="p">]]</span><span class="w">
</span>
## # A tibble: 5 x 3
##    word count  sort
##   <chr> <int> <lgl>
## 1  free    15  TRUE
## 2  burn    13  TRUE
## 3  send    11  TRUE
## 4  fire    10  TRUE
## 5 flame     9  TRUE

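The most common word in the first list (Album = Identity Crisis) is “heart”, and in the fifth list (Album = AI: Fire) it is “free”. The extra sort column, which is TRUE on every row, is a side effect of passing sort = TRUE to summarize(): that argument belongs to count(), so summarize() simply treats it as a new column. The only problem with the top_n() function is that if there are ties, then the total number of rows will be bigger than the specified n, as with Identity Crisis above.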
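The tie behaviour can be reproduced with the Identity Crisis counts shown above. The sketch also shows one possible alternative, slice_max() with with_ties = FALSE, which assumes dplyr 1.0.0 or later (newer than the version this post appears to have been written with).

```r
library(dplyr)

# Word counts for Identity Crisis, copied from the output above
counts <- tibble(
  word  = c("heart", "eyes", "life", "light", "cry", "faith", "soul", "true"),
  count = c(12L, 8L, 6L, 6L, 5L, 5L, 5L, 5L)
)

# top_n() keeps every row tied with the fifth-largest value
with_ties <- counts %>%
  top_n(5, wt = count)

# slice_max() can be told to cut off at exactly n rows
exactly_five <- counts %>%
  slice_max(count, n = 5, with_ties = FALSE)

nrow(with_ties)
nrow(exactly_five)
```

Cutting ties is somewhat arbitrary, since which of the tied words survives depends on row order, so keeping all eight may well be the more honest choice for a plot.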

Now we use the data to create a plot for each album with the map2() function, which iterates over the data and album columns in parallel, builds a plot from each album’s data, and stores the result in its own column, plot, just like the nested tables are stored in data.
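Before the real thing, here is the same nest-then-map2() pattern on a tiny made-up table, as a self-contained sketch. The `toy_counts` data, the album names and the plot styling are all placeholders and are much simpler than the plots built below.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

# Made-up counts for two hypothetical albums
toy_counts <- tibble(
  album = c("Album A", "Album A", "Album B", "Album B"),
  word  = c("light", "heart", "fire", "burn"),
  count = c(6L, 12L, 10L, 13L)
)

toy_nested <- toy_counts %>%
  group_by(album) %>%
  nest() %>%     # one row per album, with its words in the "data" list-column
  ungroup() %>%
  mutate(plot = map2(data, album,
                     # .x is one album's data, .y is that album's name
                     ~ ggplot(.x, aes(reorder(word, count), count)) +
                       geom_col() +
                       coord_flip() +
                       labs(title = .y, x = NULL, y = "Count")))

# toy_nested$plot[[1]] would draw the plot for the first album
toy_nested
```

Each element of the plot column is an ordinary ggplot object, so it can be printed, modified or saved one album at a time.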

<span class="n">word_count_nested</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">word_count_nested</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map2</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">album</span><span class="p">,</span><span class="w"> 
                     </span><span class="o">~</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
           </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">),</span><span class="w"> </span><span class="n">count</span><span class="p">),</span><span class="w"> 
                    </span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.65</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
           </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pretty_breaks</span><span class="p">(</span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><...

To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).
