Thrice: Breaking Down The Lyrics Word-by-Word!

Posted on October 9, 2017 by R by R(yo) in R bloggers | 0 Comments

[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In Part 2 we will look at the lyrical content of the band, Thrice. By dividing the lyrics of each song into a single-word-per-row format, we can take a much closer look at the the lyrical content at various levels!

Let’s get started!

As always let’s load the various packages we are going to be using!

# Packages:
library(tidyverse)  # for dplyr, tidyr, ggplot2
library(tidytext)   # for separating text into words with unnest_tokens() function
library(stringr)    # for string detection, extraction, manipulation, etc.

library(gplots)     # for a certain type of plots not in ggplot2
library(ggrepel)    # for making sure labels don't overlap
library(scales)     # for fixing and tweaking the scales on graphs
library(gridExtra)  # for arranging multiple plots into a single page

First we have to load in the data set that we finished tidying up in Part 1 (not shown here).

Let’s finally take a look at the actual lyrics of Thrice.

df %>% 
  select(lyrics) %>% 
  substr(4, 116)

## [1] "Image marred by self-infliction <br>  Private wars on my soul waged <br>  Heart is scarred by dual volitions <br>"

Here we are looking at the first few lines of the first song in the first album (Identity Crisis (Live version)). We can see that the lyrics are separated into lines by the
tag. Note that this is how the lines were separated from the source, AZlyrics.com, and may not reflect how it is separated in the album booklets (as you can see, the first two lines shown above are actually one in the booklet).

For the purposes of this analysis and the slight discrepancy in the lines we will first break up the lyrics column into lines to get rid of the <br> tags and then split that line column so that the data is in the one-word-per-row format. This process is called tokenizing and we use the unnest_tokens() function in the tidytext package for restructuring text data sets!

Using unnest_tokens() we need to: – Enter in the output: the column to be created from tokenizing. – Enter in the input: the column that gets split or tokenized. – Enter in the token: the unit for tokenizing. Default is by “words”.

Other inputs and options can be found by looking at the help page: ?unnest_tokens.

library(stringr)
# use the stringr for str_split() function to split "lyrics" on the <br> tags!

wordToken <-  df %>%
  unnest_tokens(output = line, input = lyrics, token = str_split, pattern = ' <br>') %>%   
  unnest_tokens(output = word, input = line) 

glimpse(wordToken)

## Observations: 18,757
## Variables: 9
## $ ID       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ album    <fctr> Identity Crisis, Identity Crisis, Identity Crisis, I...
## $ year     <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
## $ tracknum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ title    <chr> "Identity Crisis", "Identity Crisis", "Identity Crisi...
## $ writers  <chr> "Dustin Kensrue", "Dustin Kensrue", "Dustin Kensrue",...
## $ length   <S4: Period> 2M 58S, 2M 58S, 2M 58S, 2M 58S, 2M 58S, 2M 58S...
## $ lengthS  <S4: Period> 178S, 178S, 178S, 178S, 178S, 178S, 178S, 178S...
## $ word     <chr> "image", "marred", "by", "self", "infliction", "priva...

Now we have a data set with all words separated into individual rows.

Therefore, we can count how many times each word appears throughout the lyrics!

countWord <- wordToken %>% count(word, sort=TRUE)
countWord  %>% head(10)

## # A tibble: 10 x 2
##     word     n
##    <chr> <int>
##  1   the   864
##  2   and   609
##  3     i   582
##  4    to   400
##  5   you   373
##  6    we   335
##  7     a   333
##  8    of   296
##  9    in   273
## 10    my   251

Just from looking at this it is clear that this isn’t very informative about the content of lyrics. Words such as “I”, “you”, “we”, “very”, “the” aren’t very useful for analyzing the meaningfulness of our data set. These very common set of words are called “stop words”. For example:

data("stop_words")

set.seed(1)
sample_stop <- stop_words %>% sample_n(10)

sample_stop

## # A tibble: 10 x 2
##       word  lexicon
##      <chr>    <chr>
##  1   noone    SMART
##  2   thank    SMART
##  3  hadn't snowball
##  4 several     onix
##  5    into    SMART
##  6    said     onix
##  7   think     onix
##  8   alone     onix
##  9    then snowball
## 10    best    SMART

Using the built-in lexicons (“onix”, “SMART”, and “snowball”) in the tidytext package we can create a new data set where we filter out these “stop words” from our word column in wordToken.

This can be done by using anti_join() function which returns all rows from x (our original wordtoken data set) where there are no matching values in y (stop_words data set) on a variable with a common name across both data sets (word).

wordToken2 <- wordToken %>% 
  anti_join(stop_words) %>% # Take out rows of `word` in wordToken that appear in stop_words
  arrange(ID)               # Can also arrange by track_num, basically the same thing

countWord2 <- wordToken2 %>% count(word, sort = TRUE)

countWord2 %>% head(10)

## # A tibble: 10 x 2
##     word     n
##    <chr> <int>
##  1  eyes    73
##  2  love    64
##  3 light    53
##  4 blood    43
##  5  life    43
##  6  fall    37
##  7 world    35
##  8  time    32
##  9 heart    31
## 10  hold    30

With “stop words” being filtered out of our data set, “eyes”, “love”, “light”, “blood”, and “life” are the most common! We can make much more inferences about the lyrics from those compared to “I”, “the”, “and”, and “to”!

Now that we have one data set with “stop words” and one without, we can compare them to really emphasize the importance of filtering out “stop words” from any text data:

# graph of most common words (including stop words) 
one <- countWord %>% head(10) %>% 
  ggplot(aes(reorder(word, n), n)) + 
  geom_bar(stat = "identity", fill = "darkgreen", alpha = 0.75) +
  ggtitle("Comparison of 'Most Common Words'") +
  labs(x = "With 'stop words'", y = "Frequency") +
  scale_y_continuous(breaks = pretty_breaks(5)) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.major.x = element_line(size = 1.25),
        axis.text.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5))

# graph of most common words (no stop words) 
two <- countWord2 %>% head(10) %>% 
  ggplot(aes(reorder(word, n), n)) + 
  geom_bar(stat = "identity", fill = "darkgreen", alpha = 0.75) +
  labs(x = "No 'stop words'", y = "Frequency") +
  scale_y_continuous(breaks = pretty_breaks()) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.major.x = element_line(size = 1.25),
        axis.text.x = element_text(size = 12, face = "bold"))

grid.arrange(one, two)

You can clearly see the difference between the data sets!

The fact that the scales for frequency are very different between the plots shows how individually meaningless “stop words” such as “the”, “and”, “to”, and “a” can really disrupt our analysis. The plot without “stop words” gives us a much clearer idea of the most common and meaningful words in Thrice’s lyrics!

Another way to see this effect is by visualizing our data in a different way, using word clouds!

library(wordcloud)
layout(matrix(c(1,2),1,2, byrow = TRUE))

wordcloud(words = countWord$word, freq = countWord$n, random.order = FALSE, max.words = 300, 
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)

wordcloud(countWord2$word, countWord2$n, random.order = FALSE, max.words = 300,
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)

With the word cloud visualization, we can really tell how the “stop words” in the left cloud obscures or “crowds out” all of the other more meaningful words due to the sheer amount of “the”s, “you”s, and “to”s that appear in the lyrics text.

Data exploration

Now that we’ve spread out each word into it’s own row, let’s take a closer look at our new data sets!

wordToken %>% select(title) %>% n_distinct()

## [1] 100

wordToken2 %>% select(title) %>% n_distinct()

## [1] 100

Both wordToken and wordToken2 give the number of songs at 100… but wait! In Part 1 we checked that there were a total of 103 songs, in these “tokenized” data sets the instrumental songs were not included simply because as they do not have any words, so there is no row for those instrumentals to exist in these data sets!

df %>% summarise(num_songs = n()) # 103 songs in total, as each row = 1 song in original data set

##   num_songs
## 1       103

Let’s look at the exact number of unique words in Thrice’s lyrics. As almost 100% of Thrice’s songs are written by Dustin Kensrue, we’ll be able to see just how extensive his vocabulary is!

wordToken %>% select(word) %>% n_distinct()

## [1] 2480

2480! Not bad, let’s take out all the “stop words” though…

wordToken2 %>% select(word) %>% n_distinct()

## [1] 2095

2095 unique and non-“stop word” words in Thrice’s lyrics! Which also means in wordToken2 we took out around 400 distinct “stop words” out from wordToken.

Lyrics exploration

Now let’s create a new data set called WordsPerSong to create a histogram of the distribution of songs by the number of words (including “stop words”).

# WordsPerSong
WordsPerSong <- wordToken %>% 
  group_by(title) %>% 
  summarize(wordcounts = n()) %>%    # each row = 1 word
  arrange(desc(wordcounts))

WordsPerSong %>% 
  ggplot(aes(x = wordcounts)) + 
  geom_histogram(bins = 50, color = "white", fill = "darkgreen") +
  geom_vline(xintercept = median(WordsPerSong$wordcounts), 
             color = "red", linetype = "dashed", size = 1.25) +
  scale_y_continuous(breaks = pretty_breaks(), expand = c(0, 0), limits = c(0, 12)) +
  scale_x_continuous(breaks = pretty_breaks(10), expand = c(0, 0), limits = c(0, 410)) +
  xlab('Total # of Words') +
  ylab('# of Songs') +
  labs(title = 'Distribution of Songs by Number of Words \n (Dashed red line: median)') + 
  theme_bw() +
  theme(panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5))

The wordToken and wordToken2 data sets unfortunately filters out the instrumentals all together, as the rows for the instrumentals are not created by the unnest_tokens() function. Therefore, the median and mean values for word count will be slightly off in both the wordToken and wordToken2 data sets.

Count the number of songs for each album, we did this in Part 1 with df, this time let’s use the wordToken2 data set that we just created:

wordToken2 %>% 
  group_by(album) %>% 
  summarize(num_songs = n_distinct(title)) %>% 
  arrange(desc(num_songs))

## # A tibble: 11 x 2
##                                 album num_songs
##                                <fctr>     <int>
##  1             The Illusion Of Safety        13
##  2        The Artist In The Ambulance        12
##  3                            Vheissu        11
##  4                        Major Minor        11
##  5                    Identity Crisis        10
##  6                            Beggars        10
##  7 To Be Everywhere And To Be Nowhere        10
##  8             The Alchemy Index Fire         6
##  9              The Alchemy Index Air         6
## 10            The Alchemy Index Earth         6
## 11            The Alchemy Index Water         5

Let’s dig deeper, what about the number of words per song? We need to use wordToken instead of df or wordToken2 as “stop_words” should be included for the total word sum.

wordToken %>% 
  select(title, album, word) %>% 
  group_by(title, album) %>% 
  summarize(num_word = n()) %>% 
  arrange(desc(num_word)) %>%
  head(10)

## # A tibble: 10 x 3
## # Groups:   title [10]
##                          title                              album num_word
##                          <chr>                             <fctr>    <int>
##  1                  The Weight                            Beggars      383
##  2                 Black Honey To Be Everywhere And To Be Nowhere      365
##  3                     Wake Up To Be Everywhere And To Be Nowhere      297
##  4                   Under Par                    Identity Crisis      293
##  5                Stay With Me To Be Everywhere And To Be Nowhere      292
##  6                The Arsonist             The Alchemy Index Fire      284
##  7          The Sky Is Falling              The Alchemy Index Air      281
##  8           Blood On The Sand To Be Everywhere And To Be Nowhere      280
##  9 The Artist In The Ambulance        The Artist In The Ambulance      279
## 10                    Daedalus              The Alchemy Index Air      273

How about words per album? Let’s also turn this info into a bar graph!

wordToken %>% 
  select(album, word) %>% 
  group_by(album) %>% 
  summarize(num_word = n()) %>% 
  arrange(desc(num_word)) %>% 
  ggplot(aes(reorder(album, num_word), num_word, fill = num_word)) + 
  geom_bar(stat = "identity") + 
  scale_y_continuous(expand = c(0.01, 0)) +
  scale_fill_gradient(low = "#a1d99b", high = "#006d2c", guide = FALSE) +
  coord_flip() +
  theme_bw() +
  theme(axis.text.y = element_text(size = 8), axis.title.y = element_blank()) +
  ylab("Number of Words")

Individually, the Alchemy Index albums are the lowest as they each have only six songs each, if they were combined into their actual album sets (Volume 1: Fire & Water, Volume 2: Earth & Air), they would probably have more words than Identity Crisis.

Some more exploration with `dplyr` verbs!

Let’s use filter() to look at a specific album or specific song.

wordToken %>% 
  filter(album == "Vheissu") %>% 
  summarize(num_word = n())

##   num_word
## 1     2155

wordToken %>% 
  filter(title == "The Weight") %>% 
  summarize(num_word = n()) 

##   num_word
## 1      383

The Weight was the first Thrice song I listened to in my friend’s dorm room back in college, so it has quite a sentimental value to me! So let’s look at the most common words in the lyrics for The Weight!

wordToken2 %>% 
  filter(title == "The Weight") %>% 
  group_by(title) %>% 
  count(word) %>% 
  arrange(desc(n)) %>% 
  head(5)

## # A tibble: 5 x 3
## # Groups:   title [1]
##        title    word     n
##        <chr>   <chr> <int>
## 1 The Weight   wont    15
## 2 The Weight   leave     9
## 3 The Weight    love     6
## 4 The Weight abandon     4
## 5 The Weight burning     4

From the Top 5 most common words, “won’t”, “leave”, “love”, “abandon”, “burning”, it is clear that this song is about love and commitment. Indeed, the “won’t” in this song is only used in a positive sense, such as “I won’t abandon you” and “I won’t leave you high and dry” reinforcing Dustin’s message that love is a huge commitment; the title of the song, The Weight, actually refers to the gravity and seriousness of that commitment.

We can also combine dplyr with other functions, such as various stringr functions to find specific words! Let’s take a closer look at one of the most common words that we found, “light”, and check the total number of times “light” appears in lyrics of song.

wordToken2 %>% 
  str_count("light") %>% 
  sum()

## [1] 87

We can see that across all the songs in Thrice’s discography, the word “light” shows up 87 times!

Now we use mutate() to create a new column that gives us the number of times the word “light” appears for each song.

wordToken2 %>% 
  select(title, album, word) %>%
  mutate(light = str_count(word, "light")) %>% 
  group_by(title, album) %>% 
  summarize(total_light = sum(light)) %>% 
  arrange(desc(total_light)) %>% 
  head(5)

## # A tibble: 5 x 3
## # Groups:   title [5]
##                              title                              album
##                              <chr>                             <fctr>
## 1 Between The End And Where We Lie                            Vheissu
## 2                        Music Box                            Vheissu
## 3      The Artist In The Ambulance        The Artist In The Ambulance
## 4                       The Window To Be Everywhere And To Be Nowhere
## 5      A Song For Milly Michaelson              The Alchemy Index Air
## # ... with 1 more variables: total_light <int>

Let’s look at the proportion of “light” out of all the words in Thrice’s lyrics!

wordToken %>% 
  select(title, album, word) %>% 
  summarize(light = str_count(word, "light") %>% sum(),
            num_word = n(), 
            prop_light = (light / num_word))

##   light num_word  prop_light
## 1    87    18757 0.004638268

Even one of the most common words, “light”, accounts for only 0.46% of all the words in the lyrics of Thrice’s songs!

What about the most frequent word in a specific song (with and without “stop words”)?

wordToken %>% 
  group_by(title) %>% 
  count(word) %>% 
  arrange(desc(n)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   title [9]
##                     title  word     n
##                     <chr> <chr> <int>
##  1            Black Honey     i    48
##  2 Image Of The Invisible   the    36
##  3        All That's Left    we    35
##  4 Image Of The Invisible    we    27
##  5           Yellow Belly   you    25
##  6                Wake Up    we    24
##  7               Promises    we    23
##  8             The Weight     i    22
##  9                Blinded     i    21
## 10          In Your Hands     i    21

The most common words seem to mainly be personal nouns along with “the”.

wordToken2 %>% 
  group_by(title) %>% 
  count(word, sort = TRUE) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   title [8]
##                               title      word     n
##                               <chr>     <chr> <int>
##  1                      Black Honey      ill    20
##  2                          Wake Up     gotta    17
##  3                          Wake Up      wake    17
##  4                     Yellow Belly     dont    16
##  5                       The Weight     wont    15
##  6           Image Of The Invisible     image    13
##  7           Image Of The Invisible invisible    13
##  8          The Earth Isn't Humming      fall    13
##  9                Blood On The Sand      sick    12
## 10 Between The End And Where We Lie  daylight    11

“I’ll” and “I” appears the most in both lists from the song, Black Honey, a very political song that is an allegory for the meddling foreign policy of the United States. The constant appearance of “I”, “I’ll”, “I’ve” throughout the song highlights the very selfish, arrogant, and oblivious nature of the protagonist, who is aggressively seeking to obtain the “black honey”, referring to the petroleum of Middle Eastern countries.

wordToken2 %>% 
  filter(title == "Black Honey") %>% 
  count(word, sort = TRUE) %>% 
  head(10)

## # A tibble: 10 x 2
##          word     n
##         <chr> <int>
##  1       ill    20
##  2       bees     5
##  3       hand     5
##  4      swarm     5
##  5   swinging     5
##  6       till     5
##  7    theyre     4
##  8       time     4
##  9 understand     4
## 10      dont     3

In second place for this song is “bees”. In this song the the “bees” and “hornets”, symbolize the inhabitants of the Middle East countries that are trampled in the protagonist’s pursuit for the “black honey”. It’s a really great song (have a listen here), my second favorite off the album after Hurricane.

Back to the overall word count, the appearance of “image” and “invisible” from the song Image of the Invisible is more straightforward as it is shouted out during the chorus repeatedly. Most of that song is Thrice screaming that title phrase out actually!

From looking at this data, a thing to consider is that the data can be skewed toward repeated phrases in a song, like the chorus! From other lyrics analysis I’ve seen, people have tried to find lyrics that don’t have repeated choruses, however, most lyrics websites aren’t well moderated or have a ton of different people with different input styles posting lyrics of different songs for a single artist so it can be a bit tricky in this regard.

Creating nested data frames for storing plots for each album.

Now let’s try to create plots for the most frequent words for each album. To do this we need to create a “nested” data set. Basically, the “data” column will contain the specific list of the most common words for each individual album (row).

# most frequent unigrams per album: ####

word_count_nested <- wordToken2 %>% 
  group_by(album, word) %>% 
  summarize(count = n(), sort = TRUE) %>% 
  top_n(5, wt = count) %>% 
  arrange(album, desc(count)) %>% 
  nest()

Let’s take a look at the individual elements of our new “data” column!

word_count_nested$data[[1]]

## # A tibble: 8 x 3
##    word count  sort
##   <chr> <int> <lgl>
## 1 heart    12  TRUE
## 2  eyes     8  TRUE
## 3  life     6  TRUE
## 4 light     6  TRUE
## 5   cry     5  TRUE
## 6 faith     5  TRUE
## 7  soul     5  TRUE
## 8  true     5  TRUE

word_count_nested$data[[5]]

## # A tibble: 5 x 3
##    word count  sort
##   <chr> <int> <lgl>
## 1  free    15  TRUE
## 2  burn    13  TRUE
## 3  send    11  TRUE
## 4  fire    10  TRUE
## 5 flame     9  TRUE

The most common word for the first list (Album = Identity Crisis) is “heart” and the fifth list (Album = AI: Fire) is “free”. The only problem with the top_n() function is that if there are ties than the total number will be bigger than the specified n such as in Identity Crisis above.

Now we use the data to create a plot for each album using the map2() function which allows us to iteratively create a plot from each specific data column from each album row and stores the plot information in its own column plot, just like we did in data.

word_count_nested <- word_count_nested %>% 
  mutate(plot = map2(data, album, 
                     ~ggplot(data = .x, aes(fill = count)) +
           geom_bar(aes(reorder(word, count), count), 
                    stat = "identity", width = 0.65) +
           scale_y_continuous(breaks = pretty_breaks(10), limits = c(0, 22), expand = c(0, 0)) +
           scale_fill_gradient(low = "#a1d99b", high = "#006d2c", guide = FALSE) +
           ggtitle(.y) +
           labs(x = NULL, y = NULL) +
           coord_flip() +  
           theme_bw() +
           theme(axis.text.y = element_text(size = 7),
                 title = element_text(size = 10))
           ))

str(word_count_nested, list.len = 3, max.level = 2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    11 obs. of  3 variables:
##  $ album: Factor w/ 11 levels "Identity Crisis",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ data :List of 11
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of  3 variables:
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  3 variables:
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  3 variables:
##   .. [list output truncated]
##  $ plot :List of 11
##   ..$ :List of 9
##   .. ..- attr(*, "class")= chr  "gg" "ggplot"
##   ..$ :List of 9
##   .. ..- attr(*, "class")= chr  "gg" "ggplot"
##   ..$ :List of 9
##   .. ..- attr(*, "class")= chr  "gg" "ggplot"
##   .. [list output truncated]

On inspection, the word_count_nested data frame consists of three columns of album, data, and plot by 11 rows, one row for each album. The column data is a series of lists that holds the Top 10 or so words for each album (row). For example, the first element of data is a data frame of eight observations of two variables, specifically the eight most common words in the first album as rows with word and count as the two column variables. The next column, plot, is a series of lists that holds the plot information (the ggplot2 code we added into the data frame with mutate()) for each album (row).

By selecting the specific element within the list, we can extract the plot for a certain album

word_count_nested$plot[[2]]

word_count_nested$plot[[11]]

With everything organized in a “tidy” way, let’s try to plot for all of the albums!

First let’s try with something we used in Part 1, facetting!

Before we start plotting we need to “unnest” the information inside the data column to create our facetted plot. Then we can use our regular ggplot commands to create our facetted plot along with facet_grid().

word_count_nested %>% 
  unnest(data) %>%                   # take data out from list
  ggplot(aes(x = word, y = count)) +
  geom_bar(stat = "identity") +
  facet_grid(.~album)

Regardless of the fact that we don’t have enough screen space, facetting is clearly not the way to do this. One way to solve our problem is to code in a way that each plot for each album is printed out individually and then to arrange all those individual plots onto one page. This way the group of plots won’t be forcibly squished together into one gigantic plot.

One way is to use a base R method with the do.call() function. This will iterate the grid.arrange() function for the ggplot data stored in plot in every row/album.

do.call(grid.arrange, c(word_count_nested$plot, ncol = 3))

Another way is to save all the plotting data in the plot column into a single list… To get an output slightly different from the above, for this list of plots let’s make a subset of the 1st to 4th plots (Albums: Identity Crisis to The Artist In The Ambulance) instead.

nested_plots <- word_count_nested$plot[1:4]

str(nested_plots, list.len = 2, max.level = 2)

## List of 4
##  $ :List of 9
##   ..$ data       :Classes 'tbl_df', 'tbl' and 'data.frame':  8 obs. of  3 variables:
##   ..$ layers     :List of 1
##   .. [list output truncated]
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ :List of 9
##   ..$ data       :Classes 'tbl_df', 'tbl' and 'data.frame':  7 obs. of  3 variables:
##   ..$ layers     :List of 1
##   .. [list output truncated]
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##   [list output truncated]

From inspecting the list with str(), we can see that this is a list with a length of four, basically one list for each of the four albums that we subsetted. Within each album’s list we have another set of lists for the respective data and plot elements! Using this list of lists we can pass it through the plot_grid() function from the cowplot package to arrange multiple plots on a single page. In this function we basically call our list of plots with the plotlist = argument and then we can also specify the number of columns, rows, label size, etc.

library(cowplot)

plot_grid(plotlist = nested_plots, ncol = 2)

We can now view all the plots (or a subset of them) on a single page!

By using the map2() function from purrr package, this time we apply the function ggsave() so that it iteratively saves the plot for each album!

map2(paste0(word_count_nested$album, ".pdf"), word_count_nested$plot, ggsave)

We can check that the code ran properly (without having to manually look into your working directory) with the file.exists() function.

file.exists(paste0(word_count_nested$album, ".pdf"))

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

With all those plots properly saved into separate files we can now share and send them to other people!

Today we did divided up the lyrics into singular words and analyzed it at various levels and through various filters. In Part 3 we will look more closely at the different sentiments/emotions that are expressed in Thrice’s lyrics!

(Link: Part 3)

(10/30/2017 EDIT: forgot the wt = argument in top_n() when nesting the dataframe, each nested data column should have the proper number of rows now!)

To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Thrice: Breaking Down The Lyrics Word-by-Word!

Data exploration

Lyrics exploration

Some more exploration with `dplyr` verbs!

Creating nested data frames for storing plots for each album.

Related

Data exploration

Lyrics exploration

Some more exploration with dplyr verbs!

Creating nested data frames for storing plots for each album.

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Some more exploration with `dplyr` verbs!

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)