[This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Motivation

After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty I thought about doing something similar but reproducible and focused on Rick and Morty.

In this post I’ll focus on the Tidy Data principles. However, here is the Github repo with the scripts to scrap the transcripts and subtitles of Rick and Morty.

Here I’m using the subtitles of the TV show, as some of the transcripts I could scrap were incomplete.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

# Let’s scrap

The subtools package returns a data frame after reading srt files. In addition to that resulting data frame I wanted to explicitly point the season and chapter of each line of the subtitles. To do that I had to scrap the subtitles and then use str_replace_all. To follow the steps clone the repo from Github:

git clone https://github.com/pachamaltese/rick_and_morty_tidy_text


# Rick and Morty Can Be So Tidy

After reading the tidy file I created after scraping the subtitles, I use unnest_tokens to divide the subtitles in words. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.

if (!require("pacman")) install.packages("pacman")

rick_and_morty_subs_tidy = rick_and_morty_subs %>%
unnest_tokens(word,text) %>%
anti_join(stop_words)


The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join to remove words such a “a”, “an” or “the”.

Then we can use count to find the most common words in all of Rick and Morty episodes as a whole.

rick_and_morty_subs_tidy %>%
count(word, sort = TRUE)

# A tibble: 8,100 x 2
word     n
<chr> <int>
1  morty  1842
2   rick  1625
3  jerry   621
4   yeah   484
5  gonna   421
6    hey   391
7 summer   389
8     uh   331
9   time   319
10   beth   295
# ... with 8,090 more rows


Sentiment analysis can be done as an inner join. Three sentiment lexicons are in the tidytext package in the sentiment dataset. Let’s examine how sentiment changes changes during each novel. Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.

bing = sentiments %>%
filter(lexicon == "bing") %>%
select(-score)

bing

# A tibble: 6,788 x 3
word sentiment lexicon
<chr>     <chr>   <chr>
1     2-faced  negative    bing
2     2-faces  negative    bing
3          a+  positive    bing
4    abnormal  negative    bing
5     abolish  negative    bing
6  abominable  negative    bing
7  abominably  negative    bing
8   abominate  negative    bing
9 abomination  negative    bing
10       abort  negative    bing
# ... with 6,778 more rows

rick_and_morty_sentiment = rick_and_morty_subs_tidy %>%
inner_join(bing) %>%
count(episode_name, index = linenumber %/% 50, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
left_join(rick_and_morty_subs_tidy[,c("episode_name","season","episode")] %>% distinct()) %>%
arrange(season,episode) %>%
mutate(episode_name = paste(season,episode,"-",episode_name),
season = factor(season, labels = c("Season 1", "Season 2", "Season 3"))) %>%
select(episode_name, season, everything(), -episode)

rick_and_morty_sentiment

# A tibble: 431 x 6
episode_name   season index negative positive sentiment
<chr>   <fctr> <dbl>    <dbl>    <dbl>     <dbl>
1 S01 E01 - Pilot Season 1     0        6        3        -3
2 S01 E01 - Pilot Season 1     1       10        0       -10
3 S01 E01 - Pilot Season 1     2        3        1        -2
4 S01 E01 - Pilot Season 1     3       10        4        -6
5 S01 E01 - Pilot Season 1     4        2        5         3
6 S01 E01 - Pilot Season 1     5        8        4        -4
7 S01 E01 - Pilot Season 1     6        6        1        -5
8 S01 E01 - Pilot Season 1     7        7        4        -3
9 S01 E01 - Pilot Season 1     8       14        5        -9
10 S01 E01 - Pilot Season 1     9        3        2        -1
# ... with 421 more rows


Now we can plot these sentiment scores across the plot trajectory of each novel. In the second plot I’m just showing Dan Harmon’s favourite episodes provided to the moment the show has 31 episodes in total.

ggplot(rick_and_morty_sentiment, aes(index, sentiment, fill = season)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~season, nrow = 3, scales = "free_x", dir = "v") +
theme_minimal(base_size = 13) +
labs(title = "Sentiment in Rick and Morty",
y = "Sentiment") +
scale_fill_viridis(end = 0.75, discrete=TRUE) +
scale_x_discrete(expand=c(0.02,0)) +
theme(strip.text=element_text(hjust=0)) +
theme(strip.text = element_text(face = "italic")) +
theme(axis.title.x=element_blank()) +
theme(axis.ticks.x=element_blank()) +
theme(axis.text.x=element_blank())


rick_and_morty_sentiment_favourites = rick_and_morty_sentiment %>%
filter(grepl("S03 E03|S03 E07|S01 E06|S02 E03|S02 E07", episode_name))

ggplot(rick_and_morty_sentiment_favourites, aes(index, sentiment, fill = season)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~episode_name, ncol = 5, scales = "free_x", dir = "h") +
theme_minimal(base_size = 13) +
labs(title = "Sentiment in Rick and Morty\n(Creator's favourite episodes)",
y = "Sentiment") +
scale_fill_viridis(end = 0.75, discrete=TRUE) +
scale_x_discrete(expand=c(0.02,0)) +
theme(strip.text=element_text(hjust=0)) +
theme(strip.text = element_text(face = "italic")) +
theme(axis.title.x=element_blank()) +
theme(axis.ticks.x=element_blank()) +
theme(axis.text.x=element_blank())


# Looking at Units Beyond Words

Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that I am not having a good day is a negative sentence, not a positive one, because of negation.

rick_and_morty_sentences = rick_and_morty_subs %>%
group_by(season) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
ungroup()


Let’s look at just one.

rick_and_morty_sentences\$sentence[50]

[1] "is gonna be really liberating."


We can use tidy text analysis to ask questions such as what are the most negative episodes in each of Rick and Morty’s seasons? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a dataframe of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. Which chapter has the highest proportion of negative words?

bingnegative = sentiments %>%
filter(lexicon == "bing", sentiment == "negative")

wordcounts = rick_and_morty_subs_tidy %>%
group_by(season, episode) %>%
summarize(words = n())

rick_and_morty_subs_tidy %>%
semi_join(bingnegative) %>%
group_by(season, episode) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("season", "episode")) %>%
mutate(ratio = negativewords/words) %>%
top_n(1)

# A tibble: 3 x 5
# Groups:   season [3]
season episode negativewords words     ratio
<chr>   <chr>         <int> <int>     <dbl>
1    S01     E02           124  1036 0.1196911
2    S02     E01           184  1386 0.1327561
3    S03     E06           197  1486 0.1325707


# Networks of Words

Another function in widyr is pairwise_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of the first season.

rick_and_morty_words = rick_and_morty_subs_tidy %>%
filter(season == "S01")

word_cooccurences = rick_and_morty_words %>%
pairwise_count(word, linenumber, sort = TRUE)

word_cooccurences

# A tibble: 221,364 x 3
item1 item2     n
<chr> <chr> <dbl>
1 morty  rick   461
2  rick morty   461
3 jerry  rick   234
4  rick jerry   234
5 jerry morty   228
6 morty jerry   228
7  yeah  rick   136
8  rick  yeah   136
9  yeah morty   130
10 morty  yeah   130
# ... with 221,354 more rows


This can be useful, for example, to plot a network of co-occuring words with the igraph and ggraph packages.

set.seed(1717)

word_cooccurences %>%
filter(n >= 25) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") +
geom_node_point(color = "darkslategray4", size = 8) +
geom_node_text(aes(label = name), vjust = 2.2) +
ggtitle(expression(paste("Word Network in Rick and Morty's ",
italic("Season One")))) +
theme_void()


It looks good! at least it contains the main characters and Rick’s swearing.