# Rick and Morty and Tidy Data Principles (Part 2)

October 21, 2017

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

# Motivation

The first part left the door open to analyzing Rick and Morty’s dialogue with tf-idf, bag-of-words, or other NLP techniques. Here I’m also borrowing a lot of ideas from Julia Silge’s blog.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

# Term Frequency

The most basic measure in natural language processing is simply counting words. This is a crude way of knowing what a document is about. The problem with counting words, however, is that some words (called stopwords) are always very common, like “the” or “that”. To create a more meaningful representation, people usually compare the word counts observed in a document with those of a larger body of text.

Tf-idf is the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents.

The inverse document frequency for any given term is defined as:

$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$

We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.
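To make the formula concrete, here is a minimal base R sketch. The three toy documents and the helper `idf()` are made up for illustration and are not part of the subtitles data.

```r
# Toy corpus (hypothetical): each document is just a vector of words
docs <- list(
  A = c("rick", "morty", "pickle"),
  B = c("rick", "morty", "unity"),
  C = c("rick", "meeseeks")
)

# Hypothetical helper: idf of one term, following the definition above.
# A term that appears in no document would give log(Inf) = Inf.
idf <- function(term, docs) {
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(length(docs) / n_containing)  # log() is the natural logarithm in R
}

idf("rick", docs)    # appears in 3 of 3 documents: log(3/3)
idf("morty", docs)   # appears in 2 of 3 documents: log(3/2)
idf("pickle", docs)  # appears in 1 of 3 documents: log(3/1)
```

A term that shows up in every document gets an idf of zero, and the rarer a term is across documents, the larger its idf.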

# What do Rick and Morty say?

Let’s start by looking at Rick and Morty dialogues and examine first term frequency, then tf-idf. I’ll remove stopwords before counting.

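# Assumed package list, inferred from the functions used in this post
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidytext, ggplot2, viridis, ggstance)

# rick_and_morty_subs is the subtitles data frame from Part 1
rick_and_morty_subs_tidy <- rick_and_morty_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(season, word, sort = TRUE)

# Total number of non-stopword words per season
total_words <- rick_and_morty_subs_tidy %>%
  group_by(season) %>%
  summarize(total = sum(n))
season_words <- left_join(rick_and_morty_subs_tidy, total_words, by = "season")

season_words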

# A tibble: 11,933 x 4
season   word     n total

1    S01  morty  1283 19381
2    S01   rick  1132 19381
3    S01  jerry   475 19381
4    S03  morty   331 13008
5    S03   rick   251 13008
6    S02   rick   242 12829
7    S02  morty   228 12829
8    S01   beth   224 19381
9    S01 summer   215 19381
10    S01   yeah   209 19381
# ... with 11,923 more rows


Let’s look at the distribution of n/total for each season, the number of times a word appears in a season divided by the total number of terms (words) in that season. This is term frequency!
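As a quick check using the first row of the output above, the term frequency of “morty” in season 1 works out to:

$$tf(\text{morty}, \text{S01}) = \frac{n}{\text{total}} = \frac{1283}{19381} \approx 0.0662$$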

ggplot(season_words, aes(n / total, fill = season)) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  xlim(0, 0.001) +
  labs(title = "Term Frequency Distribution in Rick and Morty's Seasons",
       y = "Count") +
  facet_wrap(~season, nrow = 3, scales = "free_y") +
  theme_minimal(base_size = 13) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(strip.text = element_text(hjust = 0, face = "italic"))


These dialogues have very long tails to the right because of the extremely common words. The plots show a similar distribution for each season: many words occur rarely and few words occur frequently. The idea of tf-idf is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are not used very much in a collection or corpus of documents, in this case the Rick and Morty seasons as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.

season_words <- season_words %>%
  bind_tf_idf(word, season, n)

season_words

# A tibble: 11,933 x 7
season   word     n total         tf   idf tf_idf

1    S01  morty  1283 19381 0.06619885     0      0
2    S01   rick  1132 19381 0.05840772     0      0
3    S01  jerry   475 19381 0.02450854     0      0
4    S03  morty   331 13008 0.02544588     0      0
5    S03   rick   251 13008 0.01929582     0      0
6    S02   rick   242 12829 0.01886351     0      0
7    S02  morty   228 12829 0.01777223     0      0
8    S01   beth   224 19381 0.01155771     0      0
9    S01 summer   215 19381 0.01109334     0      0
10    S01   yeah   209 19381 0.01078376     0      0
# ... with 11,923 more rows
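The zeros in the idf column above can be reproduced on a table small enough to check by hand. This is only a sketch with made-up counts (the `toy` data frame and its values are my own, not taken from the subtitles). Here “rick” appears in all three toy documents, so `bind_tf_idf()` should give it an idf of ln(3/3) = 0, while “pickle” and “unity” each appear in a single document and should get ln(3).

```r
library(dplyr)
library(tidytext)

# Made-up word counts for three toy documents (illustration only)
toy <- data.frame(
  doc  = c("A", "A", "B", "B", "C"),
  word = c("rick", "pickle", "rick", "unity", "rick"),
  n    = c(4, 1, 3, 2, 5),
  stringsAsFactors = FALSE
)

# Same call pattern as above: term column, document column, count column
toy_tf_idf <- toy %>%
  bind_tf_idf(word, doc, n)

toy_tf_idf
```

However often “rick” is said, its tf-idf is zero because it does not help tell one document from another.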


Notice that idf, and thus tf-idf, is zero for the most common words that remain after removing stopwords. These words appear in every season, so the idf term (the natural log of 1) is zero; “Rick” and “Morty” are examples of this. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how the approach decreases the weight of common words. The inverse document frequency is higher for words that occur in fewer of the documents in the collection. Let’s look at terms with high tf-idf.

season_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

# A tibble: 11,933 x 6
season        word     n          tf      idf      tf_idf

1    S03      pickle    43 0.003305658 1.098612 0.003631637
2    S02       unity    32 0.002494349 1.098612 0.002740322
3    S01    meeseeks    44 0.002270265 1.098612 0.002494141
4    S03 vindicators    26 0.001998770 1.098612 0.002195873
5    S02       purge    25 0.001948710 1.098612 0.002140877
6    S01         flu    32 0.001651102 1.098612 0.001813921
7    S01    crystals    30 0.001547908 1.098612 0.001700550
8    S03       tommy    20 0.001537515 1.098612 0.001689133
9    S02        deer    19 0.001481020 1.098612 0.001627066
10    S03        noob    18 0.001383764 1.098612 0.001520220
# ... with 11,923 more rows
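The top row can be reproduced by hand from numbers already printed in this post. The sketch below uses my own variable names and assumes that the three seasons shown in the output (S01, S02, S03) are the only documents in the corpus.

```r
# Copied from the printed output: "pickle" appears 43 times in S03,
# and S03 has 13008 words in total after removing stopwords
n_pickle  <- 43
total_s03 <- 13008

# Assumption: three seasons act as documents, and "pickle" appears in one
n_seasons             <- 3
n_seasons_with_pickle <- 1

tf     <- n_pickle / total_s03
idf    <- log(n_seasons / n_seasons_with_pickle)
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 6)
```

The three values should agree with the “pickle” row above up to rounding.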


Curious about “pickle”? You’d better watch the Pickle Rick episode if you don’t get why “pickle” is the highest-ranked tf-idf term. “Vindicators” is another term concentrated in the one episode where the Vindicators appear. There’s even an episode where the flu is part of the central problem, and Rick has to use his mind to try to fix a flu that got out of control because of his inventions.

Some of the values for idf are the same for different terms because there are 3 documents (seasons) in this corpus, so idf can only take the values ln(3/1), ln(3/2) and ln(3/3). Let’s look at a visualization for these high tf-idf words.

# Order words by tf-idf so the bars are sorted in the plot
plot_tfidf <- season_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

ggplot(plot_tfidf[1:20, ], aes(tf_idf, word, fill = season, alpha = tf_idf)) +
  geom_barh(stat = "identity") +
  labs(title = "Highest tf-idf words in Rick and Morty's Seasons",
       y = NULL, x = "tf-idf") +
  theme_minimal(base_size = 13) +
  # guide = "none" replaces guide = FALSE, which newer ggplot2 versions deprecate
  scale_alpha_continuous(range = c(0.6, 1), guide = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(legend.title = element_blank()) +
  theme(legend.justification = c(1, 0), legend.position = c(1, 0))


Let’s look at the seasons individually.

# Keep the 15 highest tf-idf words per season (ties are kept)
plot_tfidf <- plot_tfidf %>%
  group_by(season) %>%
  top_n(15, tf_idf) %>%
  ungroup()

ggplot(plot_tfidf, aes(tf_idf, word, fill = season, alpha = tf_idf)) +
  geom_barh(stat = "identity", show.legend = FALSE) +
  labs(title = "Highest tf-idf words in Rick and Morty's Seasons",
       y = NULL, x = "tf-idf") +
  facet_wrap(~season, nrow = 3, scales = "free") +
  theme_minimal(base_size = 13) +
  scale_alpha_continuous(range = c(0.6, 1)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(strip.text = element_text(hjust = 0, face = "italic"))

