
Statistics Sunday: Last week, I showed how to tokenize text. Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which downloads lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")

devtools::install_github("josiahparry/geniusR")

## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master

## Installing geniusR

## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
##   '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'  \
##   --install-tests

##


Now you’ll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──


For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function, genius_album, that downloads the lyrics for an entire album and labels them by track.

swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")

## Joining, by = c("track_title", "track_n", "track_url")

lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")

## Joining, by = c("track_title", "track_n", "track_url")


Now we want to tokenize our datasets, remove stop words, and count word frequency. This code should look familiar, except this time I'm chaining the steps together with the pipe operator (%>%) from the tidyverse, which lets you string together multiple functions without having to nest them.

library(tidytext)

# Tokenize into one word per row, drop stop words, then count each word
tidy_swift <- swift_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

head(tidy_swift)

## Joining, by = "word"

## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32

# Same pipeline for Lorde
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

head(tidy_lorde)

## Joining, by = "word"

## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 6 light       22
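As an aside, the pipe changes only how the code reads, not what it does. Here is a minimal sketch on a hypothetical two-line lyric (toy_lyrics is made up for illustration, not taken from either album), writing the same three steps in nested and piped form. I pass by = "word" explicitly here just to keep the join message quiet.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, not real lyrics from Genius
toy_lyrics <- tibble(line = 1:2,
                     lyric = c("I call and I call", "you never call me"))

# Nested form: has to be read from the inside out
nested <- count(
  anti_join(unnest_tokens(toy_lyrics, word, lyric), stop_words, by = "word"),
  word, sort = TRUE
)

# Piped form: reads top to bottom, one step per line
piped <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Both forms should give the same table
all.equal(nested, piped)
```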


Looking at the top 6 words for each, it doesn't look like there will be a lot of overlap. But let's explore that, shall we? Lorde's album is 3 tracks shorter than Taylor Swift's. To make sure our word comparisons are meaningful, I'll create new variables that take into account each artist's total number of words (after stop word removal), so each word metric will be a proportion, allowing for direct comparisons. And because I'll be joining the datasets, I'll be sure to label these new columns by artist name.

# Rename the count column per artist and add each word's share of that artist's total
tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n / sum(swift_n))

tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n / sum(lorde_n))
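To make the proportion step concrete, here is a small sketch with made-up counts (toy_counts is hypothetical, not from the albums). Each word's count is divided by the sum of all counts in the same table, so the proportions within one artist add up to 1.

```r
library(dplyr)

# Hypothetical counts for illustration only
toy_counts <- tibble(word = c("call", "time", "love"),
                     n = c(6L, 3L, 1L))

# Each count divided by the table's total (6 + 3 + 1 = 10)
toy_props <- toy_counts %>%
  mutate(prop = n / sum(n))

toy_props
sum(toy_props$prop)
```

Because the counts here come after stop words were removed, the denominator in the real data is the number of remaining words, not every word sung on the album.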


There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When a word appears in only one artist's dataset, full_join fills the other artist's columns with missing values (NA).

compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")

summary(compare_words)

##      word              swift_n         swift_prop         lorde_n
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0
##                     NA's   :301      NA's   :301       NA's   :508
##    lorde_prop
##  Min.   :0.0008
##  1st Qu.:0.0008
##  Median :0.0008
##  Mean   :0.0022
##  3rd Qu.:0.0023
##  Max.   :0.0307
##  NA's   :508


The final dataset contains 957 tokens - unique words - and the NAs tell how many words are only present in one artist's corpus. Lorde uses 301 words Taylor Swift does not (the NAs in the Swift columns), and Taylor Swift uses 508 words that Lorde does not (the NAs in the Lorde columns). That leaves 957 - 301 - 508 = 148 words on which they overlap.
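The same bookkeeping can be sketched on toy data. The two small tables below are hypothetical (a, b, a_n, and b_n are names I made up for illustration); the point is only how NAs after a full_join map to "words one artist never uses."

```r
library(dplyr)

# Hypothetical word counts for two artists
a <- tibble(word = c("call", "love", "time"), a_n = c(5L, 2L, 1L))
b <- tibble(word = c("love", "boom"),         b_n = c(4L, 3L))

# Keep every word from both tables; unmatched cells become NA
both <- full_join(a, b, by = "word")

only_b <- sum(is.na(both$a_n))  # words artist A never uses
only_a <- sum(is.na(both$b_n))  # words artist B never uses
shared <- nrow(both) - only_a - only_b

c(total = nrow(both), only_a = only_a, only_b = only_b, shared = shared)
```

Run against compare_words, the analogous counts would be sum(is.na(compare_words$swift_n)) and sum(is.na(compare_words$lorde_n)).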

There are many things we could do with these data, but let's visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

# Scatter words by their proportion for each artist; the diagonal marks equal usage
ggplot(compare_words, aes(x = swift_prop, y = lorde_prop)) +
  geom_abline() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  labs(y = "Lorde", x = "Taylor Swift") +
  theme_classic()

## Warning: Removed 809 rows containing missing values (geom_text).


The warning lets me know there are 809 rows with missing values - those are the words only present in one artist's corpus (301 + 508). Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses "love," "light," and, yes, "shit," more than Swift, while Swift uses "call," "wanna," and "hands" more than Lorde. They use words like "waiting," "heart," and "dreams" at similar rates. Rates are low overall: looking at the max values for the proportion variables, Swift's most common word accounts for only about 2.3% of her total words, and Lorde's most common word for only about 3.1% of hers.
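Those maximum proportions can be roughly checked from the summary output alone. This is only a back-of-the-envelope sketch: it reconstructs each artist's total word count from the rounded means printed above, so the totals are approximate, and the variable names are mine.

```r
# Number of distinct words per artist = all rows minus that artist's NAs
swift_words <- 957 - 301   # 656
lorde_words <- 957 - 508   # 449

# Approximate totals: distinct words times the (rounded) mean count
swift_total <- swift_words * 3.021
lorde_total <- lorde_words * 2.9

# Top word count divided by the approximate total
swift_max_prop <- 46 / swift_total
lorde_max_prop <- 40 / lorde_total

round(c(swift = swift_max_prop, lorde = lorde_max_prop), 4)
```

The results should land close to the Max. values that summary() reported for swift_prop and lorde_prop.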

This highlights why it's important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like "the," "a," and "and."

Next Statistics Sunday, we'll take a look at sentiment analysis!