Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics

May 13, 2018

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Last week, I showed how to tokenize text. Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")
devtools::install_github("josiahparry/geniusR")
## Downloading GitHub repo josiahparry/geniusR@master
## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
## Installing geniusR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
## --no-environ --no-save --no-restore --quiet CMD INSTALL \
## '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82' \
## --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' \
## --install-tests
## 
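
If you rerun this script often, you may not want to reinstall everything each time. Here is a minimal sketch for checking what is missing first; missing_pkgs is a helper name I made up for illustration, and the install calls are left commented out so nothing gets downloaded by accident.

```r
# Return the names of packages that are not yet installed.
# (missing_pkgs is a hypothetical helper, not part of any package.)
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("devtools", "geniusR", "tidyverse", "tidytext")
to_install <- missing_pkgs(needed)
print(to_install)

# Install only what is missing (commented out on purpose):
# if ("devtools" %in% to_install) install.packages("devtools")
# if ("geniusR" %in% to_install) devtools::install_github("josiahparry/geniusR")
```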

Now you’ll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()

For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.

swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")
## Joining, by = c("track_title", "track_n", "track_url")
lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")
## Joining, by = c("track_title", "track_n", "track_url")
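
If you want to follow along without downloading anything, a small stand-in data frame works for trying out the later steps. This is only a sketch: the lyric lines are made up, and the one-row-per-line layout with track_n, track_title, and lyric columns is my assumption based on the code in this post, not a guaranteed schema.

```r
# Hypothetical stand-in for the result of genius_album().
# Column names and layout are assumptions; the lyric lines are invented.
toy_lyrics <- data.frame(
  track_n     = c(1, 1, 2, 2, 2),
  track_title = c("Song A", "Song A", "Song B", "Song B", "Song B"),
  lyric       = c("I wanna call you",
                  "Call me in the night",
                  "The light, the light",
                  "Boom boom",
                  "Waiting for the light"),
  stringsAsFactors = FALSE
)

# Quick sanity checks: structure, and how many lines each track has
str(toy_lyrics)
lines_per_track <- table(toy_lyrics$track_title)
print(lines_per_track)
```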

Now we want to tokenize our datasets, remove stop words, and count word frequency. This code should look familiar, except this time I’m chaining the steps with the pipe operator (%>%) from the tidyverse, which lets you string together multiple functions without having to nest them.

library(tidytext)
tidy_swift <- swift_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
head(tidy_swift)
## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
head(tidy_lorde)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 5 homemade    22
## 6 light       22
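
To see what each step in that pipeline is doing, here is a rough base R approximation on three made-up lines. It is only a sketch: the real unnest_tokens tokenizer handles punctuation more carefully than my simple split, and toy_stop_words is a tiny list I invented, far shorter than the stop_words dataset in tidytext.

```r
# Three invented lyric lines
lyrics <- c("I wanna call you", "Call me in the night", "The light, the light")

# Roughly what unnest_tokens() does: lowercase, then split into words
words <- unlist(strsplit(tolower(lyrics), "[^a-z']+"))
words <- words[words != ""]

# Roughly what anti_join(stop_words) does: drop words found in a stop list
toy_stop_words <- c("i", "you", "me", "in", "the")  # illustrative only
words <- words[!words %in% toy_stop_words]

# Roughly what count(word, sort = TRUE) does: tally and sort descending
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```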

Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we? Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.

tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n / sum(swift_n))

tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n / sum(lorde_n))
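
Here is a quick illustration of why proportions matter, using made-up counts rather than the real lyrics. The same word shows up 10 times for each artist, but because one artist has half as many words overall, its share is twice as large.

```r
# Invented counts: "call" appears 10 times for both artists
counts_a <- c(call = 10, time = 50, night = 40)  # 100 words in total
counts_b <- c(call = 10, boom = 25, light = 15)  #  50 words in total

# Same calculation as the mutate() step: count divided by the total
prop_a <- counts_a / sum(counts_a)
prop_b <- counts_b / sum(counts_b)

prop_a[["call"]]  # 0.1
prop_b[["call"]]  # 0.2
```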

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist’s set, it will fill those variables in with missing values.

compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")

summary(compare_words)
##      word              swift_n         swift_prop         lorde_n    
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0
##                     NA's   :301      NA's   :301       NA's   :508
##    lorde_prop
##  Min.   :0.0008
##  1st Qu.:0.0008
##  Median :0.0008
##  Mean   :0.0022
##  3rd Qu.:0.0023
##  Max.   :0.0307
##  NA's   :508
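
If the difference between join types is hard to picture, this small sketch on invented data may help. The toy_swift and toy_lorde tables are made up for illustration; the join functions themselves are the dplyr ones used in this post.

```r
library(dplyr)

# Two tiny invented word-count tables
toy_swift <- tibble(word = c("call", "time", "heart"), swift_n = c(3L, 2L, 1L))
toy_lorde <- tibble(word = c("boom", "heart"), lorde_n = c(4L, 1L))

# inner_join keeps only words present in both tables
inner_join(toy_swift, toy_lorde, by = "word")

# anti_join keeps words in the first table that are absent from the second
anti_join(toy_swift, toy_lorde, by = "word")

# full_join keeps every word from both tables, filling gaps with NA
toy_compare <- full_join(toy_swift, toy_lorde, by = "word")
print(toy_compare)
```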

The final dataset contains 957 tokens – unique words – and the NAs tell how many words are only present in one artist’s corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.
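
The overlap figure is just subtraction, and the same bookkeeping can be done directly on a joined table by counting NAs. The first part below uses the numbers from the summary output; the toy table in the second part is invented for illustration.

```r
# Figures reported by summary(compare_words)
total_words <- 957
lorde_only  <- 301  # NAs in the swift columns
swift_only  <- 508  # NAs in the lorde columns
shared <- total_words - lorde_only - swift_only
shared  # 148

# The same counts on a tiny made-up joined table
toy <- data.frame(
  word    = c("call", "time", "heart", "boom"),
  swift_n = c(3, 2, 1, NA),
  lorde_n = c(NA, NA, 1, 4)
)
toy_lorde_only <- sum(is.na(toy$swift_n))   # words missing from Swift
toy_swift_only <- sum(is.na(toy$lorde_n))   # words missing from Lorde
toy_shared     <- sum(!is.na(toy$swift_n) & !is.na(toy$lorde_n))
c(lorde_only = toy_lorde_only, swift_only = toy_swift_only, shared = toy_shared)
```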

There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

ggplot(compare_words, aes(x = swift_prop, y = lorde_prop)) +
  geom_abline() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  labs(y = "Lorde", x = "Taylor Swift") +
  theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text).

The warning lets me know there are 809 rows with missing values: those are the words only present in one artist’s corpus. Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses “love,” “light,” and, yes, “shit,” more than Swift, while Swift uses “call,” “wanna,” and “hands” more than Lorde. They use words like “waiting,” “heart,” and “dreams” at similar rates. Rates are low overall: if you look at the max values for the proportion variables, Swift’s most common word accounts for only about 2.3% of her total words, and Lorde’s most common word accounts for only about 3.1% of hers.
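
Reading the plot comes down to comparing each word’s two proportions. With no arguments, geom_abline() draws a line with slope 1 and intercept 0, so a word sits above it when its Lorde proportion is larger. The 809 dropped rows are the 301 + 508 words with an NA on one side. Here is a small sketch of that logic; the proportions are made up for illustration, not taken from the real data.

```r
# Invented proportions for four words (NA = word absent for that artist)
toy <- data.frame(
  word       = c("love", "call", "heart", "boom"),
  swift_prop = c(0.004, 0.023, 0.005, NA),
  lorde_prop = c(0.020, 0.002, 0.005, 0.031)
)

# Words missing from either artist cannot be placed on the plot
plotted <- toy[complete.cases(toy), ]

# Position relative to the y = x reference line
plotted$side <- ifelse(
  plotted$lorde_prop > plotted$swift_prop, "above (more Lorde)",
  ifelse(plotted$lorde_prop < plotted$swift_prop, "below (more Swift)", "on the line")
)
print(plotted)
```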

This highlights why it’s important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like “the,” “a,” and “and.”

Next Statistics Sunday, we’ll take a look at sentiment analysis!
