How Has Taylor Swift’s Word Choice Changed Over Time?

Posted on May 22, 2018 in R bloggers

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].

Sunday night was a big night for Taylor Swift: not only was she nominated for multiple Billboard Music Awards, she also took home Top Female Artist and Top Selling Album. So I thought it was a good time for some more Taylor Swift-themed statistical analysis.

When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics – specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I’ve been blogging more about statistics, research, R, and data science, and I’ve loved getting to teach and share.

But sometimes, you just want to overthink and overanalyze pop culture.

So in a similar vein to the text analysis I’ve been demonstrating on my blog, I decided to answer a question I’m sure we all have – as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?

I’ve used the geniusR package in a couple of posts, and I’ll be using it again today to answer this question. I’ll be pulling in some additional code, some based on code from the Text Mining with R: A Tidy Approach book I recently devoured, some written to tackle this problem I’ve created for myself. I’ve shared all my code and tried to credit those who helped me write it where I can.

First, we want to pull in the names of Taylor Swift’s 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.

library(rvest)

## Loading required package: xml2

TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

disc <- TSdisc %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
  html_table(fill = TRUE)

Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn’t add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.

TS_albums <- disc[[1]]

The data frame requires a little cleaning. First up, there are 8 rows, but only 6 albums. Because the Wikipedia table had a double header, the second header was read in as a row of data, so I want to delete that. The last row contains a footnote that was included with the table. So I removed those two rows, first and last, and kept only the first two columns, since those are the only ones I care about. Second, the release date was in a table cell along with record label and formats (e.g., CD, vinyl). I don't need those for my purposes, so I'll only pull out the year and drop the rest. Finally, I converted year from character to numeric - this becomes important later on.

library(tidyverse)

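# Keep rows 2-7 (the six albums) and the first two columns
TS_albums <- TS_albums[2:7, 1:2]

# Split the details cell on non-alphanumeric characters and keep only the year
TS_albums <- TS_albums %>%
  separate(`Album details`, c("Released", "Month", "Day", "Year"),
           extra = "drop") %>%
  select(c("Title", "Year"))

# Year must be numeric so it can serve as a continuous plot axis later
TS_albums$Year <- as.numeric(TS_albums$Year)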
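
To see what separate() does to a details cell, here is a minimal sketch on a toy data frame. The album titles, label name, and exact cell text are made up for illustration; the real Wikipedia cell may be laid out differently. By default separate() splits on any run of non-alphanumeric characters, so the first four pieces are the word "Released", the month, the day, and the year, and extra = "drop" discards everything after that.

```r
library(dplyr)
library(tidyr)

# Hypothetical cells imitating a "Released: <date> Label: ..." layout
toy_albums <- tibble(
  Title = c("Album A", "Album B"),
  `Album details` = c("Released: October 24, 2006 Label: Example Records",
                      "Released: November 11, 2008 Label: Example Records")
)

toy_clean <- toy_albums %>%
  # Split on non-alphanumeric runs; keep the first four pieces only
  separate(`Album details`, c("Released", "Month", "Day", "Year"),
           extra = "drop") %>%
  select(Title, Year) %>%
  mutate(Year = as.numeric(Year))

toy_clean
```

If the cell began with anything other than the word "Released", the year would land in a different piece, so it is worth eyeballing the result before moving on.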

I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.

library(geniusR)

TS_lyrics <- TS_albums %>%
  mutate(tracks = map2("Taylor Swift", Title, genius_album))

## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")

TS_lyrics <- TS_lyrics %>%
  unnest(tracks)

Now we'll tokenize our lyrics data frame, and start doing our word analysis.

library(tidytext)

tidy_TS <- TS_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)

## Joining, by = "word"

tidy_TS %>%
  count(word, sort = TRUE)

## # A tibble: 2,024 x 2
##    word      n
##    <chr> <int>
##  1 time    198
##  2 love    180
##  3 baby    118
##  4 ooh     104
##  5 stay     89
##  6 night    85
##  7 wanna    84
##  8 yeah     83
##  9 shake    80
## 10 ey       72
## # ... with 2,014 more rows
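
For anyone new to tidytext, here is a small sketch of the same two steps on made-up lines (the toy lyrics and track names are hypothetical, not pulled from Genius). unnest_tokens() lowercases the text and gives one word per row, and anti_join() against the stop_words data set removes common words such as "the" and "it".

```r
library(dplyr)
library(tidytext)

# Hypothetical toy lyrics
toy_lyrics <- tibble(
  track_title = c("Song A", "Song B"),
  lyric = c("The love of the night", "Baby, it is time")
)

toy_tidy <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%       # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common stop words

toy_tidy %>%
  count(word, sort = TRUE)
```

Passing by = "word" explicitly also silences the "Joining, by" message.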

There are a little over 2,000 unique words across TS's 6 albums. But how have they changed over time? To examine this, I'll create a dataset that counts word by year (or album, really). Then I'll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on the authors' Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code off that.

words_by_year <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 50)

nested_words <- words_by_year %>%
  nest(-word)

word_models <- nested_words %>%
  mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, .,
                                 family = "binomial")))
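
To make the per-word model concrete, here is a sketch of a single binomial regression on made-up counts for one word (the numbers are hypothetical, not from the lyrics data). One thing worth knowing: when glm() gets a two-column response, R reads the columns as successes and failures. The code above follows the book's form, cbind(count, time_total); the sketch below uses the variant cbind(count, time_total - count), which treats every other word in that year as a "failure". When a word is a small share of the total, the two forms should give very similar slopes.

```r
# Hypothetical counts of one word across six album years
toy_word <- data.frame(
  Year       = c(2006, 2008, 2010, 2012, 2014, 2017),
  count      = c(2, 4, 7, 9, 14, 20),
  time_total = rep(2000, 6)
)

# Successes = uses of this word; failures = all other words that year
fit <- glm(cbind(count, time_total - count) ~ Year,
           data = toy_word, family = "binomial")

# Slope for Year (on the log-odds scale) and its p-value
summary(fit)$coefficients["Year", c("Estimate", "Pr(>|z|)")]
```

A positive slope means the word makes up a growing share of all words over time; a negative slope means a shrinking share.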

This nests our regression results in a data frame called word_models. While I could unnest and keep all, I don't care about every value the GLM gives me. What I care about is the slope for Year, so the filter selects only that slope and the associated p-value. I can then filter to select the significant/marginally significant slopes for plotting (p < 0.1).

library(broom)

slopes <- word_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "Year") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

top_slopes <- slopes %>%
  filter(adjusted.p.value < 0.1) %>%
  select(-statistic, -p.value)
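
A quick illustration of what the adjustment does, using made-up p-values. With no method argument, p.adjust() applies the Holm correction: the smallest p-value is multiplied by the number of tests, the next smallest by one fewer, and so on, with the results kept non-decreasing.

```r
# Hypothetical raw p-values for four words
raw_p <- c(0.001, 0.02, 0.04, 0.30)

# Default method is "holm"
adj_p <- p.adjust(raw_p)
adj_p

# Which would pass the p < 0.1 cutoff used above
adj_p < 0.1
```

Because the multiplier depends on how many models were fit, filtering to fewer words before modeling makes the correction less severe.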

This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they've changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.

library(scales)

words_by_year %>%
  inner_join(top_slopes, by = "word") %>%
  ggplot(aes(Year, count/time_total, color = word, lty = word)) +
  geom_line(size = 1.3) +
  labs(x = NULL, y = "Word Frequency") +
  scale_x_continuous(breaks=TS_albums$Year,
                     labels=TS_albums$Title) +
  scale_y_continuous(labels=scales::percent) +
  geom_vline(xintercept = 2014) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  annotate("text", x = c(2009.5,2015.5), y = c(0.025,0.025),
           label = c("Country", "Pop") , size=5)

Title	Year	track_title	lyric
Speak Now	2010	Back to December (Acoustic)	When your birthday passed, and I didn't call
Red	2012	All Too Well	And you call me up again just to break me like a promise
Reputation	2017	Call It What You Want	Call it what you want, call it what you want, call it

Title	Year	track_title	lyric
Taylor Swift	2006	A Perfectly Good Heart	And realized by the distance in your eyes that I would be the one to fall
Speak Now	2010	Better Than Revenge	I'm just another thing for you to roll your eyes at, honey
Red	2012	State of Grace	Just twin fire signs, four blue eyes
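
One way to pull example lines like the ones in the two tables above is to filter the lyrics for a whole-word match. This is only a sketch: find_word() is a hypothetical helper of my own naming, and the toy data frame just reuses three of the lines shown above rather than the full TS_lyrics data.

```r
# Three lines taken from the tables above
toy_lines <- data.frame(
  track_title = c("Back to December (Acoustic)",
                  "State of Grace",
                  "Call It What You Want"),
  lyric = c("When your birthday passed, and I didn't call",
            "Just twin fire signs, four blue eyes",
            "Call it what you want, call it what you want, call it"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose lyric contains `target` as a whole word
find_word <- function(lines, target) {
  pattern <- paste0("\\b", target, "\\b")
  lines[grepl(pattern, lines$lyric, ignore.case = TRUE, perl = TRUE), ]
}

call_lines <- find_word(toy_lines, "call")
eyes_lines <- find_word(toy_lines, "eyes")
call_lines
```

The word boundaries keep "call" from also matching words like "called" or "recall".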
