[This article was first published on data science ish, and kindly contributed to R-bloggers].
The inspiration for this post is a joint venture by both me and my husband, and its genesis lies more than 15 years in our past. One of the recurring conversations we have in our relationship (all long-term relationships have these, right?!) is about song lyrics and place names. I think the first time we ever had this conversation was in the late 1990s and was about Baltimore. “Why do so many songs talk about Baltimore?” we asked each other. “And why does it always sound so miserable there?” At the time we were listening to a lot of Lyle Lovett, and Counting Crows was on the radio a lot.
We have continued to have this conversation many, many times over our years together, noticing state and city names in song lyrics and wondering if or why certain places are mentioned more often. Are certain locations mentioned in song lyrics at a higher rate, perhaps at a higher rate relative to their population? I’ve recently realized that I know of pretty good data sets to make a stab at answering this, so let’s go!
Downloading Population Data for U.S. States
For this first blog post, I am only going to look at mentions of state names, so let’s download state population data from the U.S. Census Bureau. I use Census data from the American Community Survey for my work, so let’s use the acs package to find the most recent total population estimates for each state. (If you haven’t used the acs package before, you will need to get an API key and run api.key.install() one time to install your key on your system.)
What do we have here, just to check?
There we go! We now have a data frame ready to go with the state names and their corresponding populations.
For a data set of song lyrics, I am going to use the compilation of Billboard’s Year-End Hot 100 from 1958 to the present put together by Kaylin Walker. Her analysis is wonderful and so fun, and she has the data as well as her code for scraping/analysis on GitHub. This is a data set of pop lyrics; this means that a) my beloved Lyle Lovett is not in it and b) it is certainly going to be biased in certain ways compared to other genres when it comes to mentions of place names. However, it is somewhere to start.
Finding the State Names in the Song Lyrics
Now we need to find the mentions of each state as they appear in these song lyrics. State names are one or two words, so we will use unnest_tokens from the tidytext package, but we will do it twice. First, we’ll unnest looking for single words, and then we’ll unnest making bigrams, all the combinations of two consecutive words in the song lyrics. We will bind these two data frames together to get all the possible words and bigrams that might contain state names.
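Here is a minimal sketch of that two-pass tokenization, using a couple of made-up lyric lines rather than the Billboard data. The `lyrics` data frame, its `song` and `text` columns, and the lyric text itself are assumptions for illustration; the output column is named `state_name` to match the variable discussed in this post.

```r
library(dplyr)
library(tidytext)

# Toy lyrics invented for illustration (not from the Billboard data set)
lyrics <- tibble(
  song = c("song_a", "song_b"),
  text = c("leaving new york for georgia tonight",
           "georgia georgia on my mind")
)

# Pass 1: one row per single word (catches "georgia")
words <- lyrics %>%
  unnest_tokens(state_name, text)

# Pass 2: one row per pair of consecutive words (catches "new york")
bigrams <- lyrics %>%
  unnest_tokens(state_name, text, token = "ngrams", n = 2)

# Stack both sets of candidates into one data frame
tidy_lyrics <- bind_rows(words, bigrams)
```

Note that unnest_tokens lowercases its output by default, so any table of state names you join against later would need to be lowercased too (or you could set `to_lower = FALSE`).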
The variable state_name in this data frame contains all the possible words and bigrams that might be state names in all the lyrics.
Now we can use an inner join to find all the state names that are actually there.
Let’s only count each state once per song that it is mentioned in.
Let’s count these up now!
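The join, de-duplicate, and count steps might look something like the sketch below. The `tidy_lyrics` rows here are invented, and the lookup table is built from R's built-in `state.name` vector rather than from Census data, so treat this as an illustration of the pattern and not the actual analysis.

```r
library(dplyr)

# Hypothetical candidate tokens (words and bigrams), one row per occurrence
tidy_lyrics <- tibble(
  song = c("song_a", "song_a", "song_a", "song_b", "song_b", "song_c"),
  state_name = c("new york", "georgia", "tonight",
                 "georgia", "georgia", "montana")
)

# Lookup table of lowercase state names (state.name ships with base R)
states <- tibble(state_name = tolower(state.name))

state_counts <- tidy_lyrics %>%
  inner_join(states, by = "state_name") %>%  # keep only real state names
  distinct(song, state_name) %>%             # count each state once per song
  count(state_name, sort = TRUE)             # number of songs per state
```

In this toy example, "georgia" appears twice in song_b but only contributes one song to its count, which is exactly what the distinct() step is for.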
Now, I am going to use my vast knowledge of pop culture here and suggest that these mentions of New York are referencing New York City, not the state of New York, as lovely as it may be. I’ll keep them in for now but we should be aware of that. Also, I am a bit surprised the numbers are this low overall; this makes me long for BIGGER DATA.
Let’s calculate a number relative to the population of each state (mentions per million population).
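As a rough sketch of that calculation: the counts and populations below are made-up round numbers chosen to keep the arithmetic easy, not the real Census estimates or the real song counts.

```r
library(dplyr)

# Made-up counts and populations, for illustration only
state_counts <- tibble(
  state_name = c("georgia", "montana"),
  n          = c(20, 2),
  population = c(10e6, 1e6)
)

# Mentions per million population
state_counts <- state_counts %>%
  mutate(rate = n / (population / 1e6)) %>%
  arrange(desc(rate))
```

With these invented numbers both states land on the same rate (2 songs per million people) even though one has ten times as many mentions, which is the whole point of normalizing by population.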
I was a little surprised that Maine was so high so I checked on those.
“King of the Road”, OK, sure, but it turns out that Mack Maine is a rap artist who is the president of a label named Young Money. It is possible there are other examples of this kind of confusion in this analysis, but I checked most of the other states and did not find any. The other state names seen here seem less likely to fall into such a mistake anyway. Let’s drop Maine’s number down to 1 and recalculate the rate.
Making a Map
Let’s map these values so we can visualize which states have more or fewer mentions in the Billboard Year-End Hot 100. I’m going to use the minimap package from Sean Kross because I think a tile grid map is a good way to display this kind of information. (I don’t want the relative geographical areas of states to mess too much with people’s visual perception here.)
The minimap package needs two things (mainly) to make a map: a vector of state postal abbreviations and a vector of colors. Let’s work on making those.
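One way to build those two vectors with only base R is sketched below. The rates are invented, the five-bin scheme and the blue palette are my own assumptions, and I stop short of calling the minimap plotting function itself; the idea is just to end up with a color for every postal abbreviation, in the same order.

```r
# Postal abbreviations for the 50 states (state.abb ships with base R)
states <- state.abb

# Hypothetical rates per million; every state not listed gets 0
rates <- setNames(rep(0, length(states)), states)
rates[c("GA", "MS", "HI", "MT")] <- c(2.0, 3.5, 2.8, 1.9)  # illustrative only

# Five colors running from light to dark blue
palette <- colorRampPalette(c("#F7FBFF", "#08306B"))(5)

# Cut the rates into 5 equal-width bins and look up a color for each state
bins <- cut(rates, breaks = 5, labels = FALSE)
state_colors <- palette[bins]
```

Because `state_colors` is built from `rates`, which is in the same order as `states`, the two vectors line up element by element.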
Now let’s make some maps.
LOOK, EVERYONE, I DID BASE GRAPHICS. (After I made these plots, I rediscovered that Bob Rudis has a ggplot-based package for a similar tile grid map called statebins.) Also, as a reminder, we can probably ignore the numbers for New York, as they all appear to reference New York City, not the state.
Let’s combine these into an animated GIF using the magick package.
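A minimal sketch of the GIF step with magick, assuming the frames are drawn with base graphics. The two placeholder plots, the frame size, the one-frame-per-second speed, and the temporary output path are all my assumptions here, standing in for the real maps.

```r
library(magick)

# Open a magick graphics device; each new plot becomes one frame
frames <- image_graph(width = 400, height = 300, res = 72)
plot(1:10, main = "Frame 1 (placeholder for the first map)")
plot(10:1, main = "Frame 2 (placeholder for the second map)")
dev.off()

# Turn the frames into an animation at 1 frame per second
animation <- image_animate(frames, fps = 1)

# Write the animation out as a GIF
out_path <- file.path(tempdir(), "state_maps.gif")
image_write(animation, path = out_path)
```

If the maps were already saved as image files, reading them with image_read() and passing the result to image_animate() should work in much the same way.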
Another way we might visualize this kind of information could be a cartogram, where the geometry of a map is distorted to show some variable. You can see some comparisons of tile grid maps (square and hexagonal) and a cartogram at this NPR post from last year. There is an R package from Sebastian Jeworutzki that will create a cartogram from a SpatialPolygonsDataFrame, so let’s give it a go.
The cartogram function could not accept some of the states having NA or zero for their rate value, which makes sense. When I tried using a very small number for the states which have zero mentions in this data set, the algorithm could not converge in a reasonable amount of time. I ended up using a small-ish but not-too-close to zero number for those states in order to have the distorting algorithm converge. Anyway, that code above has done the distorting; now let’s map this.
I’m actually not so sure about this one. It’s cool that it is possible but I think I prefer the tile map for actually communicating the information.
Both kinds of maps show how important states like Mississippi, Georgia, Alabama, Tennessee, and Kentucky are in song lyrics. And remember that this is nominally pop music, not country music per se! Hawaii and Montana also have strong showings, relative to their populations.
The rates per million population presented in the map are more uncertain for states that were mentioned in, say, 2 songs (like Montana) than for states that were mentioned many more times (like Georgia), even if those numbers relative to population were about the same. Georgia was mentioned about 10 times more often than Montana, meaning the sample size used to calculate Georgia’s rate is about 10 times bigger than the sample size used to calculate Montana’s rate. Thanks to our old friend, the Central Limit Theorem, this means the uncertainty associated with Montana’s rate measurement is about √10 ≈ 3 times bigger. For a more rigorous analysis, it might be worth calculating those differences in uncertainty and reporting them.
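To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the song counts behave like Poisson counts, so the standard deviation of a count n is about sqrt(n) and the relative uncertainty is about 1/sqrt(n); that is a simplifying assumption on my part, and the counts of 2 and 20 are illustrative stand-ins for "a few" and "about ten times as many".

```r
# Illustrative mention counts only, not the real totals
mentions <- c(montana = 2, georgia = 20)

# Poisson approximation: relative uncertainty of a count n is about 1 / sqrt(n)
rel_uncertainty <- 1 / sqrt(mentions)

# How much more uncertain is the rate for the state with fewer mentions?
ratio <- unname(rel_uncertainty["montana"] / rel_uncertainty["georgia"])
```

Here `ratio` works out to sqrt(10), a bit more than 3, so a state with one tenth the mentions has a rate that is roughly three times less precise in relative terms.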
I would like to extend this analysis to city names next, but I feel like I barely eked out anything useful or meaningful here, given the song counts I ended up with. I would love to work with a different data set of song lyrics that included more lyrics and/or more genres of music; I’ve thought about doing something with the Million Song Dataset from musiXmatch, or maybe I need to do some scraping myself. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!