To be honest, the Radiohead I know and love and remember is that which was a rock band without a lot of ‘experimental’ tracks – a band you discovered on Big Shiny Tunes 2, or because your friends told you about it, or because it was playing in the background of a bar you were at sometime in the 90’s.
But I really do like their music, I’ve become familiar with more of it and overall it does possess a certain unique character in its entirety. Their range is so diverse and has changed so much over the years that it would be really hard not to find at least one track that someone will like. In this way they are very much like the Beatles, I suppose.
I was interested in doing some more content analysis type work and text mining in R, so I thought I’d try song lyrics and Radiohead immediately came to mind.
Normally it would be simply a matter of throwing something together in Python using Beautiful Soup as I have done previously. Unfortunately, due to the way these particular pages were coded, that proved to be a bit more difficult than expected.
Sometimes sitting down beforehand and looking at where you are getting it from, the format it is in and how to best go about getting it into the format you need will save you a lot of wasted time and frustration in the long run. Ask questions before you begin – what format is the data in now? What is the format I need/would like it to be in to do the analysis? What steps are required in order to get from one to the other (i.e. what is the data transformation or mapping process)?
Make it easy on your other developers (and the rest of the world in general) by labeling your
There is a large range of word counts, from the two truly instrumental tracks (Treefingers on Kid A and Hunting Bears on Amnesiac) to the wordier tracks (Dollars and Cents and A Wolf at the Door). Pablo Honey almost looks like it has two categories of songs – with a split around the 80 word mark.
Okay, interesting and all, but again these are small amounts of data and only so much can be drawn out as such.
Going forward we examine two calculated quantities.
Calculated Quantities – Lexical Density and ‘Lyrical Density’
In the realm of content analysis there is a measure known as lexical density which is a measure of the number of content words as a proportion of the total number of words – a value which ranges from 0 to 100. In general, the greater the lexical density of a text, the more content heavy it is and more ‘unpacking’ it takes to understand – texts with low lexical density are easier to understand.
According to Wikipedia the formula is as follows:
where Ld is the analysed text’s lexical density, NLex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text, and N is the number of all tokens (total number of words) in the analysed text.
Now, I am not a linguist, however it sounds like this is just the ratio of words which are not stopwords to the total number – or could at least be approximated by it. That’s what I went with in the calculations in R using the tm package (because I’m not going to write a package to calculate lexical density by myself).
On a related note, I completely made up a quantity which I am calling ‘lyrical density’ which is much easier to calculate and understand – this is just the number of lyrics per song over the track length, and is measured in words per second. An instrumental track would have lyrical density of zero, and a song with one word per second for the whole track would have a lyrical density of 1.
Interestingly, there are outlying tracks near the high end where the proportion of words to the song length is greater than 1 (Fitter Happier, A Wolf at the Door, and Faust Arp). Fitter Happier shouldn’t even really count, as it is really an instrumental track with a synthesized voice dubbed overtop. If you listen to A Wolf at the Door it is clear why the lyrical density is so high – Thom is practically rapping at points. Otherwise Kid A and The King of Limbs seem to have less quickly sung lyrics than the other albums on average.
Lexical Density + Lyrical Density
Putting it all together, we can examine the quantities for all of the Radiohead songs in one data visualization. You can examine different albums by clicking the color legend at the right, and compare multiple albums by holding CTRL and clicking more than one.
The songs are colour-coded by album. The points are plotted by lexical density along y-axis against the lyrical density along the x-axis and sized by total number of words in the song. As such, the position of the point in the plot gives an idea of the rate of lyrical content in the track – a song like I Might Be Wrong is fitting a lot less content words into a song at a slower rate than a track like A Wolf at the Door which is packed much tighter with both lyrics and meaning.
References & Resources