Using Last.fm to data mine my music listening history

(This article was first published on On The Lambda, and kindly contributed to R-bloggers)

Indie Rock
I've (passively) been keeping meticulous records of almost every song I've listened to since January of 2008. Since I opened my last.fm account 6 years ago, they've accumulated a massive detailed dataset of the 107,222 songs I've listened to since then. The best thing is that they're willing to share this data with me!

I used the last.fm developer REST API to (over a very long period of time) retrieve my entire listening history, the date(s) that I've listened to each song, and the top three user-submitted "tags" for each song.

I want to glean every bit of insight that I can out of this data. For this post, I focused on:

  • total listening history over time
  • music "diversity" levels
  • trends in my musical genre listening habits

In future posts, I hope to explore other things like using PCA to determine "orthogonal" music genres, construct similarity matrices, predict trends, and perform acoustic analysis.

This has been one of my favorite pet-projects because it combines three things that I love:

  • data mining
  • music
  • navel-gazing

I used both R and Python in this analysis. Let’s get into it!

Obtaining data
Getting the data using the last.fm REST API was very straightforward; the only hiccups I encountered were the fault of Python2's unicode snafus. For the web requests I used the urllib2 module and to handle the XML responses I used the amazing lxml module. The code to get my whole listening history looked a little like this:

#!/usr/bin/env python -tt

import urllib2
import time
from lxml import etree
from StringIO import StringIO

baseurl = ''.join(["http://ws.audioscrobbler.com/2.0/",
                   "?method=user.getrecenttracks",
                   "&user=statethatiamin&api_key=XXXXXXX&limit=200"])

def clean_xml(the_xml):
    return "\n".join(the_xml.split("\n")[3:-2])

# let's get the first page so we know how many pages there are
response = urllib2.urlopen(baseurl+"&page=1", timeout=200)
html = response.read()

# parse the XML tree
doc = etree.parse(StringIO(html))

# use Xpath to query the number of pages
num_pages = int(doc.xpath("/lfm/recenttracks")[0].get("totalPages"))

# file to dump results
fh = open("all_the_tracks.xml", "a")

for page in xrange(0, num_pages+1):
    # I'm nice so I don't want to hit last.fm
    # with a bunch of requests a second.
    # Let's wait ten seconds between each request
    time.sleep(10)
    progress = "On page {} of {}...........  {}%"
    print progress.format(str(page),
                          str(num_pages),
                          str(round(float(page)/num_pages*100, 1)))
    response = urllib2.urlopen(baseurl+"&page="+str(page))
    html = response.read()
    the_xml = clean_xml(html)
    fh.write(the_xml)
fh.close()

I decided to make the requests for the user-submitted tags in another python script. The script is a little too long to post here, but it basically iterated over all "track" nodes in the output of the last script, and parsed the results from a REST query of tags. Since I'm considerate, I put a long wait between each request for the over 100,000 songs. Even though I handled repeated tracks gracefully, it took days to finish. I used the pickle module to serialize the sum of data I got at regular intervals so a failure during the night of day 2 wouldn't have been catastrophic.

XML transformations and XPath
There is still a little bit of cleanup to do... I used various shell commands to remove all unnecessary elements from the XML documents and escape the characters that I forgot to escape. Then I had to organize the data by date so that I can do time series analysis. The script I used to accomplish this is as follows:

#!/usr/bin/env python -tt

from lxml import etree
import codecs

# read cleaned up track history XML
doc = etree.parse("escaped_processed.xml")

fh = codecs.open("bydate.xml", "a", encoding="utf-8")

# get all the dates (previously restricted to just month and year)
udates = list(set([date.text for date in doc.xpath("//date")]))

# create a new DOM tree to hang the transformation upon
root = etree.Element("bydate")

for cdate in udates:
    # element tags can't start with a number
    # add a "d" to it
    this = etree.SubElement(root, 'd' + cdate)
    # get all tracks listened to on that date
    these_tracks = [node for node in
                    doc.xpath("/alltags/track[date=" + cdate + "]")]
    # add the tracks to the DOM
    for itrack in these_tracks:
        this.append(itrack)

fh.write(etree.tostring(root, pretty_print=True))

Finally, I whipped up a quick script to sum the number of listens on a particular tag for each time interval.

At this time we have a file "playnumbymonth.csv" with the dates and total tracks listened to for that month that looks like this...

date,numlistens
03-2008,1422
10-2008,1394
05-2008,923
12-2009,640
10-2009,630
..........

and ("melted") file called "longformat.csv" that holds dates, tag names, and the number of tracks (played in that month) that contained the tag. It looks like this...

date,tag,number
03-2008,folk rock,1
03-2008,summer,1
03-2008,spoken word,2
03-2008,cute,5
03-2008,dance,11
..........

R analytics and visualization
First, to visualize the number of songs I’ve listened to over time, I had to import the "playnumbymonth.csv" dataset, parse the date with the lubridate package, make a "zoo" time series object out of the dataframe, and plot it.

library(zoo)
library(lubridate)

plays <- read.csv("playnumbymonth.csv", stringsAsFactors=FALSE)

# parse dates
plays$date <- parse_date_time(plays$date, "my")

#make time series object
tsplays <- read.zoo(plays)

#plot it with a LOWESS smooth curve
loline <- lowess(tsplays, f=.5)
plot(tsplays, main="Plays per month since 2008", ylab="Number of plays", xlab="Date")
lines(index(tsplays), loline$y, col='red', lwd=2)

The resulting plot looks like this:
Plays per month

While I was working with this data set, I wanted to check if there was any periodicity to my listening history (perhaps I listen to more music in the winter than I do in the summer). I briefly attempted to use seasonal decomposition and autocorrelation to try to detect this. No dice.

For the musical "diversity" and genre listening trends, I read in "longformat.csv", used reshape to aggregate (pivot) by tags until I had a huge matrix where each row was a month between 2008 and 2014, and each column was a last.fm tag. Then I used the vegan (vegetation analysis) package to take the Shannon diversity index of each month with respect to wealth and evenness of tags listened to:

long.tag.frame <- read.csv("longformat.csv", stringsAsFactors=FALSE)
long.tag.frame$date <- parse_date_time(long.frame$date, "my")

wide.frame <- data.frame(cast(long.tags.frame, date~tag))
# convert all NAs to zero
wide.frame[is.na(wide.frame)] <- 0

new.frame <- data.frame(wide.frame[,1])
new.frame$diversity <- diversity(wide.frame[,-1])

After some cleanup and "zoo" object creation, and LOWESS curve creation, the plot of the listening data and diversity indices looked like this:
Number of plays and variety

Visualizing how my music tastes have (appeared to) change over time was the best part. I created a diagonal matrix from the multiplicative inverse of number of tracks that I listened to each month and matrix-multiplied this with the wide tag matrix. The result of this computation yielded the proportion of songs I listened to each month that contained each tag.

I took a few choice tags corresponding to some of my favorite musical genres, put it in a new data frame ("tag.interest") and used the lattice package to visualize the trends.

tag.interest <- data.frame(dates)
tag.interest$Post.Punk <- prop.plays[,2227]
tag.interest$Indie <- prop.plays[,1413]
tag.interest$Punk <- prop.plays[,2270]
tag.interest$Coldwave <- prop.plays[,654]
tag.interest$Darkwave <- prop.plays[,762]
tag.interest$Twee <- prop.plays[,3003]
tag.interest$Indie.Pop <- prop.plays[,1422]
tag.interest$Hip.Hop <- prop.plays[,1337]

> names(tag.interest)
[1] "dates"     "Post.Punk" "Indie"     "Punk"      "Coldwave"  "Darkwave"  "Twee"      "Indie.Pop" "Hip.Hop"  

xyplot(read.zoo(tag.interest), type=c("l", "g"),
       ylab="Proportion of songs containing tag",
       main="Trends in musical genre listening habits",
       panel = function(x, y, col, ...) {
         panel.xyplot(x, y, col = "blue", ...)
         panel.loess(x, y, col = "red", lwd=3)
       })

This produced my favorite plot:
Genre listening trends

Looking at it, I remembered a period of time in 2009 that I listened to almost exclusively Hip-Hop music. I was also reminded that I got into the "coldwave" and "darkwave" genres rather recently and around the same time as each other in summer of 2011. Another neat result is that there is a fairly strong negative correlation between my "twee" music listening and my "darkwave" music listening history, as these genres are almost musical 'opposites'.

This had been a fun trip down memory lane for me. My only regret is that I didn't open my last.fm account sooner... as long as it was after a period in my childhood music-listening that I would be embarrassed to have on digital record.


To leave a comment for the author, please follow the link and comment on his blog: On The Lambda.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.