Peace through Music. Country clustering using R and the last.fm API

March 3, 2013

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

last.fm is an internet radio and music suggestion service. Registered users can also use last.fm to 'scrobble' tracks they've been listening to. last.fm then keeps track of a user's statistics in terms of top artists, albums and tracks.

Luckily, last.fm also has an API, which you can use as soon as you get a key for it. Thanks to this API, there are a lot of cool web-based applications for last.fm.

Today, I want to show you a few little things we can do with this API using R. I used (and modified) the R package RLastFM by Greg Hirson (thanks again, Greg!) to access the API and get the information.

I had the idea to group countries based on the listening habits ('scrobbles') of the people living there. Hierarchical clustering is the way to go here, I guess. As the distance between two countries, we can use one minus the share of artists that appear in both countries' top 50 lists, so countries with many shared artists end up close together.

First, we will need a function to access the API. This is just a convenience wrapper around the functions by Greg Hirson, which already work great.

library(RLastFM)

get.country.artists <- function (country) {
  geo.getTopArtists(country)$artist }

Now, we select some countries (I selected all OECD countries; that's kind of arbitrary, but it's a start). Note that the country names are defined by the ISO 3166-1 country names standard.

oecd.countries <- c("Belgium", "Denmark", "Germany", "France", "Greece", "Ireland", "Iceland", "Italy", "Canada", "Luxembourg", "Netherlands", "Norway", "Austria", "Portugal", "Sweden", "Switzerland", "Spain", "Turkey", "USA", "United Kingdom", "Japan", "Finland", "Australia", "New Zealand", "Mexico", "Czech Republic", "Korea, Republic of", "Hungary", "Poland", "Slovakia", "Chile", "Slovenia", "Israel", "Estonia")

Now, I access the last.fm API and put the results into a list.

countries <- sort(oecd.countries)

art.list <- list()
for (coun in countries) {
  cat(coun,"\n")
  art.list[[coun]] <- get.country.artists(coun) }
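
API calls fail now and then, and a single failing country would abort the whole loop. Here is a minimal sketch of a more forgiving loop in base R. Note that fetch.stub is a made-up stand-in for get.country.artists(), so the example runs without a key or network access, and "Atlantis" is only there to trigger an error.

```r
# Hypothetical stand-in for get.country.artists(); it fails for one
# country on purpose so the error handling has something to do.
fetch.stub <- function (country) {
  if (country == "Atlantis") stop("no data for ", country)
  paste(country, "artist", 1:3) }

# Return NULL instead of stopping when the fetch function throws an error
safe.fetch <- function (country, fetch = fetch.stub) {
  tryCatch(fetch(country),
           error = function (e) {
             message("Skipping ", country, ": ", conditionMessage(e))
             NULL }) }

toy.countries <- c("Germany", "Atlantis", "France")
toy.list <- list()
for (coun in toy.countries) {
  toy.list[[coun]] <- safe.fetch(coun) }

names(toy.list)   # "Atlantis" is missing, but the loop kept going
```

Assigning NULL to a list element that does not exist yet leaves the list unchanged, so failed countries simply do not show up in the result.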

Afterwards, we need to create a distance matrix based on the number of overlapping artists of two countries. First, I define a function to intersect two artist lists:

intersect.countries <- function (country1.artists, country2.artists) {
  length(intersect(country1.artists, country2.artists)) }
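
To see what the function returns, here is a small example with two made-up top 5 lists (the artist lists are invented for illustration, not taken from the API). The last line turns the overlap into a distance the same way the post does for the top 50.

```r
intersect.countries <- function (country1.artists, country2.artists) {
  length(intersect(country1.artists, country2.artists)) }

# Made-up top 5 lists, only for illustration
toy.a <- c("Coldplay", "Radiohead", "Muse", "Adele", "The Beatles")
toy.b <- c("Adele", "Coldplay", "Daft Punk", "Phoenix", "Muse")

# Shared artists: Coldplay, Muse and Adele
overlap <- intersect.countries(toy.a, toy.b)
overlap

# Distance on a top 5 basis: one minus the share of shared artists
toy.dist <- 1 - overlap / length(toy.a)
toy.dist
```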

Now, I use the function on every possible pair of countries, write the results into a matrix and convert this matrix into a distance matrix.

result.mat <- c()
for (coun in countries) {
  new.vec <- c()
  for (i in 1:length(countries)) {
    new.dist <- 1 - (intersect.countries(art.list[[coun]], art.list[[countries[i]]]) / 50)
    new.vec <- c(new.vec, new.dist) }
  result.mat <- rbind(result.mat, new.vec) }
colnames(result.mat) <- countries
rownames(result.mat) <- countries
dists <- as.dist(result.mat, diag = T, upper = T)
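
The same kind of matrix can also be built without growing vectors in a loop. This is a sketch of that alternative, not the code used for the plot: the three countries and their top 4 lists are made up, and top.n plays the role of the 50 above.

```r
# Made-up top 4 lists for three fictional countries
toy.list <- list(
  Ahland  = c("a", "b", "c", "d"),
  Beeland = c("a", "b", "c", "x"),
  Ceeland = c("w", "x", "y", "z"))
top.n <- 4

# Overlap counts for every pair of countries; the nested sapply()
# returns a matrix whose row and column names come from the list
overlap <- sapply(toy.list, function (x)
  sapply(toy.list, function (y) length(intersect(x, y))))

# Turn overlap counts into distances between 0 (same list) and 1 (no overlap)
toy.mat <- 1 - overlap / top.n
toy.mat
as.dist(toy.mat)
```

Dividing by a variable instead of a hard-coded 50 also makes it obvious what has to change if you ask the API for a different number of artists.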

Now, I'm doing the hierarchical clustering. I'm choosing the Ward method.

dists.clust <- hclust(dists, method = "ward")  # R >= 3.1.0 calls this method "ward.D"
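
If you want the groups as a vector rather than only as a picture, cutree() from base R cuts the tree into a given number of clusters. A small sketch on made-up distances (the matrix is invented so that A/B and C/D clearly belong together); it uses "ward.D", the name this method has since R 3.1.0.

```r
# Made-up distances: A and B are close, C and D are close
toy.mat <- matrix(c(0.0, 0.1, 0.9, 0.8,
                    0.1, 0.0, 0.8, 0.9,
                    0.9, 0.8, 0.0, 0.1,
                    0.8, 0.9, 0.1, 0.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(LETTERS[1:4], LETTERS[1:4]))

toy.clust <- hclust(as.dist(toy.mat), method = "ward.D")  # needs R >= 3.1.0

# Cut the tree into two groups; the result is a named integer vector
groups <- cutree(toy.clust, k = 2)
groups
```

Here A and B end up in group 1, C and D in group 2.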

And now for the plot (finally!)...

plot(dists.clust, main = "Clustering Dendrogram, Method: Ward", xlab = "Similarities based on number of overlapping artists in top 50 artists", sub = "", cex = 0.9)


It makes sense, doesn't it? Countries with many overlapping artists in the top 50 share one branch of the clustering tree. Other groups of countries are 'clustering in' later. In the right-most branch, large portions of Scandinavia (except Iceland) are clustering together. For some countries, I don't have an explanation (Iceland and Portugal?).

Currently, I'm experimenting with some visualization techniques using the nice R maps package.

last.fm also supplies metro charts, i.e. separate charts for specific cities. Let's play around with them. First, we're going to need some new functions (these are adaptations from the RLastFM package, and you're going to need to insert your own API key to make them work).

library(XML)    # xmlParse, xpathSApply, xmlValue
library(RCurl)  # getForm

# Names of all metros last.fm lists for a country
get.all.metros <- function (country, lastapi = RLastFM:::baseurl) {
  xpathSApply(xmlParse(getForm(lastapi, method = "geo.getMetros", country = country, api_key = "<your_key_here>"), asText = TRUE), "//metro/name", xmlValue) }

# Parse the XML reply of geo.getMetroArtistChart
# (note: 'playcount' actually holds the listeners field)
p.geo.getMetroArtistChart <- function (f) {
  doc <- xmlParse(f, asText = TRUE)
  list(artist = xpathSApply(doc, "//artist/name", xmlValue),
       playcount = xpathSApply(doc, "//artist/listeners", xmlValue)) }

# Artist chart for one metro
get.metro.artist <- function (metro, country = "germany", n = 100) {
  p.geo.getMetroArtistChart(
    getForm(RLastFM:::baseurl,
            method = "geo.getMetroArtistChart",
            country = country,
            metro = metro,
            limit = n,
            api_key = "<your_key_here>")) }
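
To try the parsing step without an API key, you can feed the same XPath expressions a hand-written XML string. The snippet below is only a toy: its structure is guessed from the XPath expressions used here, not copied from a real last.fm reply, and the numbers are invented.

```r
library(XML)

# Toy reply; a real API reply would contain more fields
toy.xml <- '<lfm status="ok"><topartists>
  <artist><name>Coldplay</name><listeners>1200</listeners></artist>
  <artist><name>Radiohead</name><listeners>950</listeners></artist>
</topartists></lfm>'

doc <- xmlParse(toy.xml, asText = TRUE)

# xmlValue returns text, so the counts need an explicit conversion
artists <- xpathSApply(doc, "//artist/name", xmlValue)
listeners <- as.numeric(xpathSApply(doc, "//artist/listeners", xmlValue))
artists
listeners
```

The conversion with as.numeric() matters as soon as you want to sort or compare the counts, because the parser hands them back as character strings.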

Now, let's use them to extract all metros supported in Germany and France. Afterwards, build two lists with metro charts.

de.metros <- get.all.metros(country = "germany")
fr.metros <- get.all.metros(country = "france")

build.metro.chart.list <- function (metros, country) {
  metro.chart.list <- list()
  for (metro in metros) {
    cat(metro, "\n")
    metro.chart.list[[metro]] <- get.metro.artist(metro, country = country) }
  metro.chart.list }

# Reuse the metro names fetched above instead of calling the API again
de.metro.charts <- build.metro.chart.list(de.metros, "germany")
fr.metro.charts <- build.metro.chart.list(fr.metros, "france")

Now, load the maps package and the dataset of cities that comes with it. Then, draw Germany and France.

library(maps)
data(world.cities)
map(database = "world", regions = c("Germany", "France"), exact = T)


Here comes the fun part: look into world.cities for each metro and write the top artist of each metro at the location of the city (under the city's name). Please note that there are two Frankfurts and two Lilles in world.cities. I have to select the correct ones.

for (city in names(de.metro.charts)) {
  city.info <- world.cities[world.cities$name == city,]
  if (nrow(city.info) == 0) next  # metro name not found in world.cities
  if (city.info$name[1] == "Frankfurt") city.info <- city.info[1,]
  text(x = city.info$long, y = city.info$lat, labels = city.info$name, cex = .6)
  text(x = city.info$long, y = city.info$lat - 0.25,
       labels = de.metro.charts[[city]]$artist[1],
       col = "#FF0000FF", cex = .6)
}

for (city in names(fr.metro.charts)) {
  city.info <- world.cities[world.cities$name == city,]
  if (nrow(city.info) == 0) next  # metro name not found in world.cities
  if (city.info$name[1] == "Lille") city.info <- city.info[2,]
  text(x = city.info$long, y = city.info$lat, labels = city.info$name, cex = .6)
  text(x = city.info$long, y = city.info$lat - 0.25,
       labels = fr.metro.charts[[city]]$artist[1],
       col = "#FF0000FF", cex = .6)
}
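
Hard-coding the row number works for Frankfurt and Lille, but it gets tedious with more countries. One possible alternative, sketched here on a made-up stand-in for world.cities (toy.cities and pick.city are hypothetical names, and the coordinates and populations are rough), is to always take the most populous city of a given name. This is a different rule from the one used above and may not pick the same rows in the real dataset.

```r
# Made-up stand-in for world.cities with one duplicated name
toy.cities <- data.frame(
  name = c("Frankfurt", "Frankfurt", "Berlin"),
  lat  = c(50.1, 52.3, 52.5),
  long = c(8.7, 14.5, 13.4),
  pop  = c(650000, 60000, 3400000),
  stringsAsFactors = FALSE)

# Return the largest city with the given name, or NULL if there is none
pick.city <- function (city, cities = toy.cities) {
  hits <- cities[cities$name == city, ]
  if (nrow(hits) == 0) return(NULL)
  hits[which.max(hits$pop), ] }

pick.city("Frankfurt")$lat   # latitude of the larger Frankfurt
pick.city("Atlantis")        # NULL, because there is no such city
```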



So much for today. I'm too shocked by Coldplay all over Germany to go on :)


