
# read data
lastfm <- read.delim("~/Downloads/bestof_2011_tsv/bestof_2011_releases.tsv")

Then I did some data cleanup, because one row just contained junk and some columns were unnecessary. I also removed all entries after row 100.

# remove row 541 'cause it's just junk
lastfm <- lastfm[-541,]
# remove unnecessary columns
lastfm <- lastfm[-c(3, 5)]
# remove all rows after 100
lastfm <- lastfm[-c(101:nrow(lastfm)) , ]
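The row trimming above relies on negative indexing. As a minimal sketch on a made-up data frame (`toy` is invented for illustration and is not part of the Last.fm data), this shows the same idiom next to `head()`, which keeps the first n rows and also behaves sensibly when the data has fewer rows than expected:

```r
# Toy data frame, invented for illustration only
toy <- data.frame(id = 1:5, value = letters[1:5])

# Negative indexing: drop rows 4 to the end, as in the cleanup above
dropped <- toy[-c(4:nrow(toy)), ]

# head() keeps the first 3 rows directly
kept <- head(toy, 3)

print(nrow(dropped))
print(nrow(kept))

# Pitfall: if the data already has only 3 rows, 4:nrow() counts
# backwards (4, 3), so the negative index silently drops row 3
short <- toy[1:3, ]
too.short <- short[-c(4:nrow(short)), ]
print(nrow(too.short))
```

With exactly the data in this post the two approaches should give the same 100 rows; `head(lastfm, 100)` is just a more defensive way to write it.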


I did a search for missing values, but none were found.

which(lastfm == "NULL", arr.ind = TRUE)
which(is.na(lastfm), arr.ind = TRUE)
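To illustrate what these two checks return, here is a minimal sketch on a small made-up data frame (`toy` and its values are invented for the example). Both calls give a matrix of row/column positions, which has zero rows when nothing matches.

```r
# Toy data frame, invented for illustration only
toy <- data.frame(artist = c("A", "B", "C"),
                  plays  = c("10", "NULL", NA),
                  stringsAsFactors = FALSE)

# Positions of cells holding the literal string "NULL"
null.cells <- which(toy == "NULL", arr.ind = TRUE)

# Positions of cells that are NA
na.cells <- which(is.na(toy), arr.ind = TRUE)

# Number of NA values per column, as a compact overview
na.per.column <- colSums(is.na(toy))

print(null.cells)
print(na.cells)
print(na.per.column)
```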

The XML file contained information about each artist's location, so I loaded it and cleaned it up a bit. The location column was a bit messy, so I edited it manually in Stata's data editor, which I figured was the easiest way. I then read the edited data file back into R and combined that data.frame with the rest of the data from the TSV file.

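library(XML)
library(foreign) # provides write.dta()
# keep only the first 100 rows, matching the TSV data
last.xml <- last.xml[-c(101:nrow(last.xml)) , ]
# remove unnecessary columns
last.xml <- last.xml[-c(1,4,5,6,7,8,9)]
# export to Stata format for manual editing
write.dta(last.xml, "stata", version = 7L)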

library(foreign)
# combine data.frames
lastfm <- cbind(lastfm, location = last.xml$location)

I tried plotting this data.frame with ggplot, but the location variable contained 17 countries, which made for a messy plot. I therefore chose to group some countries under the label “Other”.

lastfm$location <- as.character(lastfm$location)
lastfm$location[lastfm$location %in% c("Denmark", "Sweden")] <- "Sweden/Denmark"
lastfm$location[lastfm$location %in% c("Germany", "France", "Paris", "Australia", "New Zealand", "Iceland", "Brazil", "Scotland", "Democratic Republic of the Congo", "Romania", "Belgium", "Netherlands")] <- "Other"

I still wasn’t satisfied with the plot, because it wasn’t sorted by album plays. I tried quite a lot of different methods of sorting the data.frame before figuring out how to do it successfully with reorder().

lastfm$artist.name <- reorder(lastfm$artist.name, rowSums(lastfm[4]))

I wanted the axis labels to be readable numbers with thousands separators, so I created my own x-breaks.

library(scales)
x.breaks <- cbreaks(c(0, max(lastfm$album.plays)), # range: 0 to album.plays max
pretty_breaks(10), # 10 ticks
labels = comma_format()) # create labels with commas, i.e. 10,000.
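The reorder() step is easier to see on a tiny example. This is a minimal sketch with a made-up data frame (`toy` and its values are invented for illustration): reorder() leaves the rows alone and only changes the order of the factor levels, sorting them by the second argument in ascending order.

```r
# Toy data, invented for illustration only
toy <- data.frame(artist.name = c("A", "B", "C"),
                  album.plays = c(200, 50, 900),
                  stringsAsFactors = TRUE)

# By default the factor levels are alphabetical
levels.before <- levels(toy$artist.name)
print(levels.before)  # "A" "B" "C"

# reorder() sorts the levels by album.plays (ascending)
toy$artist.name <- reorder(toy$artist.name, toy$album.plays)
levels.after <- levels(toy$artist.name)
print(levels.after)   # "B" "A" "C"
```

Because ggplot2 draws a discrete axis in level order, reordering the levels is what makes the bars appear sorted by album plays.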


I also used my own custom colors for the plot's legend, which I saved in a named vector before calling ggplot2.

location.color <- c("Canada" = "#7b8dbf",
"Other" = "#f97850",
"Sweden/Denmark" = "#df72b6",
"UK" = "#57b894",
"USA" = "#4a4a4a"
)


Then, at last, I drew the plot with ggplot2.

library(ggplot2)
ggplot(lastfm, aes(artist.name,album.plays, fill=location)) +
geom_bar(stat="identity") +
coord_flip() + # flip x and y
xlab("Album Artist") +
ylab("Album plays") +
# Use the labels and breaks I defined earlier
scale_y_continuous(breaks = x.breaks$breaks, labels = x.breaks$labels) +
opts(title = "Last.fm top 100 albums 2011",
# Move the legend inside the plot to save space.
legend.position=c(.85, .5),
# Change its background to white.
legend.background=theme_rect(fill="#ffffff")) +
# Use my custom color scale which I defined earlier.
scale_fill_manual("Artist homeland", values = location.color)
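In later ggplot2 releases opts() and theme_rect() were superseded by theme() and element_rect(). What follows is only a sketch of how the same plot might be written with those newer names, not the code used for this post. It assumes a small made-up data frame (`toy`, `toy.color`) in place of lastfm, and it passes pretty_breaks() and comma from scales straight to the scale instead of using x.breaks.

```r
library(ggplot2)
library(scales)

# Toy stand-in for the lastfm data.frame (values invented for illustration)
toy <- data.frame(artist.name = c("A", "B", "C"),
                  album.plays = c(120000, 45000, 300000),
                  location    = c("UK", "USA", "Other"))
toy$artist.name <- reorder(toy$artist.name, toy$album.plays)

toy.color <- c("Other" = "#f97850", "UK" = "#57b894", "USA" = "#4a4a4a")

p <- ggplot(toy, aes(artist.name, album.plays, fill = location)) +
  geom_bar(stat = "identity") +
  coord_flip() + # flip x and y
  xlab("Album Artist") +
  ylab("Album plays") +
  # about 10 ticks, labelled with thousands separators
  scale_y_continuous(breaks = pretty_breaks(10), labels = comma) +
  # ggtitle() replaces opts(title = ...)
  ggtitle("Last.fm top 100 albums 2011") +
  # theme() replaces opts(), element_rect() replaces theme_rect();
  # recent ggplot2 versions may warn about numeric legend positions
  theme(legend.position = c(.85, .5),
        legend.background = element_rect(fill = "#ffffff")) +
  scale_fill_manual("Artist homeland", values = toy.color)
```

Calling print(p) draws the plot.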


We can see that the plot is dominated by the USA and the UK, and that Adele and Lady Gaga got far more album plays than the rest. To give a summary of $location I used summary().

summary(as.factor(lastfm$location))


Which gave the following:

        Canada          Other Sweden/Denmark             UK            USA
             5             13              4             24             54
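The same kind of count can also be produced with table(), which works on a character vector without converting it to a factor first. A minimal sketch on a made-up vector (`location` here is invented for illustration and is not the real column):

```r
# Toy location vector, invented for illustration only
location <- c("USA", "UK", "USA", "Other", "UK", "USA")

# Counts per level via summary() on a factor, as in the post
counts.summary <- summary(as.factor(location))

# The same counts via table()
counts.table <- table(location)

print(counts.summary)
print(counts.table)

# Shares instead of raw counts
shares <- prop.table(counts.table)
print(shares)
```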