Using ggplot2 to plot last.fm top 100 albums
I began by downloading and importing the tab-separated (TSV) data file from last.fm.
# read data
lastfm <- read.delim("~/Downloads/bestof_2011_tsv/bestof_2011_releases.tsv")
Then I did some data cleanup, because one row just contained junk and some columns were unnecessary. I also removed all entries after row 100.
# remove row 541 'cause it's just junk
lastfm <- lastfm[-541,]
# remove unnecessary columns
lastfm <- lastfm[-c(3, 5)]
# remove all rows after 100
lastfm <- lastfm[-c(101:nrow(lastfm)) , ]
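One caveat with the negative-index approach: if the data frame ever has exactly 100 rows, 101:nrow(lastfm) counts backwards (101, 100) and drops row 100 as well. A minimal sketch on a made-up data frame (the name toy and its columns are illustrative, not part of the last.fm data) shows head() as a variant that avoids this:

```r
# Toy data frame with exactly 100 rows (hypothetical stand-in for lastfm)
toy <- data.frame(rank = 1:100, plays = 100:1)

# 101:nrow(toy) evaluates to c(101, 100), so row 100 is dropped by accident
risky <- toy[-c(101:nrow(toy)), ]

# head() keeps at most the first 100 rows and never counts backwards
safe <- head(toy, 100)

nrow(risky)
nrow(safe)
```

With 600 or so rows, as in the original file, both forms should give the same result; the difference only shows up at the boundary.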
I did a search for missing values, but none were found.
which(lastfm == "NULL", arr.ind = TRUE)
which(is.na(lastfm), arr.ind = TRUE)
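To illustrate what these two calls report when something is missing, here is a small sketch on invented data (the toy data frame is an assumption for demonstration only). Each call returns a matrix with one row per hit; a result with zero rows means nothing was found.

```r
# Toy data frame with one NA and one literal "NULL" string (illustrative only)
toy <- data.frame(artist = c("A", "NULL", "C"),
                  plays  = c(10, NA, 30),
                  stringsAsFactors = FALSE)

# Each call returns a matrix giving the row and column of every match
null.hits <- which(toy == "NULL", arr.ind = TRUE)
na.hits   <- which(is.na(toy), arr.ind = TRUE)

null.hits
na.hits
```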
The XML file contained information about each artist's location, so I loaded it and cleaned it up a bit. The location column was messy, so I edited it manually in Stata's data editor, which I figured was the easiest way. I then read the edited data file back into R and combined that data.frame with the rest of the data from the TSV file.
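library(XML)
library(foreign) # provides write.dta() and read.dta()
last.xml <- xmlToDataFrame("~/Downloads/bestof_2011_xml/bestof_2011_releases.xml")
# keep only the first 100 rows
last.xml <- last.xml[-c(101:nrow(last.xml)) , ]
# drop the columns that are not needed
last.xml <- last.xml[-c(1,4,5,6,7,8,9)]
# export to Stata format for manual editing
write.dta(last.xml, "stata", version = 7L)
# read the edited Stata file back in
last.xml <- read.dta(file="/Users/Kris/stata.dta")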
# combine data.frames
lastfm <- cbind(lastfm, location = last.xml$location)
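Note that cbind() pairs rows purely by position, so it relies on both data frames listing the albums in the same order. If a shared key column is available, merge() is a more defensive option. A small sketch with made-up data follows; the album key column is an assumption, since the files' shared columns are not shown here.

```r
# Made-up stand-ins for the TSV and XML data; "album" is a hypothetical key
plays <- data.frame(album = c("x", "y", "z"),
                    album.plays = c(30, 20, 10),
                    stringsAsFactors = FALSE)
places <- data.frame(album = c("z", "x", "y"),
                     location = c("UK", "USA", "Canada"),
                     stringsAsFactors = FALSE)

# merge() matches rows on the key, so differing row order does not matter
combined <- merge(plays, places, by = "album")
combined
```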
I tried plotting this data.frame with ggplot2, but the location variable contained 17 countries, which made for a messy plot. I therefore chose to group some countries under the label “Other”.
lastfm$location <- as.character(lastfm$location)
lastfm$location[lastfm$location %in% c("Denmark", "Sweden")] <- "Sweden/Denmark"
lastfm$location[lastfm$location %in% c("Germany",
"France","Paris","Australia",
"New Zealand",
"Iceland","Brazil", "Scotland",
"Democratic Republic of the Congo",
"Romania","Belgium",
"Netherlands")] <- "Other"
I still wasn’t satisfied with the plot, because it wasn’t sorted by album plays. I tried quite a lot of different methods of sorting the data.frame before figuring out how to do it successfully with reorder().
lastfm$artist.name <- reorder(lastfm$artist.name, rowSums(lastfm[4]))
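The reason reorder() works where sorting the data.frame does not is that ggplot2 draws a discrete axis in factor-level order, not row order. A minimal sketch on invented data (the toy data frame and its values are assumptions for illustration) shows how reorder() changes the levels:

```r
# Toy data: three artists with made-up play counts
toy <- data.frame(artist.name = c("B", "A", "C"),
                  album.plays = c(5, 50, 20))

# reorder() returns a factor whose levels are sorted by the second argument
toy$artist.name <- reorder(toy$artist.name, toy$album.plays)

# Levels now run from fewest to most plays
levels(toy$artist.name)
```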
I wanted the axis labels to be easy to read, with commas as thousands separators, so I created my own x-breaks.
library(scales)
x.breaks <- cbreaks(
  c(0, max(lastfm$album.plays)), # range: 0 to album.plays max
  pretty_breaks(10),             # 10 ticks
  labels = comma_format())       # create labels with commas, e.g. 10,000
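In case cbreaks() is not available in your version of scales, roughly the same breaks and labels can be sketched with base R alone. This is an alternative under stated assumptions, not the method used above: max.plays is a hypothetical value standing in for max(lastfm$album.plays), and the result is stored in a list with the same breaks and labels elements.

```r
# Hypothetical maximum play count, standing in for max(lastfm$album.plays)
max.plays <- 234567

# pretty() picks round tick positions covering the range (roughly 10 intervals)
breaks <- pretty(c(0, max.plays), n = 10)

# formatC() adds comma thousands separators, e.g. 50000 becomes "50,000"
labels <- formatC(breaks, format = "d", big.mark = ",")

# Same shape as the object built above: a list with breaks and labels
x.breaks <- list(breaks = breaks, labels = labels)
```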
I also used my own custom colors for the plot's legend, which I saved in a named vector before calling ggplot2.
location.color <- c("Canada" = "#7b8dbf",
"Other" = "#f97850",
"Sweden/Denmark" = "#df72b6",
"UK" = "#57b894",
"USA" = "#4a4a4a"
)
Then, at last, I drew the plot with ggplot2.
library(ggplot2)
ggplot(lastfm, aes(artist.name,album.plays, fill=location)) +
geom_bar(stat="identity") +
coord_flip() + # flip x and y
xlab("Album Artist") +
ylab("Album plays") +
# Use the labels and breaks I defined earlier
scale_y_continuous(breaks = x.breaks$breaks, labels = x.breaks$labels) +
# Add a plot title
ggtitle("Last.fm top 100 albums 2011") +
# Move the legend inside the plot to save space.
theme(legend.position = c(.85, .5),
# Change its background to white.
legend.background = element_rect(fill = "#ffffff")) +
# Use my custom color scale which I defined earlier.
scale_fill_manual("Artist homeland", values = location.color)

We can see that the plot is dominated by the USA and the UK, and that Adele and Lady Gaga got far more album plays than the rest. To summarize the location column, I used summary().
summary(as.factor(lastfm$location))
Which gave the following:
        Canada          Other Sweden/Denmark             UK            USA
             5             13              4             24             54
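The same kind of count can also be produced with table(), which does not need the conversion to a factor and is easy to sort. A short sketch on a made-up vector (not the real data):

```r
# Made-up location vector, for illustration only
location <- c("USA", "UK", "USA", "Canada", "USA")

# table() counts each distinct value directly
counts <- table(location)

# Sort so the most common location comes first
sort(counts, decreasing = TRUE)
```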