
I have been playing with R to parse HTML. After reading about visualising “fantasy football” search traffic with RGoogleTrends at The Log Cabin blog, I decided to write a few functions that do similar things with Wikipedia search statistics.

This is what I have managed to come up with:

wikiStat <- function (query, lang = 'en',
                      monback = 12,
                      since = Sys.Date() ) {

  require(mondate)
  require(XML)

  namespace <- c("a" = "http://www.w3.org/1999/xhtml")
  wikidata <- data.frame()

  #iterate "monback" number of months back
  for (i in 1:monback) {
    #get number of days in a given month and create a vector
    curdate <- strptime(mondate(since) - (i - 1), "%Y-%m-%d")
    previous <- strptime(mondate(since) - (i - 2), "%Y-%m-%d")
    noofdays <- round(as.numeric(previous - curdate), 0)
    days <- seq(from = 1, to = noofdays, by = 1)

    #build url
    if(curdate$mon + 1 < 10) {
      dateurl <- paste(as.character(curdate$year + 1900), "0",
                       as.character(curdate$mon + 1), sep = "")
    } else {
      dateurl <- paste(as.character(curdate$year + 1900),
                       as.character(curdate$mon + 1), sep = "")
    }
    url <- paste("http://stats.grok.se/", lang, '/',
                 dateurl, '/', query, sep = "")

    #get and parse a wikipedia statistics webpage
    wikitree <- xmlTreeParse(url, useInternalNodes = TRUE)

    #find nodes specifying traffic
    traffic <- xpathSApply(wikitree, "//a:li[@class='sent bar']/a:p",
                           xmlValue, namespaces = namespace)

    #edit obtained strings (sometimes they are in the format
    # of e.g. 7.5k meaning 7500)
    traffic <- gsub("\\.", "", traffic)
    traffic <- gsub("k", "00", traffic)
    traffic <- as.numeric(traffic)

    #it seems that there is some kind of a bug in wikipedia statistics
    # and the results are shifted by one day in month - this is a fix
    if(length(traffic) > noofdays) {
      traffic <- traffic[2:length(traffic)]
    }

    #create daily dates relating to traffic vector
    #and create a dataframe
    days <- seq(from = 1, to = length(traffic), by = 1)
    yearmon <- rep(paste(curdate$year + 1900,
                         curdate$mon + 1, sep = "-"), length(traffic))
    date <- as.Date(paste(yearmon, days, sep = "-"), "%Y-%m-%d")
    wikidata <- rbind(wikidata, data.frame(date, traffic))
  }

  #remove rows that are missing (due to the bug?)
  wikidata <- wikidata[!is.na(wikidata$date),]

  #return dataframe
  return(wikidata)
}

wikiPlotStat <- function(wikitraffic,
                         title = "Wikipedia statistics") {
  require(ggplot2)

  #create a plot
  #(ggtitle() and theme() replace opts(), which was removed from ggplot2)
  wikiplot <- ggplot() + geom_bar(aes(x = date, y = traffic,
                                      fill = traffic),
                                  stat = "identity",
                                  data = wikitraffic) +
    ggtitle(title)

  #...with no legend and a theme that fits colours of my blog ;)
  wikiplot <- wikiplot + theme_bw() + theme(legend.position = "none")

  return(wikiplot)
}
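
One caveat about the string clean-up in wikiStat(): removing the dot and replacing "k" with "00" only works when an abbreviated count has exactly one decimal digit, as in 7.5k. A label such as 8k or 7.25k would come out wrong. Below is a minimal sketch of a numeric conversion, assuming the labels are either plain numbers or numbers followed by "k". The name parseTraffic is hypothetical and not part of the functions above.

```r
# Hypothetical helper: convert labels such as "7.5k" or "123" to numbers.
# Assumption: a trailing "k" means thousands; anything else is a plain number.
parseTraffic <- function(x) {
  x <- trimws(x)
  thousands <- grepl("k$", x)            # which labels are abbreviated
  value <- as.numeric(sub("k$", "", x))  # drop the suffix, keep the decimal dot
  value[thousands] <- value[thousands] * 1000
  value
}

parseTraffic(c("7.5k", "8k", "123"))
```

If this were used, it would stand in for the two gsub() calls and the as.numeric() call inside the loop.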

With these two functions you can take a look at search traffic for any article you wish. For instance, we can look at the statistics for “Financial crisis”. The wikiStat() function returns a data frame with the necessary data:

#look 40 months back from now
critraffic <- wikiStat("Financial_crisis", monback = 40)
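
Since the result has one row per day with the columns date and traffic, it can also be summarised with base R before plotting. Here is a minimal sketch of monthly totals. The data frame demo is synthetic and only stands in for real output, and the column name month is my own choice.

```r
# Synthetic stand-in for the output of wikiStat(): one row per day
demo <- data.frame(
  date = seq(as.Date("2008-09-01"), as.Date("2008-10-31"), by = "day"),
  traffic = 100
)

# Label each day with its month and sum the traffic per month
demo$month <- format(demo$date, "%Y-%m")
monthly <- aggregate(traffic ~ month, data = demo, FUN = sum)
monthly
```

With real data, critraffic would take the place of demo.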

To plot the data easily we can use the second function:

criplot <- wikiPlotStat(critraffic,
"Wikipedia search traffic for 'Financial crisis'")
criplot

And this is the result:

You can clearly see the outbreak of the crisis in the second half of 2008, when Lehman Brothers collapsed. Since then, people still seem willing to learn about the crisis.

Do you have any suggestions?