Visualising Wikipedia search statistics with R

August 6, 2011
(This article was first published on Expansed » R, and kindly contributed to R-bloggers)

I have been playing with R to parse HTML. After reading about visualising “fantasy football” search traffic with RGoogleTrends at The Log Cabin blog, I decided to write a few functions to do similar things with Wikipedia search statistics.

This is what I have managed to come up with:

wikiStat <- function (query, lang = 'en',
                      monback = 12,
                      since = Sys.Date() ) {
 
  #load packages
  require(mondate)
  require(XML)
 
  namespace <- c("a" = "http://www.w3.org/1999/xhtml")
  wikidata <- data.frame()
 
  #iterate "monback" number of months back
  for (i in 1:monback) {
    #get number of days in a given month and create a vector
    curdate <- strptime(mondate(since) - (i - 1), "%Y-%m-%d")
    previous <- strptime(mondate(since) - (i - 2), "%Y-%m-%d")
    noofdays <- round(as.numeric(previous - curdate), 0)
    days <- seq(from = 1, to = noofdays, by = 1)
 
    #build url
    #zero-padded year and month, e.g. "201108"
    dateurl <- format(curdate, "%Y%m")
 
    url <- paste("http://stats.grok.se/",
                 lang, '/',
                 dateurl, '/',
                 query,
                 sep = "")
    #get and parse a wikipedia statistics webpage
    wikitree <- xmlTreeParse(url, useInternalNodes = TRUE)
 
    #find nodes specifying traffic
    traffic <- xpathSApply(wikitree, "//a:li[@class='sent bar']/a:p",
                           xmlValue, namespaces = namespace)
 
    #convert obtained strings to numbers (sometimes they are in
    # the format of e.g. 7.5k meaning 7500)
    thousands <- grepl("k", traffic)
    traffic <- as.numeric(gsub("k", "", traffic))
    traffic[thousands] <- traffic[thousands] * 1000
 
    #it seems that there is some kind of a bug in wikipedia statistics
    # and the results are shifted by one day in month - this is a fix
    if(length(traffic) > noofdays) {
      traffic <- traffic[2:length(traffic)]
    }
    #create daily dates relating to traffic vector
    #and create a dataframe
    days <- seq_along(traffic)
    yearmon <- rep(paste(curdate$year + 1900,
                         curdate$mon + 1, sep = "-"),
                   length(traffic))
    date  <- as.Date(paste(yearmon, days, sep = "-"), "%Y-%m-%d")
    wikidata <- rbind(wikidata, data.frame(date, traffic))
  }
 
  #remove rows that are missing (due to the bug?)
  wikidata <- wikidata[!is.na(wikidata$date),]
 
  #return dataframe
  return(wikidata)
}
 
wikiPlotStat <- function(wikitraffic,
                         title = "Wikipedia statistics") {
  require(ggplot2)
 
  #create a plot (ggtitle() and theme() replace the older opts())
  wikiplot <- ggplot() + geom_bar(aes(x = date, y = traffic,
                                      fill = traffic),
                                  stat = "identity",
                                  data = wikitraffic) +
                                  ggtitle(title)
 
  #...with no legend and a theme that fits colours of my blog ;)
  wikiplot <- wikiplot + theme_bw() + theme(legend.position = "none")
 
  return(wikiplot)
}
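
The conversion of abbreviated counts such as "7.5k" is easy to get wrong, so it can help to test it on its own. A minimal sketch, assuming the page marks thousands with a trailing "k"; parseCount is a hypothetical helper name, not part of the functions above:

```r
# Hypothetical helper: turn strings like "7.5k" or "850" into numbers.
# Assumes thousands are marked by a trailing "k" (an assumption about
# the page format, not something the statistics site documents).
parseCount <- function(x) {
  x <- trimws(x)
  # remember which entries carry the "k" suffix
  thousands <- grepl("k$", x, ignore.case = TRUE)
  # strip the suffix and convert what is left
  value <- as.numeric(sub("k$", "", x, ignore.case = TRUE))
  # scale only the abbreviated entries
  value[thousands] <- value[thousands] * 1000
  value
}

parseCount(c("7.5k", "7k", "850", "12.25k"))
```

Multiplying by 1000 instead of editing the string keeps values such as "7k" and "12.25k" correct, whatever the number of decimal digits.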

With these two functions you can look at the search traffic for any article you wish. For instance, we can look at the search statistics for “Financial crisis”. The wikiStat() function returns a data frame with the necessary data:

#look 40 months back from now
critraffic <- wikiStat("Financial_crisis", monback = 40)
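
To see which monthly pages a call like this would request, the month stamps can be reproduced without any network access. A minimal sketch using only base R; monthTable is a hypothetical helper, not part of the functions above, and it works out the month boundaries with seq() on dates rather than with mondate:

```r
# Hypothetical helper: list the months visited when looking `monback`
# months back from `since`, newest first, with the number of days in each.
monthTable <- function(since = Sys.Date(), monback = 12) {
  # first day of the month containing `since`
  first <- as.Date(format(since, "%Y-%m-01"))
  # month starts going backwards in time
  starts <- seq(first, by = "-1 month", length.out = monback)
  # start of the month that follows each entry
  following <- seq(first, by = "1 month", length.out = 2)[2]
  ends <- c(following, starts[-monback])
  data.frame(stamp = format(starts, "%Y%m"),   # as used in the url
             days = as.numeric(ends - starts), # length of the month
             stringsAsFactors = FALSE)
}

monthTable(as.Date("2011-08-06"), monback = 3)
```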

To plot the data easily we can use the second function:

criplot <- wikiPlotStat(critraffic,
                "Wikipedia search traffic for 'Financial crisis'")
criplot

And this is the result:

You can clearly see the outbreak of the crisis in the second half of 2008, when Lehman Brothers collapsed. Since then, people still seem willing to learn about the crisis.
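
The peak can also be checked numerically rather than by eye. A minimal sketch, assuming a data frame with the date and traffic columns that wikiStat() returns; peakMonth is a hypothetical helper, and the demo values are synthetic, made up only to exercise the code:

```r
# Hypothetical helper: month with the highest total traffic.
peakMonth <- function(wikitraffic) {
  # label every row with its year and month, e.g. "2008-09"
  wikitraffic$month <- format(wikitraffic$date, "%Y-%m")
  # sum the daily traffic within each month
  monthly <- aggregate(traffic ~ month, data = wikitraffic, FUN = sum)
  as.character(monthly$month[which.max(monthly$traffic)])
}

# synthetic data, NOT real Wikipedia statistics
demo <- data.frame(date = seq(as.Date("2008-08-01"),
                              as.Date("2008-10-31"), by = "day"))
demo$traffic <- ifelse(format(demo$date, "%m") == "09", 300, 100)

peakMonth(demo)
```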

Do you have any suggestions?
