# Visualising Wikipedia search statistics with R

August 6, 2011

(This article was first published on Expansed » R, and kindly contributed to R-bloggers)

I have been playing with R to parse HTML. After reading about visualising “fantasy football” search traffic with RGoogleTrends at The Log Cabin blog, I decided to write a few functions to do similar things with Wikipedia search statistics.

This is what I have managed to come up with:

```
wikiStat <- function(query, lang = "en",
                     monback = 12,
                     since = Sys.Date()) {
  # load packages
  library(mondate)
  library(XML)

  namespace <- c("a" = "http://www.w3.org/1999/xhtml")
  wikidata <- data.frame()

  # iterate "monback" number of months back
  for (i in 1:monback) {
    # get the number of days in the given month
    curdate <- strptime(mondate(since) - (i - 1), "%Y-%m-%d")
    previous <- strptime(mondate(since) - (i - 2), "%Y-%m-%d")
    noofdays <- round(as.numeric(difftime(previous, curdate,
                                          units = "days")), 0)

    # build url (the month is zero-padded, e.g. 201108)
    dateurl <- format(curdate, "%Y%m")
    url <- paste("http://stats.grok.se/",
                 lang, "/",
                 dateurl, "/",
                 query,
                 sep = "")

    # get and parse a wikipedia statistics webpage
    wikitree <- xmlTreeParse(url, useInternalNodes = TRUE)

    # find nodes specifying traffic
    traffic <- xpathSApply(wikitree, "//a:li[@class='sent bar']/a:p",
                           xmlValue, namespaces = namespace)

    # edit obtained strings (sometimes they are in the format
    # of e.g. 7.5k meaning 7500); multiplying by 1000 also
    # handles values without a decimal point, such as 12k
    inthousands <- grepl("k", traffic)
    traffic <- as.numeric(gsub("k", "", traffic))
    traffic[inthousands] <- traffic[inthousands] * 1000

    # it seems that there is some kind of a bug in wikipedia statistics
    # and the results are shifted by one day in month - this is a fix
    if (length(traffic) > noofdays) {
      traffic <- traffic[2:length(traffic)]
    }

    # create daily dates relating to traffic vector
    # and create a dataframe
    days <- seq_along(traffic)
    yearmon <- rep(paste(curdate$year + 1900,
                         curdate$mon + 1, sep = "-"),
                   length(traffic))
    date <- as.Date(paste(yearmon, days, sep = "-"), "%Y-%m-%d")
    wikidata <- rbind(wikidata, data.frame(date, traffic))
  }

  # remove rows that are missing (due to the bug?)
  wikidata <- wikidata[!is.na(wikidata$date), ]

  # return dataframe
  return(wikidata)
}

wikiPlotStat <- function(wikitraffic,
                         title = "Wikipedia statistics") {
  library(ggplot2)

  # create a plot
  wikiplot <- ggplot() + geom_bar(aes(x = date, y = traffic,
                                      fill = traffic),
                                  stat = "identity",
                                  data = wikitraffic) +
    ggtitle(title)

  # ...with no legend and a theme that fits colours of my blog 😉
  wikiplot <- wikiplot + theme_bw() + theme(legend.position = "none")

  return(wikiplot)
}
```
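
The step that turns abbreviated counts such as 7.5k into numbers is easy to try on its own. Below is a minimal sketch in base R, assuming a trailing "k" is the only abbreviation the statistics page uses; the helper name `parseTraffic` is hypothetical and is not part of the functions above.

```r
# Hypothetical helper: convert strings such as "850" or "7.5k" to numbers.
# Assumption: a trailing "k" (thousands) is the only abbreviation in use.
parseTraffic <- function(x) {
  inthousands <- grepl("k$", x)           # entries given in thousands
  value <- as.numeric(sub("k$", "", x))   # drop the suffix, keep the decimal point
  value[inthousands] <- value[inthousands] * 1000
  value
}

parseTraffic(c("850", "7.5k", "12k"))
```

Multiplying by 1000 rather than editing the text means values with and without a decimal point are treated the same way.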

With these two functions you can take a look at search traffic for any article you wish. For instance, we can take a look at the search statistics for “Financial crisis”. The wikiStat() function returns a data frame with the necessary data, one row per day with the columns date and traffic:

```
# look 40 months back from now
critraffic <- wikiStat("Financial_crisis", monback = 40)
```
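
Because the result is an ordinary data frame, it can be summarised with base R before plotting. The sketch below uses a synthetic data frame with the same two columns (date and traffic) so it runs without a network connection; the values are made up and the names `fake` and `monthly` are only for illustration.

```r
# Synthetic stand-in for the result of wikiStat(); the numbers are made up.
dates <- seq(as.Date("2011-06-01"), as.Date("2011-07-31"), by = "day")
fake <- data.frame(date = dates,
                   traffic = rep(c(100, 200), length.out = length(dates)))

# Sum the daily traffic within each calendar month
fake$month <- format(fake$date, "%Y-%m")
monthly <- aggregate(traffic ~ month, data = fake, FUN = sum)
monthly
```

Monthly totals like these can make a long series, such as 40 months of daily counts, easier to scan for peaks.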

To plot the data easily we can use the second function:

```
criplot <- wikiPlotStat(critraffic,
                        "Wikipedia search traffic for 'Financial crisis'")
criplot
```

And this is the result: you can clearly see the outbreak of the crisis in the second half of 2008, when Lehman Brothers collapsed. Since then, people still seem willing to learn about the crisis.

Do you have any suggestions?
