Wikipedia page views

January 16, 2013

(This article was first published on Quantifying Memory, and kindly contributed to R-bloggers)

Here I present an application that quantifies Wikipedia page views. It can visualise any topic in any language. It is (shamelessly) based on an application by the blogger Andrew Clark (pssguy), whose code is available here.

I have added:

  • multi-language support
  • a moving average option
  • a regression option (on this see below)
The regression functions in particular are tailored to analysing public interest in memory events, but there is no reason why they could not have other uses. The number of blog posts about a topic is sometimes taken as an indicator of public engagement or even opinion; page-view data of this kind might provide a corrective to these efforts, as it is less open to overt manipulation.
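
The moving average option smooths daily counts so that one-off spikes do not dominate the plot. The application's own code is not reproduced in this post; what follows is a minimal Python sketch of a trailing moving average, in which the function name, the window length and the sample counts are all assumptions made for illustration.

```python
def moving_average(views, window=7):
    """Trailing moving average over a list of daily view counts.

    The first window - 1 positions have no full window behind them,
    so they are returned as None rather than being padded.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_total = 0.0
    for i, count in enumerate(views):
        running_total += count
        if i >= window:
            # Drop the value that has just fallen out of the window.
            running_total -= views[i - window]
        if i >= window - 1:
            smoothed.append(running_total / window)
        else:
            smoothed.append(None)
    return smoothed


# Hypothetical daily counts, not real Wikipedia data.
daily_views = [10, 20, 30, 40, 50]
print(moving_average(daily_views, window=3))
# [None, None, 20.0, 30.0, 40.0]
```

A longer window gives a smoother line at the cost of reacting more slowly to genuine changes in interest.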

To use it, paste the last part of the URL of the Wikipedia page you want to examine into the box below. For instance, to see the English-language page about Tottenham Hotspur, enter:
identifier: Tottenham_Hotspur_F.C.
language: en
(url: http://en.wikipedia.org/wiki/Tottenham_Hotspur_F.C.)
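
The two inputs can be read straight off a page's address. As a rough illustration, the Python sketch below splits a Wikipedia URL into the language code and the identifier; the function name is hypothetical, and it assumes the standard `<language>.wikipedia.org/wiki/<identifier>` layout seen in the example above.

```python
from urllib.parse import urlparse


def split_wikipedia_url(url):
    """Return (language, identifier) for a Wikipedia article URL.

    The identifier is returned exactly as it appears in the URL.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    suffix = ".wikipedia.org"
    if not host.endswith(suffix):
        raise ValueError("not a Wikipedia URL: " + url)
    language = host[: -len(suffix)]
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        raise ValueError("expected a /wiki/ article path: " + url)
    identifier = parsed.path[len(prefix):]
    return language, identifier


print(split_wikipedia_url("http://en.wikipedia.org/wiki/Tottenham_Hotspur_F.C."))
# ('en', 'Tottenham_Hotspur_F.C.')
```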

I think the application may yield interesting findings, though of course also flippant ones, such as the data below, which show that Germans read about Tottenham more than Norwegians read about Putin.

For anyone feeling restricted by the iframe above, the application may also be accessed here:

The regression option is similar to one I use in my studies: it controls for anniversaries and other events, such as elections. The model controls for an increase in Wikipedia searches over time, for any date specified by the user, and for Russian elections.
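
To make the idea of "controlling for" a date concrete, here is a minimal Python sketch of an ordinary least squares fit with a time trend and a single event dummy. It is not the model used in the application: the helper names, the choice of predictors and the data are all invented for illustration, and the normal equations are solved by hand only to keep the example free of external libraries.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(rows, y):
    """Least squares coefficients [intercept, b1, b2, ...].

    rows holds one list of predictor values per observation,
    without the intercept column, which is added here.
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: a baseline of 2, a trend of 3 views per day,
# and 10 extra views on day 3 (the hypothetical event date).
days = range(6)
predictors = [[t, 1 if t == 3 else 0] for t in days]
views = [2 + 3 * t + 10 * d for t, d in predictors]
intercept, trend, event_effect = ols(predictors, views)
print(round(intercept, 6), round(trend, 6), round(event_effect, 6))
```

The coefficient on the dummy is the estimated extra interest on the event date once the trend has been accounted for.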

The calculated weights are overlaid on the observed values. This allows the user to check that the dates entered are accurate, to visualise the calculated trajectory, and to identify ‘unexpected’ spikes in coverage.
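
In the application the ‘unexpected’ spikes are identified by eye from the overlay. One way the same comparison might be automated is sketched below in Python: flag the days on which the observed count exceeds the fitted value by more than a chosen multiple of the residual spread. The function name, the threshold of two standard deviations and the numbers are assumptions, not part of the application.

```python
from statistics import pstdev


def unexpected_spikes(observed, fitted, threshold=2.0):
    """Indices where observed exceeds fitted by more than
    threshold standard deviations of the residuals."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    spread = pstdev(residuals)
    if spread == 0:
        # A perfect fit leaves nothing unexplained.
        return []
    return [i for i, r in enumerate(residuals) if r > threshold * spread]


# Hypothetical values: the fifth day is far above what the model expects.
observed = [100, 102, 98, 101, 300, 99]
fitted = [100, 100, 100, 100, 100, 100]
print(unexpected_spikes(observed, fitted))
# [4]
```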

The regression coefficients are displayed below the graph, to establish the degree of variation explained, as well as statistical significance.
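
The usual summary of the degree of variation explained is R squared, the share of the variance in the observed values that the fitted values account for. Whether the application reports exactly this statistic is an assumption; the Python sketch below only shows how it would be computed from observed and fitted values, with a hypothetical function name and made-up numbers. It does not cover statistical significance.

```python
def r_squared(observed, fitted):
    """Share of the variance in observed explained by fitted."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    if ss_total == 0:
        raise ValueError("observed values are constant; R squared is undefined")
    return 1.0 - ss_residual / ss_total


observed = [1, 2, 3, 4]
print(r_squared(observed, [1, 2, 3, 4]))          # perfect fit: 1.0
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))  # mean only: 0.0
```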

It should be noted that Russian Wikipedia usage in particular has grown steadily, meaning almost any topic will reveal an increase over time. For this reason I would limit Russian-language searches to the last three or four years, during which usage appears to have stabilised somewhat.

variables:

  • a1: the anniversary date entered. The number of views on this day every year is controlled for
  • elections1: the date of Russian elections (since 1999)
  • elections2: an inverted quadratic expression measuring the distance in days from the closest election
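
The three variables above could be built for each day roughly as in the Python sketch below. The exact form of the inverted quadratic is not given in this post, so the negated squared distance used here is only one possible reading. The function name, the anniversary date and the two election dates (the December 2011 parliamentary and March 2012 presidential votes) are illustrative choices, not the application's own list.

```python
from datetime import date


def build_regressors(day, anniversary, election_dates):
    """Return the a1, elections1 and elections2 values for one day."""
    # a1: 1 on the anniversary's month and day in any year, else 0.
    a1 = 1 if (day.month, day.day) == (anniversary.month, anniversary.day) else 0
    # elections1: 1 if the day is itself an election date, else 0.
    elections1 = 1 if day in election_dates else 0
    # elections2: ASSUMED form. The value is 0 on an election day and
    # falls off with the square of the distance to the closest election.
    nearest = min(abs((day - e).days) for e in election_dates)
    elections2 = -(nearest ** 2)
    return {"a1": a1, "elections1": elections1, "elections2": elections2}


# Illustrative inputs only.
elections = [date(2011, 12, 4), date(2012, 3, 4)]
anniversary = date(2000, 5, 9)
print(build_regressors(date(2012, 5, 9), anniversary, elections))
# {'a1': 1, 'elections1': 0, 'elections2': -4356}
```

Here 9 May 2012 falls 66 days after the closer of the two elections, hence the value of -4356 for elections2.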

to-do:

  • Control for a general increase in Wikipedia searches
  • make Russian elections in regression model optional/substitutable
  • Find a way to automatically overlay Yandex’s blogsearch results; this will control for deviations in how people write and read about a topic
