Notes on a Scandal – When Jimmy beat Katy

[This article was first published on PremierSoccerStats » R, and kindly contributed to R-bloggers.]

No, the title doesn’t refer to Katy Perry suffering from another of Jimmy Savile’s sexual predilections, although these are two of the participants. I’ll get to the details later.

Just over a year ago, I reflected on the relative wiki searches of leading female singing celebrities, including Ms Perry. In the light of the recent Jimmy Savile scandal, I thought to revisit the area.

For the first post, I relied on code from a now-defunct web site and had not examined the raw data. It now appears that the wiki statistics are no longer provided in the same way. The good news is that there is a web page offering the daily searches for each month in JSON format, which actually simplifies matters.
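As a rough sketch of why the JSON format simplifies matters, the snippet below parses one month of data into a data frame. The JSON string is a hypothetical stand-in for a real response: the `daily_views` field name is taken from the code at the bottom of the page, but the dates and counts are invented for illustration.

```r
library(RJSONIO)

# Hypothetical response for one person and one month; the counts are invented
sample_json <- '{"daily_views": {"2012-10-01": 120, "2012-10-02": 95, "2012-10-03": 4310}}'

rd <- fromJSON(sample_json)
views <- unlist(rd$daily_views)   # named numeric vector; the names are the dates

# one row per day, sorted by date
daily <- data.frame(date = as.Date(names(views), "%Y-%m-%d"),
                    count = as.numeric(views))
daily <- daily[order(daily$date), ]
print(daily)
```

Each monthly response can be turned into rows like these and stacked with `rbind`, which is essentially what the full function does for every person, year and month.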

For this exercise, I have produced a function which collects and tabulates data for a set of people, produces graphs of their individual daily counts from the beginning of 2008 onwards and creates a group graph within a specified date range. The code is shown at the bottom of the page.

Here is some of the output for some of the people mentioned during the scandal coverage.

Savile, naturally, leads the way, with ex-glam rock star Gary Glitter following. This probably reflects his generally greater fame and the severity of the allegations against him compared with DJ Dave Lee Travis and dead actor Wilfrid Brambell.

Now for the summary table. The difference between the median and the mean reflects a pattern of steady daily searches punctuated by leaps when publicity occurs.
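To illustrate the gap between mean and median, here is a small sketch with invented counts: a month of steady interest with a single publicity spike. The numbers are made up and are not taken from the data in this post.

```r
# Invented daily counts: 29 quiet days and one day of heavy publicity
counts <- c(rep(100, 29), 30000)

mean(counts)    # dragged upwards by the single spike
median(counts)  # stays at the everyday level
max(counts)     # the spike itself
```

The median describes a typical day, while the mean is dominated by the handful of days when a story breaks.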

Interestingly, the scandal has not produced the maximum search count for any of the four.

  • Dave Lee Travis peaked when Burmese pro-democracy leader Aung San Suu Kyi said his World Service programme had given her a lifeline
  • Searches for Savile over the course of the scandal are significant, but his death sparked the highest single-day count
  • A TV show, detailing a feud between Brambell and his co-star of “Steptoe and Son”, Harry H Corbett, led to the former’s highest search on Wikipedia

Glitter’s graph shows several peaks before this month. In chronological order, these represent: his release from a Vietnamese jail and attempt to avoid returning to the UK; the mockumentary “The Execution of Gary Glitter”, shown on Channel 4; and incorrect rumours that he was planning a new tour.

So how did Jimmy beat Katy? With a maximum daily search count almost double her highest of 101,922.

# Packages required
library(RJSONIO) # acquiring and parsing data
library(ggplot2) # graphs
library(plyr) # creation of summary data
# create dataframes for all and summary data
allData <- data.frame(count=numeric(),date=character(),name=character())
summaryData <- data.frame(name=character(),mean=numeric(),median=numeric(),max=numeric(),maxdate=character()) #maxdate=date() causes error
# create variables for url
month <- c("01","02","03","04","05","06","07","08","09","10","11","12")
year <- c(2008:2012)
# function with default dates for comparison graph
wikiFun <- function(person, startDate="2012-09-01",endDate="2012-11-01") {
  for(k in seq_along(person)) {
    # create dataframe for individual records
    df <- data.frame(count=numeric())
    for (i in seq_along(year)) {
      for (j in seq_along(month)) {
        # NOTE: the base URL of the statistics service is missing from this copy
        # of the post (the empty string below) and must be supplied
        url <- paste0("",year[i],month[j],"/",person[k])
        raw <- readLines(url, warn=FALSE) # fetch the JSON for one month
        rd  <- fromJSON(paste(raw, collapse=""))
        rd.views <- rd$daily_views # daily counts, named by date
        df <- rbind(df, data.frame(count=unlist(rd.views)))
      }
    }
    # create a df with all peoples search counts by day
    df$date <-  as.Date(rownames(df), "%Y-%m-%d")
    df$name <- person[k]
    colnames(df) <- c("count","date","name")
    df <- arrange(df,date)
    allData <- rbind(allData,df)
    # set title display and save individual's graph
    theTitle <- paste0("Daily Wikipedia searches for ",person[k])
    q <- ggplot(subset(df,count>0),aes(x=date,y=count))+geom_point()+xlab("")+ylab("")+ggtitle(theTitle)
    print(q) # individual plot prints to screen
    fname <- paste0("ws_",gsub(" ","",person[k]),".png")
    ggsave(fname, q)
  }
  # display and save group graph using log scale for counts
  p <- ggplot(subset(allData,count>0&date>=as.Date(startDate, "%Y-%m-%d")&date<=as.Date(endDate, "%Y-%m-%d")),aes(x=date,y=count, colour=name))+geom_line()+xlab("")+ylab("")+ggtitle("Comparison of Daily Wikipedia searches")  + coord_trans(y="log2") #+scale_y_continuous(formatter=comma) caused error
  print(p)
  ggsave("ws_group.png", p) # file name is an assumption; the original is not shown
  # calculate summaries , display and save
  summaryData <- ddply(subset(allData,count>0),.(name), summarize, mean=mean(count), median=median(count), max=max(count), max_date=date[which.max(count)] )
  print(summaryData)
  write.csv(summaryData, "ws_summary.csv", row.names=FALSE) # file name is an assumption
}
# people to compare
names <- c("Gary Glitter","Jimmy Savile","Dave Lee Travis","Wilfrid Brambell")
wikiFun(names)
