R: A Quick Scrape of Top Grossing Films from boxofficemojo.com

January 13, 2012
(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

 

Introduction

I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and wondered what kind of graphs I could come up with if I had that data in R. I still don’t know what I’d construct beyond a simple barplot, but I figured I’d at least get the basics done and revisit this in the future if I feel motivated enough.

Objective

Scrape the information available on http://boxofficemojo.com/alltime/world into R and make a simple barplot.

Solution

This is probably one of the easier scraping challenges. The function readHTMLTable() from the XML package does all the hard work. We just pass it the URL of the page we’re interested in, and it returns every table on that page as a list of data.frames. We then pick the data.frame we want (here, the third one). Here’s a single wrapper function:

box_office_mojo_top <- function(num.pages) {
  # load required packages (library() stops with an error if XML is not installed)
  library(XML)

  # local helper functions
  get_table <- function(u) {
    # readHTMLTable() returns every table on the page; the third one holds the rankings
    tbl <- readHTMLTable(u)[[3]]
    names(tbl) <- c("Rank", "Title", "Studio", "Worldwide.Gross", "Domestic.Gross", "Domestic.pct", "Overseas.Gross", "Overseas.pct", "Year")
    # drop the first row and store every column as character strings
    df <- as.data.frame(lapply(tbl[-1, ], as.character), stringsAsFactors=FALSE)
    return(df)
  }
  clean_df <- function(df) {
    # strip currency, percent, thousands-separator and caret symbols;
    # fixed = TRUE matches them literally rather than as regular expressions
    clean <- function(col) {
      col <- gsub("$", "", col, fixed = TRUE)
      col <- gsub("%", "", col, fixed = TRUE)
      col <- gsub(",", "", col, fixed = TRUE)
      col <- gsub("^", "", col, fixed = TRUE)
      return(col)
    }

    # assigning into df[] keeps the data.frame structure, even for a single row
    df[] <- lapply(df, clean)
    return(df)
  }

  # Main
  # Step 1: construct URLs
  urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")

  # Step 2: scrape website
  df <- do.call("rbind", lapply(urls, get_table))

  # Step 3: clean dataframe
  df <- clean_df(df)

  # Step 4: set column types (every column except Title and Studio is numeric)
  s <- c(1, 4:9)
  df[s] <- lapply(df[s], as.numeric)
  df$Studio <- as.factor(df$Studio)

  # Step 5: return dataframe
  return(df)
}
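
The steps of the wrapper that don't touch the network can be tried in isolation. What follows is a minimal sketch in base R, not part of the original function: the `raw` data frame and the `strip_symbols()` helper are hypothetical stand-ins, and the dollar signs, commas, percent signs and trailing caret are assumed formatting for what readHTMLTable() might hand back. The figures themselves are borrowed from the output shown in this post.

```r
# Step 1: paste() is vectorised, so a single call builds one URL per page
num.pages <- 3
urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")
print(urls)

# Hypothetical raw values, still text, before cleaning (formatting is assumed)
raw <- data.frame(
  Rank            = c("1", "2"),
  Title           = c("Avatar", "Titanic"),
  Worldwide.Gross = c("$2,782.3", "$1,843.2"),
  Domestic.pct    = c("27.3%", "32.6%"),
  Year            = c("2009^", "1997"),
  stringsAsFactors = FALSE
)

# Step 3: fixed = TRUE treats "$" and "^" as literal characters;
# read as regular expressions they are anchors and remove nothing
strip_symbols <- function(col) {
  for (sym in c("$", "%", ",", "^")) {
    col <- gsub(sym, "", col, fixed = TRUE)
  }
  col
}
cleaned <- raw
cleaned[] <- lapply(raw, strip_symbols)

# Step 4: convert every column except Title to numeric
num_cols <- setdiff(names(cleaned), "Title")
cleaned[num_cols] <- lapply(cleaned[num_cols], as.numeric)
str(cleaned)
```

Assigning with `cleaned[] <- lapply(...)` keeps the result a data frame whatever the number of rows, and any value that fails to convert in Step 4 shows up as NA with a warning, which is a useful hint that a symbol was missed.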

We use the function as follows:

num.pages <- 5
df <- box_office_mojo_top(num.pages)

head(df)
#   Rank                                         Title Studio Worldwide.Gross Domestic.Gross Domestic.pct Overseas.Gross Overseas.pct Year
# 1    1                                        Avatar    Fox          2782.3          760.5         27.3         2021.8         72.7 2009
# 2    2                                       Titanic   Par.          1843.2          600.8         32.6         1242.4         67.4 1997
# 3    3   Harry Potter and the Deathly Hallows Part 2     WB          1328.1          381.0         28.7          947.1         71.3 2011
# 4    4                Transformers: Dark of the Moon   P/DW          1123.7          352.4         31.4          771.4         68.6 2011
# 5    5 The Lord of the Rings: The Return of the King     NL          1119.9          377.8         33.7          742.1         66.3 2003
# 6    6    Pirates of the Caribbean: Dead Man's Chest     BV          1066.2          423.3         39.7          642.9         60.3 2006

str(df)
# 'data.frame': 475 obs. of 9 variables:
# $ Rank           : num 1 2 3 4 5 6 7 8 9 10 ...
# $ Title          : chr "Avatar" "Titanic" "Harry Potter and the Deathly Hallows Part 2" "Transformers: Dark of the Moon" ...
# $ Studio         : Factor w/ 35 levels "Art.","BV","Col.",..: 7 20 33 19 16 2 2 2 2 33 ...
# $ Worldwide.Gross: num 2782 1843 1328 1124 1120 ...
# $ Domestic.Gross : num 760 601 381 352 378 ...
# $ Domestic.pct   : num 27.3 32.6 28.7 31.4 33.7 39.7 39 23.1 32.6 53.2 ...
# $ Overseas.Gross : num 2022 1242 947 771 742 ...
# $ Overseas.pct   : num 72.7 67.4 71.3 68.6 66.3 60.3 61 76.9 67.4 46.8 ...
# $ Year           : num 2009 1997 2011 2011 2003 ...
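
One quick way to check that the scrape and the type conversion worked is to confirm that the parts add up. The sketch below uses the six rows printed by head(df), typed in by hand as a hypothetical `top6` data frame so that it runs without the website. The 0.2 tolerance is an assumption to allow for rounding in the published figures.

```r
# The first six rows from the head(df) output, entered by hand
top6 <- data.frame(
  Worldwide.Gross = c(2782.3, 1843.2, 1328.1, 1123.7, 1119.9, 1066.2),
  Domestic.Gross  = c(760.5, 600.8, 381.0, 352.4, 377.8, 423.3),
  Domestic.pct    = c(27.3, 32.6, 28.7, 31.4, 33.7, 39.7),
  Overseas.Gross  = c(2021.8, 1242.4, 947.1, 771.4, 742.1, 642.9),
  Overseas.pct    = c(72.7, 67.4, 71.3, 68.6, 66.3, 60.3)
)

# Domestic plus overseas gross should match the worldwide gross
gross_gap <- abs(top6$Domestic.Gross + top6$Overseas.Gross - top6$Worldwide.Gross)

# The two percentage columns should sum to 100
pct_gap <- abs(top6$Domestic.pct + top6$Overseas.pct - 100)

# Both checks allow a small rounding error (tolerance is an assumption)
print(all(gross_gap < 0.2))
print(all(pct_gap < 0.2))
```

The same two expressions can be run on the full scraped result by swapping `top6` for `df`.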

We can even do a simple barplot of the top 50 films by worldwide gross (in millions):


library(ggplot2)

# keep the 50 highest-ranked films
df2 <- subset(df, Rank <= 50)

# reorder() sorts the bars by gross; stat = "identity" plots the values as given
ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 0),
        axis.text.y = element_text(angle = 0)) +
  coord_flip() +
  ylab("Worldwide Gross (USD $ millions)") +
  xlab("Title") +
  ggtitle("TOP 50 FILMS BY WORLDWIDE GROSS")

