The Star Wars Grossing War

January 10, 2016
By

(This article was first published on Daniel's Blog, and kindly contributed to R-bloggers)

Motivation

I could finally made to the movies for watching the new Star Wars release this weekend. Although this episode wasn’t that spectacular, in my view, it did inspire some data seeking afterwards. I wanted to know how this film compares to others top movies in terms of worldwide grossing as well as within the Star Wars series.

Fortunately, there is a wealth–though incomplete–list of the top grossing films of all time at . Although the information is right in the front-page, I’d rather like something more visual teasing. So, I decided to see how it goes with *R* and the new *ggplot2* package release. Also, because I must scrap the data from the *Box Office* website, I will need a function to handle the HTML structure of those tables. The function `readHTMLTable()` from the *XML* package can certainly be an asset here.

The setup

First, let’s load the packages we’ll need.

library(XML)
library(ggplot2)

In what follows is my setup for using the readHTMLTable function to retrieve, cleaning, and arrange HTML tables in a data.frame format. I’d rather wrap everything in a single function, but keeping the three snippets apart is rather easy to make out.

The first function will pull out all tables on the webpage as a list of data.frames, and I’ll give them similar names.

GetTable <- function(t) {
table <- readHTMLTable(t)[[2]]
names(table) <- c("Rank", "Title", "Studio", "Worldwide", "Domestic", "DomesticPct", "Overseas", "OverseasPct", "Year")
boxdf <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors=FALSE)
boxdf <- as.data.frame(boxdf, stringsAsFactors=FALSE)
boxdf <- transform(boxdf, Year = ifelse(Year==0, NA, Year))
return(boxdf)
}

The data will come dirty with lots of tags and marks, so a little janitor work will be necessary. The following code does just that.

CleanDataFrame <- function(boxdf) {
clean <- function(col) {
col <- gsub("$", "", col, fixed = TRUE)
col <- gsub("%", "", col, fixed = TRUE)
col <- gsub(",", "", col, fixed = TRUE)
col <- gsub("^", "", col, fixed = TRUE)
return(col)
}
boxdf <- sapply(boxdf, clean)
boxdf <- as.data.frame(boxdf, stringsAsFactors=FALSE)
return(boxdf)
}

The next snippet is the main piece. It will construct the URLs based on the number of pages we feed in and will call the two preceding functions.

BoxOfficeMojoScraper <- function(npages) {
# This line constructs the URLs
urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:npages, "&p=.htm", sep = "")
# The next line scrapes every table in the URLs formed
boxdf <- do.call("rbind", lapply(urls, GetTable))
# This does the janitor work
boxdf <- CleanDataFrame(boxdf)
# The next lines arrange the data to my needs
cols <- c(1, 4:9)
boxdf[, cols] <- sapply(boxdf[, cols], as.numeric)
boxdf$Studio <- as.factor(boxdf$Studio)
return(boxdf)
}

I’m scrapping the first 7 pages of the target address http://www.boxofficemojo.com/alltime/world/. It will bring missing values too, don’t worry for the time being.

npages <- 7
box <- BoxOfficeMojoScraper(npages)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

Results

Our new acquired data is a data.frame with more than 620 rows or films, with the oldest dating back to 1939.

str(box)
## 'data.frame':	628 obs. of  9 variables:
##  $ Rank       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title      : chr  "Avatar" "Titanic" "Jurassic World" "Star Wars: The Force Awakens" ...
##  $ Studio     : Factor w/ 38 levels "Art.","BV","Col.",..: 7 22 32 2 2 32 2 36 2 2 ...
##  $ Worldwide  : num  2788 2187 1669 1602 1520 ...
##  $ Domestic   : num  760 659 652 781 623 ...
##  $ DomesticPct: num  27.3 30.1 39.1 48.7 41 23.3 32.7 28.4 31.4 33.7 ...
##  $ Overseas   : num  2028 1528 1017 821 896 ...
##  $ OverseasPct: num  72.7 69.9 60.9 51.3 59 76.7 67.3 71.6 68.6 66.3 ...
##  $ Year       : num  2009 1997 2015 2015 2012 ...
# the oldest:
min(box$Year, na.rm=TRUE)
## [1] 1939

The following chart displays the grossing worldwide values for the top 25 ranked movies of all time. As it shines out, the Star Wars: The Force Awakens is doing pretty well worldwide. It’s ranked fourth now, but it just began to play in China this week, so it may unseat Titanic over the next weeks, and Avatar in the long run.

center

If you want to reproduce the very same plot decoration of this post, you’ll have to install the development version of SciencesPo package, and add + theme_538(legend="top") to the following code.

box2 <- subset(box, Rank<=25)

ggplot(box2) +
geom_bar(aes(x=reorder(Title, Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") +
geom_bar(aes(x=Title, y=Domestic, fill="Domestic"),alpha=.5, stat = "identity") +
scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) +
coord_flip() + 
 labs(x=NULL, y=NULL, title="Top 25 Films by Worldwide Grosses (US$ Millions)")

Next, how The Force Awakens compares with other episodes of Star Wars?

center

# will search for Star Wars names within the data.frame:
box3 <- subset(box, grepl("^Star Wars", box$Title)|grepl("^Return of the Jedi", box$Title)|grepl("^The Empire Strikes Back", box$Title))

# will wrap the axis labels
wrap_20 <- function(x)gsub('(.{1,20})(\s|$)', '\1n', x)

ggplot(box3) +
geom_bar(aes(x=reorder(wrap_20(Title), Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") +
geom_bar(aes(x=wrap_20(Title), y=Domestic, fill="Domestic"), alpha=.5, stat = "identity") +
scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) +
coord_flip() + 
labs(x=NULL, y=NULL, title="Star Wars Grosses (US$ Millions)")

The new release of Star Wars by Disney is making its way to the top films of all-time. Although the values are not in current currency, old films in the list may have grosses based on multiple releases, and that the The Force Awakens just began its journey.

To leave a comment for the author, please follow the link and comment on their blog: Daniel's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Mango solutions



RStudio homepage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



http://www.eoda.de







ODSC

ODSC

CRC R books series











Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)