The Star Wars Grossing War

[This article was first published on Daniel's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


I could finally made to the movies for watching the new Star Wars release this weekend. Although this episode wasn’t that spectacular, in my view, it did inspire some data seeking afterwards. I wanted to know how this film compares to others top movies in terms of worldwide grossing as well as within the Star Wars series.

Fortunately, there is a wealth–though incomplete–list of the top grossing films of all time at . Although the information is right in the front-page, I’d rather like something more visual teasing. So, I decided to see how it goes with *R* and the new *ggplot2* package release. Also, because I must scrap the data from the *Box Office* website, I will need a function to handle the HTML structure of those tables. The function `readHTMLTable()` from the *XML* package can certainly be an asset here.

The setup

First, let’s load the packages we’ll need.


In what follows is my setup for using the readHTMLTable function to retrieve, cleaning, and arrange HTML tables in a data.frame format. I’d rather wrap everything in a single function, but keeping the three snippets apart is rather easy to make out.

The first function will pull out all tables on the webpage as a list of data.frames, and I’ll give them similar names.

GetTable <- function(t) {
table <- readHTMLTable(t)[[2]]
names(table) <- c("Rank", "Title", "Studio", "Worldwide", "Domestic", "DomesticPct", "Overseas", "OverseasPct", "Year")
boxdf <-[-1, ], as.character), stringsAsFactors=FALSE)
boxdf <-, stringsAsFactors=FALSE)
boxdf <- transform(boxdf, Year = ifelse(Year==0, NA, Year))

The data will come dirty with lots of tags and marks, so a little janitor work will be necessary. The following code does just that.

CleanDataFrame <- function(boxdf) {
clean <- function(col) {
col <- gsub("$", "", col, fixed = TRUE)
col <- gsub("%", "", col, fixed = TRUE)
col <- gsub(",", "", col, fixed = TRUE)
col <- gsub("^", "", col, fixed = TRUE)
boxdf <- sapply(boxdf, clean)
boxdf <-, stringsAsFactors=FALSE)

The next snippet is the main piece. It will construct the URLs based on the number of pages we feed in and will call the two preceding functions.

BoxOfficeMojoScraper <- function(npages) {
# This line constructs the URLs
urls <- paste("", 1:npages, "&p=.htm", sep = "")
# The next line scrapes every table in the URLs formed
boxdf <-"rbind", lapply(urls, GetTable))
# This does the janitor work
boxdf <- CleanDataFrame(boxdf)
# The next lines arrange the data to my needs
cols <- c(1, 4:9)
boxdf[, cols] <- sapply(boxdf[, cols], as.numeric)
boxdf$Studio <- as.factor(boxdf$Studio)

I’m scrapping the first 7 pages of the target address It will bring missing values too, don’t worry for the time being.

npages <- 7
box <- BoxOfficeMojoScraper(npages)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion


Our new acquired data is a data.frame with more than 620 rows or films, with the oldest dating back to 1939.

## 'data.frame':	628 obs. of  9 variables:
##  $ Rank       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title      : chr  "Avatar" "Titanic" "Jurassic World" "Star Wars: The Force Awakens" ...
##  $ Studio     : Factor w/ 38 levels "Art.","BV","Col.",..: 7 22 32 2 2 32 2 36 2 2 ...
##  $ Worldwide  : num  2788 2187 1669 1602 1520 ...
##  $ Domestic   : num  760 659 652 781 623 ...
##  $ DomesticPct: num  27.3 30.1 39.1 48.7 41 23.3 32.7 28.4 31.4 33.7 ...
##  $ Overseas   : num  2028 1528 1017 821 896 ...
##  $ OverseasPct: num  72.7 69.9 60.9 51.3 59 76.7 67.3 71.6 68.6 66.3 ...
##  $ Year       : num  2009 1997 2015 2015 2012 ...
# the oldest:
min(box$Year, na.rm=TRUE)
## [1] 1939

The following chart displays the grossing worldwide values for the top 25 ranked movies of all time. As it shines out, the Star Wars: The Force Awakens is doing pretty well worldwide. It’s ranked fourth now, but it just began to play in China this week, so it may unseat Titanic over the next weeks, and Avatar in the long run.


If you want to reproduce the very same plot decoration of this post, you’ll have to install the development version of SciencesPo package, and add + theme_538(legend="top") to the following code.

box2 <- subset(box, Rank<=25)

ggplot(box2) +
geom_bar(aes(x=reorder(Title, Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") +
geom_bar(aes(x=Title, y=Domestic, fill="Domestic"),alpha=.5, stat = "identity") +
scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) +
coord_flip() + 
 labs(x=NULL, y=NULL, title="Top 25 Films by Worldwide Grosses (US$ Millions)")

Next, how The Force Awakens compares with other episodes of Star Wars?


# will search for Star Wars names within the data.frame:
box3 <- subset(box, grepl("^Star Wars", box$Title)|grepl("^Return of the Jedi", box$Title)|grepl("^The Empire Strikes Back", box$Title))

# will wrap the axis labels
wrap_20 <- function(x)gsub('(.{1,20})(\s|$)', '\1n', x)

ggplot(box3) +
geom_bar(aes(x=reorder(wrap_20(Title), Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") +
geom_bar(aes(x=wrap_20(Title), y=Domestic, fill="Domestic"), alpha=.5, stat = "identity") +
scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) +
coord_flip() + 
labs(x=NULL, y=NULL, title="Star Wars Grosses (US$ Millions)")

The new release of Star Wars by Disney is making its way to the top films of all-time. Although the values are not in current currency, old films in the list may have grosses based on multiple releases, and that the The Force Awakens just began its journey.

To leave a comment for the author, please follow the link and comment on their blog: Daniel's Blog. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)