Visualizing box office revenue by genre

After watching Justice League at the cinema, I was impressed by the quality of the special effects. I started wondering: how much does a movie like that cost? And, most importantly, how big is the box-office revenue for this kind of blockbuster? I found an answer on The Numbers, and decided to build a database from the data available on that website. I retrieved the 500 biggest movie budgets. Initially, I had a database with just 5 variables per movie:
• the release date
• the name
• the production budget
• the domestic gross
• the worldwide gross
After that, I combined sources to get more variables. Data was scraped from Wikipedia and IMDb. The final dataset has 30 variables, such as lists of actors, poster URLs, distributors, the IMDb rating and number of raters, etc.
You can find a complete description of the dataset on GitHub. All the data was scraped with the rvest package.
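To give an idea of what scraping a table with rvest looks like, here is a minimal sketch. It does not hit any real website: the inline HTML page, the table id `budgets` and the two movies are made up for illustration, and the real pages are structured differently.

```r
library(rvest)

# Hypothetical inline page standing in for a real budget table
# (ids, column names and values are invented for this example)
page <- minimal_html('
  <table id="budgets">
    <tr><th>Movie</th><th>Production Budget</th></tr>
    <tr><td>Movie A</td><td>$300,000,000</td></tr>
    <tr><td>Movie B</td><td>$150,000,000</td></tr>
  </table>')

# Select the table by its id and convert it to a data frame
budgets <- page %>%
  html_element("#budgets") %>%
  html_table()

budgets
```

With a real page you would replace `minimal_html()` by `read_html()` on the page URL and adapt the CSS selector.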

In this post, I describe the different steps leading to the treemap:


First of all we read the data.

db = read.csv("",
              stringsAsFactors = FALSE)
# You can run str(db) to get more information about the variable types.

Then we convert the money-related variables to numeric and the movie release dates to dates, using the tidyverse.


library(tidyverse)

db = db %>%
        mutate( Release.Date = as.Date(Release.Date, "%m/%d/%Y"), 
                Running.time = as.numeric(stringr::str_sub(Running.time,1,3)),
                Rate = as.numeric(Rate),
                Raters = as.numeric(gsub(",", "", Raters)),
                # strip "$" and "," before converting the money columns
                Production.Budget = as.numeric(gsub("[,$]", "",
                                                   Production.Budget)),
                Domestic.Gross = as.numeric(gsub("[,$]", "",
                                                Domestic.Gross)),
                Worldwide.Gross = as.numeric(gsub("[,$]", "",
                                                 Worldwide.Gross)) )
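To make the two conversions concrete, here is a small base R sketch on made-up values (the strings below are examples, not rows from the dataset). It shows what the `"[,$]"` pattern removes and how the `"%m/%d/%Y"` format is read.

```r
# Made-up raw values in the same shape as the scraped columns
raw_budget <- c("$300,000,000", "$1,500,000")
raw_date   <- "11/17/2017"

# "[,$]" is a character class: it matches either a comma or a dollar sign
budget <- as.numeric(gsub("[,$]", "", raw_budget))

# %m = month, %d = day, %Y = four-digit year
release <- as.Date(raw_date, "%m/%d/%Y")

budget
release
```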

The dataset looks better. As you saw at the top of this post, we want to design a treemap chart to visualize box-office revenue by genre. Let’s see how many movie genres are present in the data frame:

UniqueGenres = unique(db$Genres)
length(UniqueGenres)
## [1] 224
head(UniqueGenres, 5)
## [1] "Action Adventure Fantasy Sci-Fi"                  
## [2] "Action Adventure Sci-Fi"                          
## [3] "Action Crime Thriller"                            
## [4] "Adventure Drama Fantasy Mystery"                  
## [5] "Animation Adventure Comedy Family Fantasy Musical"

There are 224 combinations of genres, which is far too many. We need to reduce them so that each movie has at most 2 genres: a main genre and a subgenre.


Let’s start with a simple barplot to visualize the most-represented genre from the 224 combinations.


all_genres = separate_rows(db %>% 
                           group_by(Genres) %>% 
                           select(Genres) %>% 
                           filter(row_number() ==1),
                           Genres, sep="[[:space:]]")

name_order = names(sort(table(all_genres)))

ggplot(all_genres, aes(Genres)) +
                theme_minimal( ) + 
        geom_bar( stat = "count", fill="#007acc" ) +
        coord_flip() +
        scale_x_discrete(limits = name_order)
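If you want to check the counting logic without the plot, here is a small sketch on three made-up genre combinations. The toy data and the name `genre_counts` are only for illustration.

```r
library(dplyr)
library(tidyr)

# Three invented genre combinations, in the same space-separated format
toy <- tibble(Genres = c("Action Adventure Sci-Fi",
                         "Action Crime Thriller",
                         "Adventure Drama"))

genre_counts <- toy %>%
  distinct(Genres) %>%                           # one row per combination
  separate_rows(Genres, sep = "[[:space:]]") %>% # one row per single genre
  count(Genres, sort = TRUE)                     # combinations per genre

genre_counts
```

Each genre is counted once per combination it appears in, which is what the barplot displays.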

We see that Adventure and Action are the most common genres, followed by those between Comedy and Sci-Fi. The genres that come after Sci-Fi are present in fewer than 60 combinations of genres, so we will treat them as subgenres. That leaves 8 candidate main genres:
• Adventure
• Action
• Comedy
• Drama
• Family
• Thriller
• Sci-Fi
• Fantasy
However, Sci-Fi and Fantasy can be seen as subgenres of Adventure or Action, so we finally keep 6 main genres.
We have to check that every movie can be given a main genre from the 6 we have chosen. To do that, we simply check that each combination contains at least one of the main genres:

mainGenres= paste(c("Adventure", "Action",  "Comedy", 
                    "Drama", "Family", "Thriller"),
                  collapse = "|")

# grepl returns TRUE for each genre combination that contains at least one main genre;
# sum() counts those matches, so the ratio is the share of movies that are covered
sum(grepl(mainGenres, db$Genres))/length(db$Genres)
## [1] 1

Apparently, this is the case 🙂
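Here is a small sketch of this kind of coverage check on made-up combinations, where one of them deliberately has no main genre. The names `main_pattern`, `toy_genres` and `covered` are only for illustration.

```r
# The six main genres joined with "|" so the pattern means "any of these"
main_pattern <- paste(c("Adventure", "Action", "Comedy",
                        "Drama", "Family", "Thriller"),
                      collapse = "|")

# Invented combinations; the last one contains no main genre
toy_genres <- c("Action Sci-Fi", "Drama War", "Horror Mystery")

covered <- grepl(main_pattern, toy_genres)

mean(covered)         # share of combinations with at least one main genre
toy_genres[!covered]  # combinations that would be left without a main genre
```

Note that the share has to be computed from the TRUE values (with `mean()` or `sum()`): the length of the `grepl()` result is always the number of inputs.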


We finally add a main genre to every movie.
Be careful: the main genre of each movie depends on the order in which you assign the main genres, so the final shape of the output depends on this step.

db$Genresl1=ifelse(grepl("Family", db$Genres),
                   "Family", db$Genres)

db$Genresl1=ifelse(grepl("Drama", db$Genresl1),
                   "Drama", db$Genresl1)

db$Genresl1=ifelse( grepl("Thriller", db$Genresl1),
                    "Thriller", db$Genresl1)

db$Genresl1=ifelse(grepl("Action", db$Genresl1),
                   "Action", db$Genresl1)

db$Genresl1 =ifelse(grepl("Adventure", db$Genresl1),
                    "Adventure", db$Genresl1)

db$Genresl1=ifelse(grepl("Comedy", db$Genresl1),
                   "Comedy", db$Genresl1)

Now that the main genres have been assigned, let’s focus on the subgenres.
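Before moving on, here is a small sketch of why the order of assignment matters for the main genre. The helper `assign_main()` is hypothetical: it only repeats the same `ifelse(grepl(...))` step in a loop. Once a label has been replaced by a single genre, the later patterns no longer match it, so the first matching genre wins.

```r
# Hypothetical helper: apply the ifelse(grepl()) step for each genre in turn
assign_main <- function(genres, priority) {
  main <- genres
  for (g in priority) {
    main <- ifelse(grepl(g, main), g, main)
  }
  main
}

combo <- "Action Adventure Comedy"  # invented combination

assign_main(combo, c("Action", "Adventure"))  # "Action"
assign_main(combo, c("Adventure", "Action"))  # "Adventure"
```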


We have seen that only 6 genres are treated as main genres. In this part, however, every genre can be a subgenre. One difficulty is deciding which subgenre to select when there is more than one option. Association rules can help with this task: they show which subgenres occur most often with each genre, and how strongly the two depend on each other.


Let’s analyze the different genre combinations through an association rule analysis. We first need to read the data as transactions, using the arules package.

#no duplicate combinations!
item_genres = read.transactions("",
                                format = "basket", sep=":")

In this post, we focus on 2 association rule indicators: support and confidence.
Both are displayed, as in the result below, when the rules are printed with arules::inspect.

##     lhs              rhs      support     confidence lift     count
## [1] {Documentary} => {Drama}  0.004444444 1.0000000  2.777778  1   
## [2] {War}         => {Drama}  0.057777778 0.9285714  2.579365 13   
## [3] {History}     => {Drama}  0.080000000 0.9473684  2.631579 18   
## [4] {Animation}   => {Family} 0.208888889 0.9591837  2.731852 47

Support indicates how frequently the genres in columns lhs and rhs appear together in the 224 combinations. The second row of the result above means that War and Drama appear together in 5.78% of combinations.

Confidence indicates how often the rule has been found to be true. It can also be seen as a conditional probability: { X => Y } means P(Y | X), the probability that genre Y is present when we already know that genre X is present. { War => Drama } = 0.929, from the second row of the result above, means that Drama is present in 92.9% of the combinations where War is present.
But be careful: this relation is not necessarily true in the opposite direction!
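To see these two indicators at work, here is a base R sketch that computes them by hand on four made-up genre combinations (the baskets and the helper `has()` are invented for this example). It also shows that confidence changes when the rule is reversed.

```r
# Four invented genre combinations
baskets <- list(c("War", "Drama"),
                c("War", "Drama", "Action"),
                c("Drama", "Comedy"),
                c("Action", "Adventure"))

# Hypothetical helper: TRUE for each basket that contains genre g
has <- function(g) vapply(baskets, function(b) g %in% b, logical(1))

both <- has("War") & has("Drama")

# Support: share of all combinations that contain both genres
support_war_drama <- mean(both)

# Confidence {War => Drama}: P(Drama | War)
conf_war_to_drama <- sum(both) / sum(has("War"))

# Confidence {Drama => War}: P(War | Drama), a different number
conf_drama_to_war <- sum(both) / sum(has("Drama"))

c(support = support_war_drama,
  war_to_drama = conf_war_to_drama,
  drama_to_war = conf_drama_to_war)
```

Here every War combination also contains Drama, but only two of the three Drama combinations contain War.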

To see all association rules starting from a confidence level of 30% between 2 genres we write:

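# apriori() takes its thresholds through `parameter`; its default minimum support is 0.1,
# so reproducing the rarer rules shown in this post would also need a lower `support` here
rules = apriori(item_genres, 
                parameter = list(confidence=0.3, minlen=2, maxlen=2))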
ins_rules = inspect(rules) 
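If you want to try the same call without the dataset, here is a minimal sketch on four made-up baskets. The baskets and the `support = 0.25` threshold are assumptions chosen so that every pair is kept. They are not the values used for the results in this post.

```r
library(arules)

# Four invented genre combinations, converted to a transactions object
baskets <- list(c("War", "Drama"),
                c("War", "Drama", "Action"),
                c("Drama", "Comedy"),
                c("Action", "Adventure"))
trans <- as(baskets, "transactions")

# Rules between exactly 2 genres (minlen = maxlen = 2)
rules <- apriori(trans,
                 parameter = list(support = 0.25, confidence = 0.3,
                                  minlen = 2, maxlen = 2),
                 control = list(verbose = FALSE))

# Locate the rule {War} => {Drama} and read its quality measures
i <- which(labels(lhs(rules)) == "{War}" & labels(rhs(rules)) == "{Drama}")
quality(rules)[i, c("support", "confidence")]
```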


If we want to focus on the relationship between subgenres and main genres, we can filter the rhs columns.

mainGenres = unlist(strsplit(mainGenres, "|", fixed = TRUE))
ins_rules = ins_rules %>% 
        #removing the arrow =>
        .[,-2] %>%
        #removing the brackets for both columns, lhs and rhs
        mutate(lhs = trimws(gsub("\\{|\\}","",lhs)),
               rhs = trimws(gsub("\\{|\\}","",rhs))) %>%
        filter(rhs %in% mainGenres) %>%
        group_by(lhs) %>%
        filter(row_number() == 3) %>%
        arrange(lhs, desc(confidence))

## # A tibble: 17 x 6
## # Groups:   lhs [17]
##          lhs       rhs    support confidence      lift count
##        <chr>     <chr>      <dbl>      <dbl>     <dbl> <dbl>
##  1 Adventure    Action 0.29333333  0.5739130 1.1739130    66
##  2 Animation Adventure 0.16888889  0.7755102 1.5173026    38
##  3 Biography Adventure 0.01333333  0.3333333 0.6521739     3
##  4    Comedy Adventure 0.18666667  0.5121951 1.0021209    42
##  5     Crime    Comedy 0.04888889  0.3548387 0.9736428    11
##  6     Drama Adventure 0.13333333  0.3703704 0.7246377    30
##  7    Family Adventure 0.24888889  0.7088608 1.3869015    56
##  8   Fantasy    Action 0.12888889  0.3866667 0.7909091    29
##  9   History Adventure 0.03555556  0.4210526 0.8237986     8
## 10   Musical    Family 0.06222222  0.8750000 2.4920886    14
## 11   Mystery Adventure 0.06666667  0.5000000 0.9782609    15
## 12   Romance    Family 0.05333333  0.3000000 0.8544304    12
## 13    Sci-Fi    Family 0.08888889  0.3125000 0.8900316    20
## 14     Sport    Family 0.02222222  0.5000000 1.4240506     5
## 15  Thriller Adventure 0.13777778  0.4305556 0.8423913    31
## 16       War Adventure 0.02222222  0.3571429 0.6987578     5
## 17   Western Adventure 0.02222222  0.6250000 1.2228261     5


We create a new variable named withoutMainGenres. This variable is the combination of genres without the main genre. If a movie has the combination “Drama War Action Biography” and its main genre is “Drama”, then the value of withoutMainGenres will be “War Action Biography”. If this is not clear enough, I suggest running the code and comparing the variables withoutMainGenres and Genres. Once this new variable is created, we draw a barplot again to see the distribution of genres.

db$withoutMainGenres = trimws(mapply(gsub, db$Genresl1, "", db$Genres))
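Here is the same idea on two made-up rows, so you can see what `mapply(gsub, ...)` does: each main genre is removed from its own combination. The extra `gsub(" +", " ", ...)` step is my own addition for the sketch. It collapses the double space left behind when the main genre sits in the middle of the combination.

```r
# Invented combinations and the main genre assumed for each of them
genres <- c("Drama War Action Biography", "Action Adventure Sci-Fi")
main   <- c("Drama", "Adventure")

# mapply calls gsub(pattern = main[i], replacement = "", x = genres[i]) row by row
without_main <- unname(trimws(mapply(gsub, main, "", genres)))

# Optional clean-up (not in the original code): collapse repeated spaces
without_main <- gsub(" +", " ", without_main)

without_main
```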

all_genres = separate_rows(db %>% 
                           group_by(withoutMainGenres) %>% 
                           select(withoutMainGenres) %>% 
                           filter(row_number() ==1),
                           withoutMainGenres, sep="[[:space:]]") %>% 
             rename( Genres=withoutMainGenres)

name_order = names(sort(table(all_genres)))

ggplot(all_genres, aes(Genres)) +
                theme_minimal( ) + 
        geom_bar( stat = "count", fill="#007acc" ) +
        coord_flip() +
        scale_x_discrete(limits = name_order)

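We see that there are still a lot of Adventure movies. We use the results from the association rules and the barplot to build the subgenres.
We begin with the Animation genre because we want to group all of these movies into the same category. Then we add subgenres in ascending order, from the least important to the most important.
However, the Musical, Music and Horror genres are added at the end of the script because the attribution of these genres to the movies in our dataset is questionable.

db$Genresl2=ifelse(grepl("Animation", db$withoutMainGenres),
                   "Animation", db$withoutMainGenres)
db$Genresl2=ifelse(grepl("Documentary", db$Genresl2),
                   "Documentary", db$Genresl2)
db$Genresl2=ifelse(grepl("Biography", db$Genresl2), 
                   "Biography", db$Genresl2)
db$Genresl2=ifelse(grepl("Western", db$Genresl2),
                   "Western", db$Genresl2)
db$Genresl2=ifelse(grepl("Sport", db$Genresl2),
                   "Sport", db$Genresl2)
db$Genresl2=ifelse(grepl("War", db$Genresl2),
                   "War", db$Genresl2)
db$Genresl2=ifelse(grepl("Mystery", db$Genresl2),
                   "Mystery", db$Genresl2)
db$Genresl2=ifelse(grepl("Romance", db$Genresl2),
                   "Romance", db$Genresl2)
db$Genresl2=ifelse(grepl("Crime", db$Genresl2),
                   "Crime", db$Genresl2)
db$Genresl2=ifelse(grepl("Drama", db$Genresl2),
                   "Drama", db$Genresl2)
db$Genresl2=ifelse(grepl("Fantasy", db$Genresl2),
                   "Fantasy", db$Genresl2)
db$Genresl2=ifelse(grepl("Sci-Fi", db$Genresl2),
                   "Sci-Fi", db$Genresl2)
db$Genresl2=ifelse(grepl("Comedy", db$Genresl2),
                   "Comedy", db$Genresl2)
db$Genresl2=ifelse(grepl("Thriller", db$Genresl2),
                   "Thriller", db$Genresl2)
db$Genresl2=ifelse(grepl("Adventure", db$Genresl2),
                   "Adventure", db$Genresl2)
db$Genresl2=ifelse(grepl("Musical", db$Genresl2),
                   "Musical", db$Genresl2)
# \\b stops "Music" from also matching (and relabelling) "Musical"
db$Genresl2=ifelse(grepl("Music\\b", db$Genresl2),
                   "Music", db$Genresl2)
db$Genresl2=ifelse(grepl("Horror", db$Genresl2),
                   "Horror", db$Genresl2)
# assumed condition (lost in the source): movies with no subgenre left reuse their main genre
db$Genresl2=ifelse(db$Genresl2 == "",
                   db$Genresl1, db$Genresl2)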
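One thing to watch with `grepl()` in this kind of cascade is that a pattern matches substrings, so "Music" also matches "Musical". Here is a small sketch on made-up labels. The word-boundary pattern is my own suggestion, not something taken from the original script.

```r
labels <- c("Musical", "Music", "Music Horror")  # invented labels

# Plain pattern: "Music" is found inside "Musical" as well
plain <- grepl("Music", labels)

# Word boundaries (\\b) only match "Music" as a whole word
whole_word <- grepl("\\bMusic\\b", labels, perl = TRUE)

data.frame(labels, plain, whole_word)
```

If Musical and Music should stay separate subgenres, the whole-word pattern is the safer of the two.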

Now that we have our 2 levels of genres, we can build our treemap!


To design the treemap, we need to group movies by main genre and subgenre, then sum their Worldwide Gross revenue.

summary.Genre = db %>%
        group_by(Genresl1, Genresl2) %>%
        summarise(Sum_Gross = sum(Worldwide.Gross))
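Here is what this aggregation does on three made-up movies (the numbers are invented). In this sketch I also pass `na.rm = TRUE` and `.groups = "drop"`, which are my additions: the first ignores missing grosses, the second returns an ungrouped result.

```r
library(dplyr)

# Three invented movies with their two genre levels and a gross
toy <- tibble(Genresl1 = c("Action", "Action", "Drama"),
              Genresl2 = c("Sci-Fi", "Sci-Fi", "War"),
              Worldwide.Gross = c(100, 50, 30))

toy_summary <- toy %>%
  group_by(Genresl1, Genresl2) %>%
  summarise(Sum_Gross = sum(Worldwide.Gross, na.rm = TRUE),
            .groups = "drop")

toy_summary
```

Each row of the result becomes one tile of the treemap, with an area proportional to `Sum_Gross`.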

Finally we design the treemap using ggplot2 and treemapify:


ggplot(summary.Genre, aes(area = Sum_Gross ,
                          fill = Genresl1, label = Genresl2,
                          subgroup =Genresl1)) +
        geom_treemap() +
        geom_treemap_subgroup_border() +
        geom_treemap_subgroup_text(place = "centre", 
                                   grow = T, 
                                   alpha = 0.5, 
                                   colour = "black", 
                                   fontface = "italic", 
                                   min.size = 0) +
        geom_treemap_text(colour = "white", 
                          place = "topleft", 
                          reflow = T)

Here we have a first result but we can do better by adding some interactivity.


Let’s add some interactivity using the highcharter package. We use the GitHub version, which has more functions.


hctreemap2(data = db,
           group_vars = c("Genresl1", "Genresl2"),
           size_var = "Worldwide.Gross",
           color_var = "Genresl2",
           layoutAlgorithm = "squarified",
           levelIsConstant = FALSE,
           levels = list(
                   list(level = 1, 
                        dataLabels = list(enabled = TRUE)),
                   list(level = 2, 
                        dataLabels = list(enabled = FALSE))
           )) %>% 
        hc_tooltip(pointFormat = "<b>{}</b>:<br>
                   Worldwide Gross: $ {point.value:,.0f}")

The following error message appears:
Error in hctreemap2(data = db, group_vars = c(“Genresl1”, “Genresl2”) :
Treemap data uses same label at multiple levels.

We can’t design a two-level treemap with highcharter here because main genres and subgenres share some labels. R is a great tool for data manipulation, but in this case JavaScript is the better tool for visualization.

We can easily design a responsive two-level treemap with the Highcharts library in JavaScript.
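If you would rather stay in R, one idea is to make the labels unique across the two levels before plotting. This is only a sketch on made-up rows: the column `Genresl2_unique` and the " (sub)" suffix are hypothetical, and I have not checked that this is enough for hctreemap2 to accept the data.

```r
# Invented rows where some subgenres reuse a main genre label
toy <- data.frame(Genresl1 = c("Action", "Drama", "Comedy"),
                  Genresl2 = c("Drama", "War", "Comedy"),
                  stringsAsFactors = FALSE)

# Labels that appear at both levels
shared <- intersect(unique(toy$Genresl1), unique(toy$Genresl2))

# Hypothetical fix: add a suffix to the level-2 copies of those labels
toy$Genresl2_unique <- ifelse(toy$Genresl2 %in% shared,
                              paste0(toy$Genresl2, " (sub)"),
                              toy$Genresl2)

shared
intersect(toy$Genresl1, toy$Genresl2_unique)  # no label shared any more
```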

Don’t hesitate to follow us on Twitter @rdata_lu and to subscribe to our YouTube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!
