IMDB Movie Analysis

[This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers.]

Movies of late haven’t impressed (this viewer, anyhow), and I keep finding myself returning to classics of the 1980s and 1990s to get my movie fix. While browsing through Kaggle datasets, I came across the IMDB 5000 Movie Dataset, which contains data on over 5000 movies scraped from the IMDB website. I dove into the data to see what I could find.

Directors of the Top 10 Grossing Movies

James Cameron’s “Titanic” leads the way with gross earnings of $658,672,302. This movie is, of course, one of the most well known of all time, but what other movies are amongst the top earners? We see that Colin Trevorrow’s “Jurassic World”, Joss Whedon’s “The Avengers” and James Cameron’s “Avatar” have also done very well at the box office. Other directors in this top tier are George Lucas (Star Wars: Episode I – The Phantom Menace, Star Wars: Episode IV – A New Hope), Christopher Nolan (The Dark Knight, The Dark Knight Rises) and Andrew Adamson (Shrek 2).

I must admit, it pleases me greatly to see the sci-fi genre doing so well. Apart from “Titanic” and “Shrek 2”, the top earning movies are all sci-fi. The barplot below is stacked when a director has more than one movie in this top 10 gross earnings subset. We see James Cameron takes the number one slot as the director of the movies with the highest gross earnings. I can’t believe that I still haven’t seen “Titanic”.
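For readers who want to reproduce this kind of summary, here is a minimal Python sketch of the two steps behind the barplot: pick the top earners, then group them by director so that each movie becomes one segment of that director's bar. It is not the code behind the plot. The field names `movie_title`, `director_name` and `gross` are assumed to match the columns of the Kaggle CSV, and the sample rows are made up purely for illustration.

```python
from collections import defaultdict


def top_grossing(movies, n=10):
    """Return the n rows with the largest 'gross', highest first."""
    # Rows with a missing gross are skipped rather than treated as zero.
    with_gross = [m for m in movies if m.get("gross") is not None]
    return sorted(with_gross, key=lambda m: m["gross"], reverse=True)[:n]


def gross_by_director(movies):
    """Group rows by director; each (title, gross) pair is one bar segment."""
    segments = defaultdict(list)
    for m in movies:
        segments[m["director_name"]].append((m["movie_title"], m["gross"]))
    # Order directors by their combined gross, largest first.
    return sorted(
        segments.items(),
        key=lambda item: sum(gross for _, gross in item[1]),
        reverse=True,
    )


# Made-up rows for illustration only; not taken from the dataset.
sample = [
    {"movie_title": "Movie A", "director_name": "Director X", "gross": 300},
    {"movie_title": "Movie B", "director_name": "Director Y", "gross": 500},
    {"movie_title": "Movie C", "director_name": "Director X", "gross": 400},
    {"movie_title": "Movie D", "director_name": "Director Z", "gross": None},
]

top = top_grossing(sample, n=3)
stacked = gross_by_director(top)
for director, parts in stacked:
    print(director, parts)
```

With the real file, the rows could be read with `csv.DictReader` and the `gross` strings converted to numbers (empty strings to `None`) before calling `top_grossing`.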



Top Actors/Actresses

What does the distribution of lead actors/actresses look like in the top 10 grossing movies subset? Chris Hemsworth tops the list for his role as Thor in “The Avengers” and “Avengers: Age of Ultron”. There are just three women in the list: CCH Pounder, who starred in “Avatar”, Bryce Dallas Howard (Jurassic World) and Natalie Portman (Star Wars: Episode I – The Phantom Menace). Leonardo DiCaprio (Titanic), Christian Bale (The Dark Knight), Harrison Ford (Star Wars: Episode IV – A New Hope), Tom Hardy (The Dark Knight Rises) and Rupert Everett (Shrek 2) make up the rest of the field.



Change in Gross Earnings Over Time

The movie business pulls in big dollars, but inflation throughout the 20th century has surely had an impact on ticket prices and, by association, on gross earnings at the box office. The line graph below shows how the maximum gross earnings changed between 1920 and 2016. Since the 1960s there has been quite an upward trend in the maximum gross earnings a successful movie can make, although there is still a lot of inherent variability in these data. Neither the mean nor the minimum gross earnings appears to change much. Why do more recent successful movies pull in higher gross earnings? Inflation of ticket prices over time, perhaps?
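The three series in a plot like this are simply the minimum, mean and maximum of gross earnings within each release year. A minimal Python sketch of that grouping follows; it is not the code used for the graph. The field names `title_year` and `gross` are assumed to match the Kaggle CSV columns, and the sample rows are made up.

```python
from collections import defaultdict
from statistics import mean


def gross_summary_by_year(movies):
    """Return {year: {'min', 'mean', 'max'}} of gross earnings per release year."""
    by_year = defaultdict(list)
    for m in movies:
        # Skip rows where either the year or the gross is missing.
        if m.get("title_year") is None or m.get("gross") is None:
            continue
        by_year[int(m["title_year"])].append(m["gross"])
    return {
        year: {"min": min(values), "mean": mean(values), "max": max(values)}
        for year, values in sorted(by_year.items())
    }


# Made-up rows for illustration only; not taken from the dataset.
sample_years = [
    {"title_year": 1990, "gross": 10},
    {"title_year": 1990, "gross": 30},
    {"title_year": 2000, "gross": 20},
    {"title_year": 2000, "gross": 40},
    {"title_year": 2000, "gross": 90},
    {"title_year": None, "gross": 55},
]

yearly = gross_summary_by_year(sample_years)
for year, stats in yearly.items():
    print(year, stats)
```

Plotting the `max`, `mean` and `min` values against the year gives the three lines; a single blockbuster is enough to move the maximum while leaving the mean almost unchanged.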




Are IMDB Score and Gross Earnings Related?

An obvious assumption would be that higher rated movies will have done better at the box office. We can check the IMDB scores for the movies in the dataset and produce the following barplot using rating categories along the x-axis.




As expected, movies with a higher score on IMDB tend to have done better at the box office. We know nothing about the variation within these categories, however, as the values given are means. Surely not all movies rated between 9 and 10 earned around $150,000,000? The next barplot includes error bars which represent the standard deviation of gross earnings within each category. Here, we see that the variation in gross earnings is very large for the 9-10 IMDB score category. Clearly not all movies rated between 9 and 10 earned as much. The standard deviations for the other categories are far less dramatic.
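The bar heights and error bars can be sketched as a mean and a sample standard deviation per rating band. The Python below is a minimal illustration, not the code behind the plots. It assumes one-point bands (with a score of exactly 10 placed in the 9-10 band), the field names `imdb_score` and `gross` from the Kaggle CSV, and made-up sample rows.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def score_bin(score):
    """Label the one-point rating band, e.g. 7.3 -> '7-8'; a 10 joins '9-10'."""
    low = min(int(math.floor(score)), 9)
    return f"{low}-{low + 1}"


def gross_by_score_bin(movies):
    """Return {band: {'n', 'mean', 'sd'}} of gross earnings per rating band."""
    groups = defaultdict(list)
    for m in movies:
        if m.get("imdb_score") is None or m.get("gross") is None:
            continue
        groups[score_bin(m["imdb_score"])].append(m["gross"])
    summary = {}
    for label, values in sorted(groups.items()):
        # stdev needs at least two values; report None for a single-movie band.
        sd = stdev(values) if len(values) > 1 else None
        summary[label] = {"n": len(values), "mean": mean(values), "sd": sd}
    return summary


# Made-up rows for illustration only; not taken from the dataset.
sample_scores = [
    {"imdb_score": 7.1, "gross": 100},
    {"imdb_score": 7.9, "gross": 300},
    {"imdb_score": 9.2, "gross": 50},
    {"imdb_score": 9.5, "gross": 950},
    {"imdb_score": 8.4, "gross": 400},
]

bands = gross_by_score_bin(sample_scores)
for label, stats in bands.items():
    print(label, stats)
```

Reporting `n` alongside the mean is a deliberate choice: a band containing only a handful of movies, as the top band is likely to, can show a high mean and a very wide error bar at the same time.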




How Much of an Impact Does Ticket Price Inflation Have?

I scraped gross earnings data from Box Office Mojo that have been adjusted for ticket price inflation (using $8.61, the 2016 average ticket price) and used them to create the following barplot. Things look a lot different after this adjustment. James Cameron remains with “Titanic”, as does George Lucas with “Star Wars: Episode IV – A New Hope”. However, all of the others from earlier are absent from this new top 10 subset. “Gone with the Wind”, directed by Victor Fleming, tops the list of the highest gross earnings when the data are adjusted.
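The idea behind a ticket-price adjustment can be sketched as follows: estimate how many tickets a movie sold by dividing its gross by the average ticket price at the time, then multiply that count by the current average price. The Python below is an illustration of that arithmetic under those assumptions, not Box Office Mojo's actual procedure. The function name `adjust_gross` and the example inputs are hypothetical; only the $8.61 figure comes from the text above.

```python
AVG_TICKET_PRICE_2016 = 8.61  # 2016 average ticket price quoted in the article


def adjust_gross(gross, price_at_release, price_now=AVG_TICKET_PRICE_2016):
    """Re-express a gross in current dollars via estimated ticket sales."""
    if price_at_release <= 0:
        raise ValueError("price_at_release must be positive")
    estimated_tickets = gross / price_at_release
    return estimated_tickets * price_now


# Hypothetical inputs for illustration: a $100 million gross earned when the
# average ticket cost $4.00 corresponds to 25 million estimated tickets.
example = adjust_gross(gross=100_000_000, price_at_release=4.00)
print(f"{example:,.0f}")
```

Because older movies sold their tickets at much lower prices, the same nominal gross translates into many more estimated admissions, which is why long-running classics climb the adjusted ranking.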




So there you have it! Even though modern-day movies are endowed with ever more spectacular special effects and stunts, classics from the 1940s to the 1990s look to have been more successful in terms of gross earnings. One must remember, though, that movies such as “Bambi” and “Pinocchio” have had multiple releases, all of which contribute to their overall gross earnings. Had I been born a few decades earlier, I would have been among the purchasers of the tickets that contributed to the earnings of these classics. It is interesting that “Jaws” is in there too, as I mention this movie in another post of mine which analyses global shark attack data. Check it out here.


