Yet Another Movie: IMDB Top 250 movies

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]

I’m not a big movie person. Nonetheless, I have a media library with quite a few films in it, and I wondered how many “films to see before you die”-type movies I had in the collection, and how many were missing. I used R to find the answers.

I’ve described previously how to get a plain text dump of a Plex database using WebTools-NG. I did that for the Movies library of my Plex Media Server. Next, I needed a list of “films to see before you die”. I searched a bit and found a few text files which claimed to be meta-rated as the best, but I was a bit suspicious about these. In the end, I figured I should just use the IMDB’s Top 250 Movies, which could be scraped with rvest.

The code

Let’s get the Top 250 movies:

library(rvest)
library(XML)
library(xml2)
library(fuzzyjoin)
library(dplyr)

# IMDB Top 250 Movies are here
url <- "http://www.imdb.com/chart/top?ref_=nv_wl_img_3"
page <- read_html(url)

movie.nodes <- html_nodes(page,'.titleColumn a')
movie.name <- html_text(movie.nodes)
sec <- html_nodes(page,'.secondaryInfo')
# to get the year, strip the surrounding parentheses from the text
year <- as.numeric(gsub("[()]", "", html_text(sec)))

rating.nodes <- html_nodes(page,'.imdbRating strong')
rating <- as.numeric(html_text(rating.nodes))

imdb <- data.frame(Title = movie.name,
                   Year = year,
                   Rating = rating)

Now we have a data frame of the movies, with Title, Year and the IMDB rating.
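
To see how those CSS selectors pick out each field, here is a minimal sketch that runs against a small hand-written HTML fragment rather than the live page. The fragment is an assumption: it only mimics the class names targeted above (titleColumn, secondaryInfo, imdbRating), and the real IMDB markup may differ or change over time. It uses html_elements(), which is the current rvest name for html_nodes().

```r
library(rvest)

# Hypothetical fragment mimicking the class names targeted above;
# this is NOT the real IMDB page source.
snippet <- '<table><tbody>
<tr>
  <td class="titleColumn">1. <a href="#">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
</tr>
<tr>
  <td class="titleColumn">2. <a href="#">12 Angry Men</a>
    <span class="secondaryInfo">(1957)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.0</strong></td>
</tr>
</tbody></table>'

page <- read_html(snippet)

# the link text inside each title cell is the movie title
titles <- html_text(html_elements(page, ".titleColumn a"))

# the year arrives as "(1994)", so strip the parentheses before converting
years <- as.numeric(gsub("[()]", "", html_text(html_elements(page, ".secondaryInfo"))))

# the rating is the bold text inside the rating cell
ratings <- as.numeric(html_text(html_elements(page, ".imdbRating strong")))

toy <- data.frame(Title = titles, Year = years, Rating = ratings,
                  stringsAsFactors = FALSE)
print(toy)
```

If the three vectors ever come back with different lengths, data.frame() will stop with an error, which is a useful early warning that a selector no longer matches the page.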

We can load in the Plex library so that we can match them up, but we don’t need all the data.

libfile <- file.choose()
libdf <- read.delim(libfile,sep = "|")

# we only need title and year
pms <- libdf %>% 
  select(Title,Year)
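
For anyone without a Plex dump to hand, the same read-and-select step can be tried on an inline string. This is only a sketch: the three-column layout and the Studio values are made up, and the real export will have whatever columns WebTools-NG was asked to write.

```r
library(dplyr)

# Hypothetical pipe-delimited export; the real columns depend on the Plex dump.
raw <- "Title|Year|Studio\nAlien|1979|Studio A\nAladdin|1992|Studio B"

# text = lets read.delim() parse a string instead of a file on disk
toy_lib <- read.delim(text = raw, sep = "|", stringsAsFactors = FALSE)

# keep only the two columns needed for matching
toy_pms <- toy_lib %>%
  select(Title, Year)
print(toy_pms)
```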

Now we have two data frames to perform the matching.

The first issue is that we can’t simply use the titles for matching because remakes and different versions of movies will cause a mismatch. To get around this we can use Title and Year as a combination for fuzzy matching.

# to match movies, we need Title-Year combination
imdb$titleyear <- paste(imdb$Title, imdb$Year)
pms$titleyear <- paste(pms$Title,pms$Year)

# fuzzy matching
match <- stringdist_join(imdb, pms, 
                         by = 'titleyear',
                         mode ='left',
                         method = "jw",
                         max_dist = 99, # could set this a lot lower 
                         distance_col = 'dist') %>% 
  group_by(titleyear.x) %>% 
  slice_min(order_by = dist, n = 1) # gives the best match found

Fuzzy matching is needed because a simple string comparison will get derailed pretty easily by capitalisation and other minor issues. So we need something a little more forgiving to do the matching.

Now we can have a look at the matches by typing match:

> # have a look at matches
> match
# A tibble: 254 × 8
# Groups:   titleyear.x [250]
   Title.x               Year.x Rating titleyear.x                Title.y                   Year.y titleyear.y           dist
   <chr>                  <dbl>  <dbl> <chr>                      <chr>                     <chr>  <chr>                <dbl>
 1 12 Angry Men            1957    9   12 Angry Men 1957          12 Monkeys                1995   12 Monkeys 1995     0.282 
 2 12 Years a Slave        2013    8.1 12 Years a Slave 2013      Oz the Great and Powerful 2013   Oz the Great and P… 0.348 
 3 1917                    2019    8.2 1917 2019                  Cats                      2019   Cats 2019           0.296 
 4 2001: A Space Odyssey   1968    8.3 2001: A Space Odyssey 1968 2001: A Space Odyssey     1968   2001: A Space Odys… 0     
 5 3 Idiots                2009    8.3 3 Idiots 2009              The Incredibles           2004   The Incredibles 20… 0.286 
 6 A Beautiful Mind        2001    8.2 A Beautiful Mind 2001      Beautiful Noise           2014   Beautiful Noise 20… 0.312 
 7 A Clockwork Orange      1971    8.2 A Clockwork Orange 1971    A Clockwork Orange        1972   A Clockwork Orange… 0.0290
 8 A Separation            2011    8.2 A Separation 2011          Separado!                 2010   Separado! 2010      0.189 
 9 Aladdin                 1992    8   Aladdin 1992               Aladdin                   1992   Aladdin 1992        0     
10 Alien                   1979    8.4 Alien 1979                 Alien                     1979   Alien 1979          0     
#  244 more rows
#  Use `print(n = ...)` to see more rows

We have several perfect matches in the first 10 rows. These have a distance of 0. There are some less-good-but-still-matches, such as A Clockwork Orange, where the year differs between IMDB and Plex. Then there are a bunch of clear “not matched” movies, e.g. 12 Angry Men and 12 Years a Slave. In these rows, a distance of 0.1 or more means the match is not a true one, so I used 0.1 as the cutoff.
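
These three cases can be reproduced outside the join by calling stringdist() from the stringdist package directly, which is what fuzzyjoin relies on for its string distances. This is a minimal sketch, assuming stringdist is installed. The first three pairs are taken from the rows above; the lowercase pair is a made-up example to illustrate the capitalisation problem. Note that with the default prefix factor (p = 0), method = "jw" gives the plain Jaro distance.

```r
library(stringdist)

imdb_key <- c("Alien 1979",
              "A Clockwork Orange 1971",
              "12 Angry Men 1957",
              "alien 1979")            # hypothetical lowercase variant
plex_key <- c("Alien 1979",
              "A Clockwork Orange 1972",
              "12 Monkeys 1995",
              "Alien 1979")

# element-wise distance between each IMDB key and its Plex counterpart
d <- stringdist(imdb_key, plex_key, method = "jw")

# identical strings score 0; a one-character difference stays well under 0.1;
# two different films land well above 0.1
print(data.frame(imdb_key, plex_key, dist = round(d, 4)))
```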

Note that it says there are 244 more rows and shows us 10 (a total of 254 when we should have only 250). The 4 extra matches are duplicates caused by a same-distance match to two different movies in the Plex library. Let’s get rid of them and then figure out our totals.

# remove duplicate fuzzy match fails
match <- match[!duplicated(match[ , "titleyear.x"]), ]
# leaves 250 rows
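
An alternative sketch for avoiding the duplicates in the first place: slice_min() keeps ties by default, and passing with_ties = FALSE makes it return exactly n rows per group. The toy data frame here is made up for illustration. As with duplicated(), which of the tied rows survives is simply the first one in row order.

```r
library(dplyr)

# Hypothetical join result: "A 2000" has two equally distant candidates
toy <- data.frame(
  titleyear.x = c("A 2000", "A 2000", "B 2001"),
  titleyear.y = c("X 1999", "Y 1998", "B 2001"),
  dist        = c(0.3, 0.3, 0),
  stringsAsFactors = FALSE
)

# default behaviour: tied rows are all kept, so group "A 2000" yields 2 rows
kept_ties <- toy %>%
  group_by(titleyear.x) %>%
  slice_min(order_by = dist, n = 1)

# with_ties = FALSE: exactly one row per group
one_each <- toy %>%
  group_by(titleyear.x) %>%
  slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(kept_ties)
nrow(one_each)
```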

# matches with distance of 0.1 or more are not a match
# these are the movies we want to look at
match <- match %>% 
  filter(dist >= 0.1)
# leaves 190 rows
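
If you would rather keep both halves than overwrite match, the same cutoff can split the table into owned and missing films. A small sketch using four of the distances printed earlier; the names threshold, owned and missing are my own and not part of the code above.

```r
# Four rows and distances copied from the printed matches
toy <- data.frame(
  titleyear.x = c("2001: A Space Odyssey 1968", "A Clockwork Orange 1971",
                  "12 Angry Men 1957", "12 Years a Slave 2013"),
  dist        = c(0, 0.0290, 0.282, 0.348),
  stringsAsFactors = FALSE
)

threshold <- 0.1  # cutoff chosen by eye from the printed matches

owned   <- toy[toy$dist <  threshold, ]  # close enough to count as in the library
missing <- toy[toy$dist >= threshold, ]  # best candidate is a different film

# the two halves always add up to the full list
nrow(owned) + nrow(missing) == nrow(toy)
```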

So I have 60 of the IMDB’s Top 250 Movies. This is not very high. In my defence, I am not a movie buff and my movie collection is not particularly huge.

So what are those movies that I am missing? Let’s sort them by rating, highest first, and figure out what I should add with some urgency!

match <- match[order(-match$Rating),]

# write the missing titles to a file, one per line
# (the Output/Data directory must already exist)
write(match$titleyear.x, "Output/Data/imdb.txt", append = TRUE)
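
If you want the year and rating kept as separate columns in the file, write.table() is one option. This is a sketch rather than what I actually ran: the data frame holds two rows from the list of missing films, and it writes to a temporary file so that it runs anywhere.

```r
# Two rows from the list of missing films, as a stand-in for the full table
missing <- data.frame(
  Title  = c("The Shawshank Redemption", "12 Angry Men"),
  Year   = c(1994, 1957),
  Rating = c(9.2, 9.0),
  stringsAsFactors = FALSE
)

# temporary path for illustration; swap in a real path such as Output/Data/imdb.txt
out <- tempfile(fileext = ".txt")

# tab-separated, no quoting, no row names
write.table(missing, out, sep = "\t", quote = FALSE, row.names = FALSE)

# read it back to check that the columns survived the round trip
back <- read.delim(out, stringsAsFactors = FALSE)
print(back)
```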

Title                           Year  Rating
The Shawshank Redemption        1994  9.2
12 Angry Men                    1957  9
The Dark Knight                 2008  9
Schindler’s List                1993  8.9
The Good, the Bad and the Ugly  1966  8.8
Fight Club                      1999  8.7
Inception                       2010  8.7
Interstellar                    2014  8.6
It’s a Wonderful Life           1946  8.6
Life Is Beautiful               1997  8.6

The top 10 films I was missing…

I have at least seen some of those films at some point in the past.

The post title comes from “Yet Another Movie” by Pink Floyd from “A Momentary Lapse of Reason”.
