Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine.

Then last week I was reading through a back issue of The Linux Journal and came across an article which used shell tools to download and process the IMDb list of Top 250 Movies. This list is constructed from IMDb users’ votes and so represents a fairly democratic and egalitarian perspective. Working through a list of movies seems to me to be a lot easier than a list of books, so this appealed to my inner sloth. And gave me an idea for a quick little R script.

We will use the XML library to retrieve the page from IMDb and parse out the appropriate table.

> library(XML)
>
> url <- "http://www.imdb.com/chart/top"
> best.movies <- readHTMLTable(url, which = 2, stringsAsFactors = FALSE)
> head(best.movies)
1   1.    9.2       The Shawshank Redemption (1994) 1,065,332
2   2.    9.2                  The Godfather (1972)   746,693
3   3.    9.0         The Godfather: Part II (1974)   484,761
4   4.    8.9                   Pulp Fiction (1994)   825,063
5   5.    8.9 The Good, the Bad and the Ugly (1966)   319,222
6   6.    8.9                The Dark Knight (2008) 1,039,499
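
To see what the which argument is doing without hitting the network, here is a minimal sketch that runs readHTMLTable() on a small hand-written HTML fragment. The fragment and the name sample.movies are made up for illustration (the two data rows are copied from the output above); the real page layout is whatever IMDb serves.

```r
library(XML)

# Hypothetical stand-in for the downloaded page: two tables, of which
# only the second holds the ratings.
html <- '
<html><body>
<table>
  <tr><th>Layout</th></tr>
  <tr><td>not the table we want</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Rating</th><th>Title</th><th>Votes</th></tr>
  <tr><td>1.</td><td>9.2</td><td>The Shawshank Redemption (1994)</td><td>1,065,332</td></tr>
  <tr><td>2.</td><td>9.2</td><td>The Godfather (1972)</td><td>746,693</td></tr>
</table>
</body></html>'

# Parse the string as HTML rather than treating it as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# which = 2 picks the second <table> in the document;
# stringsAsFactors = FALSE keeps every column as plain character data.
sample.movies <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)
str(sample.movies)
```

Every column comes back as a string, which is why some cleaning is needed before the ratings and votes can be treated as numbers.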


The output reflects the content of the rating table exactly. However, the rank column is redundant since the same information is captured by the row labels. We can remove this column to make the data more concise.

> best.movies[, 1] <- NULL
> head(best.movies)
1    9.2       The Shawshank Redemption (1994) 1,065,332
2    9.2                  The Godfather (1972)   746,693
3    9.0         The Godfather: Part II (1974)   484,761
4    8.9                   Pulp Fiction (1994)   825,063
5    8.9 The Good, the Bad and the Ugly (1966)   319,222
6    8.9                The Dark Knight (2008) 1,039,499


There are still a few issues with the data:

• the years are bundled up with the titles;
• the rating data are strings;
• the votes data are also strings and have embedded commas.

All of these problems are easily fixed though.

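> # Two capture groups: the title, then whatever sits inside the trailing brackets.
> pattern = "(.*) \\((.*)\\)$"
>
> best.movies = transform(best.movies,
+                       Rating = as.numeric(Rating),
+                       Year   = as.integer(substr(gsub(pattern, "\\2", Title), 1, 4)),
+                       Title  = gsub(pattern, "\\1", Title),
+                       Votes  = as.integer(gsub(",", "", Votes))
+ )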
>
> best.movies = best.movies[, c(4, 2, 3, 1)]
> head(best.movies)
1 1994       The Shawshank Redemption 1065332    9.2
2 1972                  The Godfather  746693    9.2
3 1974         The Godfather: Part II  484761    9.0
4 1994                   Pulp Fiction  825063    8.9
5 1966 The Good, the Bad and the Ugly  319222    8.9
6 2008                The Dark Knight 1039499    8.9
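
For anyone who wants to poke at the cleaning step in isolation, here is a small sketch of the three fixes applied to two rows copied from the raw output further up. The names raw and clean are only for illustration, and base R is all it needs.

```r
# Two rows as they come out of the scrape: every column is a string.
raw <- data.frame(
  Rating = c("9.2", "8.9"),
  Title  = c("The Shawshank Redemption (1994)",
             "The Good, the Bad and the Ugly (1966)"),
  Votes  = c("1,065,332", "319,222"),
  stringsAsFactors = FALSE
)

# Group 1 is the title, group 2 is the text inside the trailing brackets.
pattern <- "(.*) \\((.*)\\)$"

clean <- data.frame(
  # Keep the first four characters of the bracketed text as the year.
  Year   = as.integer(substr(sub(pattern, "\\2", raw$Title), 1, 4)),
  # Keep only the text in front of the brackets as the title.
  Title  = sub(pattern, "\\1", raw$Title),
  # Drop the thousands separators before converting to integer.
  Votes  = as.integer(gsub(",", "", raw$Votes)),
  Rating = as.numeric(raw$Rating),
  stringsAsFactors = FALSE
)

str(clean)
```

Note that the commas inside "The Good, the Bad and the Ugly" survive, because the comma removal is applied to the votes column only.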


I am happy to see that The Good, the Bad and the Ugly rates at number 5. This is one of my favourite movies! Clearly I am not alone.

Finally, to gain a little perspective on the relationship between release year, votes and rating, we can put together a simple bubble plot.

> library(ggplot2)
>
> ggplot(best.movies, aes(x = Year, y = Rating)) +
+   geom_point(aes(size = Votes), alpha = 0.5, position = "jitter", color = "darkgreen") +
+   scale_size(range = c(3, 15)) +
+   theme_classic()
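
If you just want to try the plot without scraping anything, the sketch below rebuilds it from the six cleaned rows shown earlier. The data frame top.six is typed in by hand from that output, and the jitter widths and seed are arbitrary choices of mine rather than anything tuned.

```r
library(ggplot2)

# The six cleaned rows from the output above, entered by hand.
top.six <- data.frame(
  Year   = c(1994L, 1972L, 1974L, 1994L, 1966L, 2008L),
  Title  = c("The Shawshank Redemption", "The Godfather",
             "The Godfather: Part II", "Pulp Fiction",
             "The Good, the Bad and the Ugly", "The Dark Knight"),
  Votes  = c(1065332L, 746693L, 484761L, 825063L, 319222L, 1039499L),
  Rating = c(9.2, 9.2, 9.0, 8.9, 8.9, 8.9),
  stringsAsFactors = FALSE
)

p <- ggplot(top.six, aes(x = Year, y = Rating)) +
  # Bubble size follows the vote count; a fixed seed makes the jitter repeatable.
  geom_point(aes(size = Votes), alpha = 0.5, color = "darkgreen",
             position = position_jitter(width = 0.5, height = 0.02, seed = 1)) +
  # range sets the smallest and largest point sizes used for Votes.
  scale_size(range = c(3, 15)) +
  theme_classic()

print(p)
```

The jitter matters because ratings are only given to one decimal place, so several movies from the same year would otherwise sit exactly on top of one another.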


When I have some more time on my hands I am going to use the IMDb API to grab some additional information on each of these movies and see if anything interesting emerges from the larger data set.