Fun stuff with subtitles or "The Tarantino Threshold"

January 13, 2013

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Fortunately, there is a page called www.opensubtitles.org, where you can get subtitle (.SRT) files for virtually every movie. Now let's see what we can do with these. SRT files are in plain text format (human readable) and can thus be read quite easily with R.

The first thing we need is a function that reads an SRT file. This function is quite long and boring. I won't talk you through it. Here it is. If you ever use this function, please tell people where you got it! Note that it needs the SRT files to be UTF-8 encoded.


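read.srt <- function (file) {
  # Read all non-empty lines (scan() skips blank lines by default)
  scan.file <- scan(file, what = "character", sep = "\n", encoding = "UTF-8", quiet = TRUE)
  # Timestamp lines contain "-->" and mark the start of each subtitle
  arrows <- grep("-->", scan.file, fixed = TRUE)
  subtitles <- c()
  for (arrow.i in seq_along(arrows)) {
    first <- arrows[arrow.i] + 1
    if (arrow.i < length(arrows)) {
      # Text ends two lines before the next timestamp (the line in between is the index)
      last <- arrows[arrow.i + 1] - 2 }
    else {
      # Last subtitle: the text runs to the end of the file
      last <- length(scan.file) }
    if (first <= last) {
      subtitles <- c(subtitles, scan.file[first:last]) } }
  words <- c()
  for (sent in subtitles) {
    # Strip italics tags and punctuation, lowercase, split on spaces
    sent <- gsub("<i>", "", sent)
    sent <- gsub("</i>", "", sent)
    sent <- tolower(gsub("[[:punct:]]", "", sent))
    sent.spl <- strsplit(sent, " ", fixed = TRUE)[[1]]
    words <- c(words, sent.spl) }
  # Drop empty strings left over from double spaces
  words[words != ""] }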
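
To see what the function is slicing, here is a minimal sketch on a made-up two-entry SRT file (the file content is invented for illustration, it is not taken from any real movie). It only repeats the first steps of read.srt(): scanning the lines and locating the timestamp lines.

```r
# A made-up two-entry SRT file, written to a temporary location
srt.lines <- c("1",
               "00:00:01,000 --> 00:00:03,000",
               "Hello there.",
               "",
               "2",
               "00:00:04,000 --> 00:00:06,000",
               "<i>How are you?</i>",
               "I am fine.")
tmp <- tempfile(fileext = ".srt")
writeLines(srt.lines, tmp)

# scan() drops the blank separator line between the entries
scan.file <- scan(tmp, what = "character", sep = "\n", encoding = "UTF-8", quiet = TRUE)

# Timestamp lines are the only ones containing "-->"
arrows <- grep("-->", scan.file, fixed = TRUE)
print(arrows)

# Text of the first entry: from the line after its timestamp
# up to two lines before the next timestamp (skipping the index line "2")
first.text <- scan.file[(arrows[1] + 1):(arrows[2] - 2)]
print(first.text)
```

Because the blank lines are gone after scanning, the only line between one entry's text and the next timestamp is the entry number, which is why the offset is 2.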

This function returns a looong vector with all the words spoken in the respective movie. With this vector, we can do more stuff. One particularly interesting thing is bad language: swear words. Let us compare movies in terms of what percentage of all the words spoken in the movie are considered "bad words". So I defined a bad-word list (f-word, several excremental expressions, expressions for female and male genitalia, expressions for the buttocks and so on), stemmed all the words in a vector (using the Rstem package) and searched it using regular expressions (we'll come back to this later). The function for this is called swear.ratio(). It's been some time since I wrote this function, and today I would use sets in the regular expressions. But it works...


library(Rstem)  # provides wordStem()

swear.ratio <- function (words, language = "english") {
  if (language == "english") {
    swear.words <- <<insert many mean words here>> }
  else {
    stop("Only language == 'english' implemented.") }
  n.words <- length(words)
  stems <- wordStem(words, language = language)
  # Count the stems matched by each swear word pattern (substring match)
  n.swear <- 0
  for (swear.word in swear.words) {
    n.swear.word <- length(grep(swear.word, stems))
    n.swear <- n.swear + n.swear.word }
  cat(round((n.swear / n.words) * 100, 3), "percent swear words.\n")
  # Return the percentage of swear words among all words
  n.swear / n.words * 100 }
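
One way the loop could be collapsed is to join all patterns into a single regular expression. This is only a sketch under assumptions: the two harmless placeholder words and the toy stem vector below are made up and are not the list used for the results in this post.

```r
# Harmless placeholder patterns standing in for the real list (assumption)
swear.words <- c("darn", "heck")
# Toy vector of already stemmed words (assumption)
stems <- c("what", "the", "heck", "darn", "it", "heckler")

# One alternation instead of a loop over the patterns
pattern <- paste(swear.words, collapse = "|")

# grepl() flags each stem that matches at least one pattern
n.swear <- sum(grepl(pattern, stems))
ratio <- n.swear / length(stems) * 100
print(ratio)
```

Note one difference to the loop: here every stem is counted at most once, while the loop counts a stem once per pattern it matches. Like the loop, this is still a substring match, so "heckler" is counted as well.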

Now I search for all the SRT files in one directory, read those files and put them into a list. The element names in this list are the names of the SRT files without the ".srt". In my case, these are the movie titles. 

srt.files <- list.files(<<path to SRT files>>, full.names = TRUE)
srt.list <- list()
for (srt in srt.files) {
  # The movie name is the file name without directory and ".srt" extension
  movie.name <- strsplit(srt, "/", fixed = TRUE)[[1]]
  movie.name <- gsub(".srt", "", movie.name[length(movie.name)], fixed = TRUE)
  srt.list[[movie.name]] <- read.srt(srt) }
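
As a side note, base R has helpers that do the same name extraction without splitting on "/". A minimal sketch, assuming a hypothetical file path:

```r
# Hypothetical path to one SRT file (not a real location)
srt <- "/some/folder/fight club.srt"

# basename() drops the directory, file_path_sans_ext() drops the extension
movie.name <- tools::file_path_sans_ext(basename(srt))
print(movie.name)  # "fight club"
```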

What we get for one movie is this. These are the first 9 words spoken in the third movie in the list. Do you know the movie? I'm sure you do!

> srt.list[[3]][1:9]
[1] "people" "always" "ask"    "me"     "if"     "i"      "know"   "tyler"  "durden"

So now I iterate over this list and create a data frame with movie information, for now only the swear ratio and the type-token ratio using word stems (the number of different words divided by the number of all words, maximum 1).

type.token <- function (words, language = "english") {
  stems <- wordStem(words, language = language)
  unique.stems <- unique(stems)
  length(unique.stems) / length(stems) }
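
To make the measure concrete, here is a small worked example on a made-up sentence. It skips the stemming step so that it runs without Rstem, which is a simplification compared to type.token() above.

```r
# Six made-up tokens, four of them distinct ("the" and "dog" occur twice)
words <- c("the", "dog", "saw", "the", "other", "dog")

n.types <- length(unique(words))   # 4 different words
n.tokens <- length(words)          # 6 words in total
ttr <- n.types / n.tokens
print(round(ttr, 3))  # 0.667
```

A ratio of 1 would mean that no word is ever repeated.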

srt.df <- data.frame()
for (movie in names(srt.list)) {
  subs <- srt.list[[movie]]
  srt.df <- rbind(srt.df,
  data.frame(movie, swear.ratio = swear.ratio(subs), type.token = type.token(subs))) }

Finally, here is the fun part: Plotting swear ratios:
dotchart(srt.df$swear.ratio, labels = srt.df$movie, col = "blue", pch = 19)
abline(v = srt.df[srt.df$movie == "inglourious basterds", "swear.ratio"], col = "red", lwd = 2)

[Figure: dot chart of swear ratios by movie, with a red vertical line]

"Reservoir Dogs" takes home the "Swear-a-lot Cup" with roughly 3% of all spoken words being bad words. It's kind of a relief that "Finding Nemo", the only real movie for kids in the sample, is indeed the one with the lowest swear word ratio. By the way, all of the hits in "Finding Nemo" are:
- butt (4 times)
- butter (1 time)
- class (3 times)
- passed (3 times)
- assure (1 time)
Here we see a problem with regular expression matching. "class", "passed" and "assure" are also found by a search for "ass", and "butter" is found by a search for "butt". So, the 4 occurrences of "butt" seem to be the only real bad words used in "Finding Nemo".
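
One possible way around such false hits is to anchor the patterns so that they only match whole words. The sketch below uses the hits listed above as input. Keep in mind that swear.ratio() searches word stems, so in practice the patterns would have to be written as stems too; that part is not shown here.

```r
# The hits from "Finding Nemo" listed above
words <- c("butt", "butter", "class", "passed", "assure")

# Substring matching: every word contains "ass" or "butt" somewhere
loose <- grepl("ass|butt", words)
print(words[loose])

# Anchored matching: ^ and $ force the pattern to cover the whole word
strict <- grepl("^(ass|butt)$", words)
print(words[strict])  # "butt"
```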

You might wonder what the red line in the plot indicates. As you can see in the plotting command, the red line marks the swear word ratio of "Inglourious Basterds" (IG). IG is the Tarantino movie with the lowest share of swear words (measured as swear words / total words, which is roughly 1% for IG). So I call the red line the "Tarantino threshold".

There are several movies getting over the Tarantino threshold. For example "Shawshank Redemption" (excuse the extra space in the plot) and also "Fight Club". Together with "Goodfellas" these are the only three movies in our sample getting over the Tarantino threshold which are not directed by Quentin Tarantino.

I will do more stuff with these swear word ratios another time. For now, let's plot type-token-ratios, which can be considered a measure for "lexical diversity" throughout the movie.

No clear patterns for Tarantino's movies this time. Indeed, the movie with the lowest lexical diversity ("Jackie Brown") and the one with the second highest ("Kill Bill") are both directed by Tarantino. There is one problem with this plot: the type-token ratio depends on the length of the movie. The more words are spoken, the more of them are repeats of words that already occurred, so the ratio tends to fall as a movie gets longer. Differences in length could therefore be one reason for the ranking in this plot, "The Godfather 2" included. So, in one of my next posts I'll try to incorporate the length of a movie into the analyses and see what we can get out of this.

But that's it for today... bye and see you soon.






