Every time I download subtitles for a movie or a series episode, I cannot help thinking about all this text that could be analyzed with text mining methods. As my PhD came to an end in April and I started a postdoc in September, I could use some of my looong summer break to work on a new R package, subtools, which aims to provide a toolbox to read, manipulate and write subtitle files in R.
In this post I will briefly present the functions of subtools. For more details you can check the documentation.
The package is available on GitHub.
You can install it with the install_github function from the devtools package.
A subtitles file can be read directly from R with the function read.subtitles. Currently, four subtitle formats are supported: SubRip, (Advanced) SubStation Alpha, MicroDVD and SubViewer. The parsers are probably not optimal but they seem to do the job as expected, at least with valid files. The package also provides some wrappers to import whole directories of series subtitles (see read.subtitles.serie, used in the application below).
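To give an idea of what reading a SubRip file involves, here is a minimal sketch in base R. It is not the parser used by subtools: parse_srt is a hypothetical helper, and it assumes a well-formed file where each block is an ID line, a "start --> end" timecode line and one or more text lines, with blocks separated by blank lines.

```r
# Minimal SubRip (.srt) parser in base R. Illustration only: this is not
# the subtools parser, and parse_srt() is a hypothetical helper name.
parse_srt <- function(lines) {
  # Trim whitespace so that whitespace-only lines count as blank,
  # then group the non-blank lines into blocks separated by blank lines.
  lines <- trimws(lines)
  block_id <- cumsum(lines == "")
  keep <- lines != ""
  blocks <- split(lines[keep], block_id[keep])

  # Convert a "HH:MM:SS,mmm" timecode to seconds.
  to_seconds <- function(tc) {
    tc <- sub(",", ".", tc, fixed = TRUE)
    parts <- as.numeric(strsplit(tc, ":", fixed = TRUE)[[1]])
    sum(parts * c(3600, 60, 1))
  }

  rows <- lapply(blocks, function(b) {
    # Line 1: numeric ID; line 2: "start --> end"; remaining lines: text.
    times <- strsplit(b[2], " --> ", fixed = TRUE)[[1]]
    data.frame(
      id    = as.integer(b[1]),
      start = to_seconds(times[1]),
      end   = to_seconds(times[2]),
      text  = paste(b[-(1:2)], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# A tiny made-up file, given as a character vector (as readLines() would return).
srt <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello there.",
  "",
  "2",
  "00:01:02,250 --> 00:01:04,000",
  "Two lines",
  "of text."
)
subs <- parse_srt(srt)
print(subs)
```

A real parser also has to cope with encodings, byte order marks and malformed blocks, which is exactly the kind of work it is nice to delegate to a package.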
The subtools package stores imported subtitles as simple S3 objects of class Subtitles. They are lists with two main components: subtitles (IDs, timecodes and texts) and optional meta-data. Multiple Subtitles objects can be stored as a list of class MultiSubtitles. The subtools package provides functions to easily manipulate Subtitles and MultiSubtitles objects.
Basically, you can:
- combine subtitles objects
- extract parts of subtitles
- clean subtitles content
- reorganize subtitles as sentences
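To make the object model a bit more concrete, here is a rough sketch in base R of a list-based S3 object with a few of these manipulations (combine, clean, extract). Everything in it is an assumption made for illustration: the class name ToySubtitles, the component names and the helper functions are hypothetical and are not the subtools API.

```r
# Hypothetical stand-in for the objects described above: a list with a
# data.frame of subtitles and optional metadata, tagged with an S3 class.
# The constructor and component names are assumptions, not the subtools API.
new_subs <- function(subtitles, metadata = list()) {
  structure(list(subtitles = subtitles, metadata = metadata),
            class = "ToySubtitles")
}

# S3 print method: dispatched automatically on objects of class ToySubtitles.
print.ToySubtitles <- function(x, ...) {
  cat("ToySubtitles object with", nrow(x$subtitles), "lines\n")
  invisible(x)
}

# Combine two objects by stacking their subtitle tables and renumbering IDs.
combine_subs <- function(x, y) {
  tab <- rbind(x$subtitles, y$subtitles)
  tab$id <- seq_len(nrow(tab))
  new_subs(tab, metadata = x$metadata)
}

# Extract the lines that start within a time window (in seconds).
extract_subs <- function(x, from, to) {
  tab <- x$subtitles
  new_subs(tab[tab$start >= from & tab$start < to, , drop = FALSE], x$metadata)
}

part1 <- new_subs(
  data.frame(id = 1:2, start = c(1, 5), end = c(3, 8),
             text = c("Winter is coming.", "<i>So they say.</i>"),
             stringsAsFactors = FALSE),
  metadata = list(season = 1, episode = 1)
)
part2 <- new_subs(
  data.frame(id = 1L, start = 12, end = 14, text = "Hodor.",
             stringsAsFactors = FALSE)
)

both <- combine_subs(part1, part2)
# Clean the content by stripping HTML-style formatting tags.
both$subtitles$text <- gsub("<[^>]+>", "", both$subtitles$text)
late <- extract_subs(both, from = 4, to = 20)
print(both)
print(late$subtitles$text)
```

Keeping the subtitles in a plain data.frame inside the list is what makes these operations one-liners. The class attribute only adds dispatch for methods such as print.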
Although you can conduct statistical analyses on subtitles objects directly, the subtools package is not designed for text mining. The following functions allow you to convert subtitles to other classes/formats to analyze them. You can:
- extract text content as a simple character string.
- convert subtitles and meta-data to a virtual corpus with tmCorpus if you want to work with the standard text mining framework tm.
- convert subtitles and meta-data to a data.frame with subDataFrame. If you want to use tidy data principles with tidytext and dplyr, you should probably start here.
- finally, it’s also possible to write subtitles objects to a file with write.subtitles, though it is unclear to me if there is any sense in doing that.
Application: Game of Thrones subtitles wordcloud
To illustrate how subtools can be used to get started with a subtitles analysis project, I propose to create a wordcloud showing the most frequent words in the popular TV series Game of Thrones. We will use subtools to import the subtitles, the tm and SnowballC packages to pre-process the data and finally wordcloud to generate the cloud. I will not provide the data I use here, because subtitles files are in a grey zone concerning licensing. But no worries, it’s pretty easy to find subtitles on the Internet.
library(subtools)
library(tm)
library(SnowballC)
library(wordcloud)
Because the subtitles are correctly organized in directories, we can import them with a single command using the function read.subtitles.serie. The nice thing is that this function will try to automatically extract basic meta-data (like series title, season number and episode number) from directory and file names.
a <- read.subtitles.serie(dir = "/subs/Game of Thrones/")
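For the curious, this kind of meta-data extraction can be sketched with a regular expression in base R. The file names and the "SxxEyy" naming convention below are assumptions for the sake of the example, and this is not necessarily how read.subtitles.serie does it internally.

```r
# Hypothetical file names; the "SxxEyy" convention is an assumption here,
# and this is not how subtools itself is implemented.
files <- c(
  "Game of Thrones/Season 1/Game.of.Thrones.S01E01.srt",
  "Game of Thrones/Season 1/Game.of.Thrones.S01E02.srt",
  "Game of Thrones/Season 6/Game.of.Thrones.S06E05.srt"
)

# Capture the season and episode numbers from the file name.
# Each element of m is c(full match, season digits, episode digits).
pattern <- "[Ss]([0-9]+)[Ee]([0-9]+)"
m <- regmatches(basename(files), regexec(pattern, basename(files)))

meta <- data.frame(
  # The series title is taken from the top-level directory name.
  serie   = basename(dirname(dirname(files))),
  season  = as.integer(vapply(m, `[`, character(1), 2)),
  episode = as.integer(vapply(m, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
print(meta)
```

File names that do not follow the convention end up with NA for season and episode, which is a reasonable default for "basic" meta-data.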
a is a MultiSubtitles object with 60 Subtitles elements (episodes). We can convert it directly to a tm corpus using tmCorpus. Note that meta-data are preserved.
c <- tmCorpus(a)
And then, we can prepare the data:
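# Lowercase everything so that "Winter" and "winter" are counted together
c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removePunctuation)
c <- tm_map(c, removeNumbers)
# Drop common English stopwords ("the", "and", ...)
c <- tm_map(c, removeWords, stopwords("english"))
# Collapse the runs of spaces left behind by the previous steps
c <- tm_map(c, stripWhitespace)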
Compute a term-document matrix and aggregate counts by season:
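TDM <- TermDocumentMatrix(c)
TDM <- as.matrix(TDM)
# One season label per episode: assumes 6 seasons of 10 episodes, in order
vec.season <- rep(1:6, each = 10)
# For each term (row), sum the counts over the episodes of each season
TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum)))
colnames(TDM.season) <- paste("S", 1:6)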
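The aggregation step is a bit dense, so here is the same idiom on a toy matrix with made-up counts (3 terms, 4 episodes, 2 seasons). The sketch also shows rowsum, a base R function that gives the same result and may be easier to read.

```r
# Toy term-document matrix: 3 terms (rows) x 4 episodes (columns).
# The counts are made up for illustration.
tdm <- matrix(
  c(2, 0, 1, 3,
    0, 1, 0, 4,
    5, 5, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("wedding", "hodor", "king"), paste0("ep", 1:4))
)
# Episodes 1-2 belong to season 1, episodes 3-4 to season 2.
season <- rep(1:2, each = 2)

# Same apply/tapply idiom as above: sum each term's counts within a season.
by_season <- t(apply(tdm, 1, function(x) tapply(x, season, sum)))

# Equivalent base R alternative: rowsum() sums rows by group, so transpose
# to put episodes in rows, aggregate, and transpose back.
by_season2 <- t(rowsum(t(tdm), season))

colnames(by_season) <- colnames(by_season2) <- paste0("S", 1:2)
print(by_season)
```

Both versions return a terms-by-seasons matrix, which is the shape comparison.cloud expects: one column per group to compare.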
And finally plot the cloud!
set.seed(100)
comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = TRUE)
A few words about this plot. Like every wordcloud, I think it’s a very simple and limited descriptive way to represent the information. However, I like it. The people who have watched the TV show will look at it and say « Oh of course! ». In one hundred words, the cloud does not reveal the storyline of GoT, but for each season I can see one or two critical events popping out (wedding, kill/joffrey, queen/shame, hodor/hold/door).
What I find funny here (perhaps interesting?) is that these very important and emotional moments are supported in the dialogs by the repetition of one or two keywords. I don’t know if this is exclusive to GoT, or if it’s a trick of my mind, or something else. And I will not make any hypothesis, since I’m not a linguist. But this is the first idea which came to my mind and I wanted to write it down.
Now, there are plenty of text-mining ideas, hypotheses and methods that can be tested with movie and series subtitles. So have fun!