[This article was first published on R – Magic with Lots of Numbers, and kindly contributed to R-bloggers].

The great thing about data science is that you can find data in almost every subject and try out new methods independently of the content. I wanted to explore some natural language processing packages and techniques and was looking for interesting textual data beyond Twitter dumps and R mailing list posts. My choice was the complete list of episode summaries from The X-Files collected on Wikipedia. Not because I am a hardcore fan of the series, but because it ran for a long time (9 seasons with over 200 episodes) and features a rather narrow range of topics. Also, its episodes can be divided into two categories: “Monster of the week” episodes, dealing with all kinds of scary things Mulder and Scully encounter in their cases, and “Mythology Arc” episodes, uncovering The Truth behind the worldwide conspiracy behind alien invasions and abductions. This means we can look at how features of the episode summaries relate to those two categories. All in all, this should make for some entertaining data exploration while we wait for the series reboot coming in 2016.

Part 1: Getting the data

Scully: Am I to understand that you want me to debunk the X-Files Project, sir?

Section Chief Blevins: Agent Scully, we trust you’ll make the proper scientific analysis.

The show and all its episodes are well documented on Wikipedia in an almost standard format. There is an overview of all episodes per season in table format, each entry linking to a detailed description page for the episode. I chose Hadley Wickham’s rvest package for the scraping part.

Take a look at the episode tables on Wikipedia: We will be needing tables 2–6 and 8–11 on this page (we leave out table 7 because it refers to the first X-Files movie). While the data is in a nice table format, its contents are not yet tidy (PDF): For one, the most important piece of information – the link to the episode page – is “hidden” within the link of the episode title. Also, whether an episode is a “Mythology” episode is hidden in a “double dagger” footnote behind the episode title.
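To make the problem concrete, here is a minimal base-R sketch of what such a title cell contains and what we want to pull out of it. The cell HTML and the regular expressions are simplified illustrations, not the actual Wikipedia markup:

```r
# A simplified title cell as it might appear in the Wikipedia source
# (illustrative only, not the real markup):
cell <- '<td><a href="/wiki/Pilot_(The_X-Files)">Pilot</a><img src="dagger.png"/></td>'

# The episode link is hidden in the href attribute of the anchor ...
href <- sub('.*href="([^"]*)".*', "\\1", cell)

# ... and the "Mythology" flag is hidden in the footnote image
is_mythology <- grepl("<img", cell, fixed = TRUE)

href          # "/wiki/Pilot_(The_X-Files)"
is_mythology  # TRUE
```

In the real scraper below we let the XML package do the parsing instead of regexes, but the idea is the same: one table cell yields the title, the link, and the mythology marker.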

To extract all three data points from the “Title” column, we will need the following function, getHrefs:

library(XML)
library(httr)
library(rvest)

getHrefs <- function(node, encoding) {
  # The function evaluates every child of an HTML table cell:
  # If it contains an anchor a, it returns the href attribute
  # (preceded by the protocol and the base URL) plus the link text.
  # If not, it just returns the child's text.
  # The function also checks whether a cell contains a footnote image
  # and marks it with a superscript 1 (¹).
  # For X-Files, a double dagger footnotes a "mythology arc" episode.
  # Note that parseURI(url) picks up the variable url from the global
  # environment, so url has to be defined before the function is called.
  #
  # Args:
  #  node: an HTML node
  #  encoding: text encoding
  #
  # Return:
  #  if the node contains a link:
  #   the link text, a "¹" sign if it contains an image, and the link URL
  #
  #  if not:
  #   the node text

  x <- xmlChildren(node)$a
  y <- ifelse(is.null(xmlChildren(node)$img), "", "¹")

  if (!is.null(x)) {
    paste0(xmlValue(x),
           y,
           "|",
           "http://",
           parseURI(url)$server,
           xmlGetAttr(x, "href"))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}

This function gets packed into a getTable function to scrape tables:

getTable <- function(url, tabnum) {
  # function to get the table(s) from the HTML and apply getHrefs
  #
  # Args:
  #  url: the URL of the page
  #  tabnum: the number(s) of the table(s) in the overall HTML code
  #
  # Return:
  #  a table (or a list of tables, if tabnum is a vector)

  doc <- content(GET(url))
  tab <- readHTMLTable(doc, which = tabnum, elFun = getHrefs)

  tab
}

Finally, we apply the getTable function to scrape all the tables we need and collapse them into one data frame.

url <- "http://en.wikipedia.org/wiki/List_of_The_X-Files_episodes"

xf <- getTable(url, c(2:6, 8:11)) # tables 2-6, 8-11 (table 7 relates to the first movie)

# consolidate column names
cnames <- c("NrInSeries","NrInSeason","Title","Director","Writer","AirDate","ProdCode","Viewers")
xf <- lapply(xf, setNames, cnames)

# collapse table list into one data frame
xf <- do.call(rbind, xf)

str(xf)
## 'data.frame':    201 obs. of  8 variables:
##  $ NrInSeries: Factor w/ 201 levels "1","10","11",..: 1 12 18 19 20 21 22 23 24 2 ...
##  $ NrInSeason: Factor w/ 25 levels "1","10","11",..: 1 12 18 19 20 21 22 23 24 2 ...
##  $ Title     : Factor w/ 201 levels "Beyond the Sea|http://en.wikipedia.org/wiki/Beyond_the_Sea_(The_X-Files)",..: 15 5 20 3 22 17 11 12 19 8 ...
##  $ Director  : Factor w/ 62 levels "Daniel Sackheim",..: 16 2 6 1 9 12 7 4 19 11 ...
##  $ Writer    : Factor w/ 66 levels "Alex Gansa & Howard Gordon",..: 5 3 7 2 3 8 1 8 3 9 ...
##  $ AirDate   : Factor w/ 201 levels "April 15, 1994",..: 22 23 24 18 21 19 20 17 15 16 ...
##  $ ProdCode  : Factor w/ 201 levels "1X01","1X02",..: 24 1 2 3 4 5 6 7 8 9 ...
##  $ Viewers   : Factor w/ 171 levels "","10.0","10.4",..: 9 6 6 16 3 15 17 2 4 15 ...

Lots of over-factorized data – time for some cleanup:

library(dplyr)
library(tidyr)

xf2 <- tbl_df(xf)

xf2 <- xf2 %>%
  # separate Title, Director, Writer into text and URL columns
  separate(Title, into = c("Title", "epURL"), sep = "\\|", extra = "merge") %>%
  separate(Director, into = c("Director", "DirURL"), sep = "\\|", extra = "merge") %>%
  separate(Writer, into = c("Writer", "WritURL"), sep = "\\|", extra = "merge") %>%
  # differentiate between Monster and Mythology episodes
  mutate(Mythology = grepl("¹", Title)) %>%
  mutate(Title = sub("¹", "", Title)) %>%
  # rearrange and drop unnecessary columns
  select(ProdCode, Title, epURL, Director, Writer, AirDate, Viewers, Mythology) %>%
  # get rid of factors
  mutate(ProdCode = as.character(ProdCode),
         AirDate = as.character(AirDate),
         Viewers = as.numeric(as.character(Viewers))) %>%
  # add title without spaces and punctuation for use in variable names
  mutate(VarTitle = paste(ProdCode, gsub("[^[:alnum:]]", "", Title, perl = TRUE), sep = "_"))

str(xf2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    201 obs. of  9 variables:
##  $ ProdCode : chr  "1X79" "1X01" "1X02" "1X03" ...
##  $ Title    : chr  "Pilot" "Deep Throat" "Squeeze" "Conduit" ...
##  $ epURL    : chr  "http://en.wikipedia.org/wiki/Pilot_(The_X-Files)" "http://en.wikipedia.org/wiki/Deep_Throat_(The_X-Files_episode)" "http://en.wikipedia.org/wiki/Squeeze_(The_X-Files)" "http://en.wikipedia.org/wiki/Conduit_(The_X-Files)" ...
##  $ Director : chr  "Robert Mandel" "Daniel Sackheim" "Harry Longstreet" "Daniel Sackheim" ...
##  $ Writer   : chr  "Chris Carter" "Chris Carter" "Glen Morgan" "Alex Gansa" ...
##  $ AirDate  : chr  "September 10, 1993" "September 17, 1993" "September 24, 1993" "October 1, 1993" ...
##  $ Viewers  : num  12 11.1 11.1 9.2 10.4 8.8 9.5 10 10.7 8.8 ...
##  $ Mythology: logi  TRUE TRUE FALSE FALSE FALSE FALSE ...
##  $ VarTitle : chr  "1X79_Pilot" "1X01_DeepThroat" "1X02_Squeeze" "1X03_Conduit" ...

Nice – everything is tidied up. We have even got the “Mythology” tag as a Boolean value in its own column, which will come in handy later. But wait… isn’t there something missing? Right – this is only metadata! What we still need are the episode summaries themselves. The summaries are stored as text on separate pages, which can be reached via the links in our epURL column. Check out the summary for the pilot episode: What we need can be found in the paragraphs after the heading “Plot”. Since we are dealing with 201 episodes, we will need another function to get them:

getPlot <- function(url, t = "Plot", h = "h2") {
  # function to extract the plot summary from an episode page
  # xpath code from http://stackoverflow.com/questions/18167279/trying-to-get-all-p-tag-text-between-two-h2-tags
  # t refers to the span ID, h to the heading tag level
  #
  # Args:
  #  url: the URL of a Wikipedia episode page
  #  t: a heading name (defaults to "Plot")
  #  h: the heading formatting (defaults to "h2")
  #
  # Return:
  #  the text of the paragraphs after the specified heading,
  #  until the next heading, as a character string

  xp <- paste0("//p[preceding-sibling::", h, "[1][span='", t, "']]")

  eplot <- read_html(url) %>%
    # get the nodes following the h2 "Plot" until the next h2
    html_nodes(xpath = xp) %>%
    # strip tags
    html_text() %>%
    # concatenate vector of texts into one string
    paste(collapse = "")

  eplot
}

We will apply this function to all of the entries in our epURL column and store the results in a new column content. Depending on your internet connection, this can take a few minutes.

# get plot summaries
xf2$content <- unlist(lapply(xf2$epURL, getPlot))

str(xf2$content)
##  chr [1:201] "In Bellefleur, Oregon, teenager Karen Swenson is seen fleeing through the forest. When she falls, a dark figure approaches, and"| __truncated__ ...

Unfortunately, not all episode pages are formatted the same way. One of them, episode 7ABX04, has its plot summary written under an h3 (not h2) heading. We will need to import it separately.

# get missing plot summary from episode 7ABX04
# (plot is nested in "Synopsis" under h3 tag)
xf2$content[xf2$ProdCode == "7ABX04"] <- getPlot(
"https://en.wikipedia.org/wiki/The_Sixth_Extinction_II:_Amor_Fati",
t = "Plot", h = "h3") 

The episode summaries are now complete, but unfortunately contain a lot of punctuation typos. In particular, many spaces are missing after periods, commas, etc., which will bite us in our tokenization step if we don’t take care of it. So let’s apply a little RegEx magic to our content column:

# replace missing spaces after punctuation in episode summaries
xf2 <- mutate(xf2, content = gsub(pattern = "([.,!?;:])([^ ])", replacement = "\\1 \\2", content))
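As a quick sanity check, here is the same substitution applied to a made-up sentence with the typical missing-space typos (example data, not taken from the actual summaries):

```r
# a sentence with missing spaces after punctuation
bad  <- "Mulder believes.Scully doubts,as usual;the truth is out there."

# insert a space after any punctuation mark that is followed by a non-space
good <- gsub("([.,!?;:])([^ ])", "\\1 \\2", bad)

good
# "Mulder believes. Scully doubts, as usual; the truth is out there."
```

The first capture group keeps the punctuation mark, the second keeps the character that followed it, and the replacement simply re-inserts the missing space between them.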

Looks like we have what we need in our table. Let’s get rid of the intermediate data:

# clean up variables and functions
rm(cnames, url, getHrefs, getPlot, getTable, xf)

In part 2, I will show you how to prepare the content data for natural language processing using the tm package.