[This article was first published on R – Magic with Lots of Numbers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Part 3: Monster vs. Mythology word use

Mulder: We’ve both lost so much… but I believe that what we’re looking for is in the X-Files. I’m more certain than ever that the truth is in there.

Scully: I’ve heard the truth, Mulder. Now what I want are the answers.

This is part 3 of a natural language processing analysis of X-Files episode summaries. In parts 1 and 2 I explained how to get the data from Wikipedia and pre-process it for analysis. Now we can start working with the data.

Our data is stored in a neat document term matrix (DTM), along with meta-data. That means, every row of our data frame contains one of the 201 X-Files episodes. Our first 10 columns contain information about the episode, while the other 6000+ columns contain an alphabetized list of words and how often these words appear in each episode.

Let’s look at what the most prominent words are in the collection of episode summaries. Remember, our corpus cleanup got rid of English stop words (“the”, “and”, etc.) as well as the names of actors (i.e. “Duchovny” won’t appear in the list – but “Mulder” will).

The total count of the words could be misleading though – episode summary length or the way summaries are written skew the ratio. To “normalize” these effects, we will concentrate on if a word appears in an episode or not. For this, we need to transform the word count in our DTM to a Boolean value – TRUE if the count is 1 or greater, FALSE if the count is zero. For this, we define a simple function `numToBool`

``````numToBool <- function(x) {
# function to set counts to boolean values
#
# Args:
#  x: an integer
#
# Return:
#  a Boolean value (>=1 TRUE, <1 FALSE)

ifelse(is.numeric(x), x > 0, NA)
}``````

We apply this function to all of the DTM cells.

``````xfmergedTF <- xfmerged

# change word counts to Boolean values ()
xfmergedTF[ , -c(1:10)] <- as.data.frame(lapply(
xfmergedTF[ , -c(1:10)],FUN = function(x) {
sapply(x, FUN=numToBool)
}
))

str(xfmergedTF[ , -c(1:10)], list.len = 5)``````
``````## 'data.frame':    201 obs. of  6152 variables:
##  \$ aaa            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  \$ aaronson       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  \$ aback          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  \$ abandon        : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  \$ abduct         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##   [list output truncated]``````

Now we get the sums of all DTM columns (i.e. the number of episodes each word appears in) and divide it by the number of episodes to get a relative value.

``````nxf <- nrow(xfmergedTF)

# get total Boolean counts (all episodes)
xfcountAll <- xfmergedTF[ , -c(1:10)] %>%
colSums() %>%
sort(decreasing = TRUE) %>%
`/`(nxf)  #divide by number of episodes``````

Let’s look at the first 50 terms:

``````# show terms with largest counts
``````##    sculli      find    mulder     agent      kill      tell     later
## 0.9900498 0.9353234 0.8905473 0.8457711 0.7810945 0.7810945 0.7462687
##  meanwhil    believ     arriv       man       one      leav      meet
## 0.7313433 0.7213930 0.6616915 0.6517413 0.6368159 0.6169154 0.6119403
##       see      bodi  investig    reveal     howev       two       fbi
## 0.6069652 0.5970149 0.5970149 0.5920398 0.5820896 0.5721393 0.5572139
##    discov    return      back      take     claim     visit       die
## 0.5522388 0.5373134 0.5323383 0.5273632 0.5174129 0.5024876 0.4825871
##       tri      also      head      home       car      case    realiz
## 0.4776119 0.4726368 0.4577114 0.4527363 0.4427861 0.4427861 0.4427861
##      dead     found    appear    attack     insid      show      work
## 0.4378109 0.4378109 0.4328358 0.4278607 0.4278607 0.4278607 0.4228856
##   attempt    murder     escap      name      room     anoth     death
## 0.4179104 0.4179104 0.4129353 0.4129353 0.4129353 0.4079602 0.4079602
## 0.4029851``````

Pretty grim terminology, isn’t it? Unsurprisingly, we can see our two main characters’ names taking top spots in our list, along with procedural terms such as “case”, “fbi” and “investigate”. Most of the terms seem to be verbs, the most prominent one being “kill”. Words like “body”, “die”, “dead” and “murder” point to a rather dangerous world the protagonists live in.

What we want to find out now is how the terminology for the “Monster of the Week” and the “Mythology Arc” differ. For this, we will construct a slopegraph of top terms with strong differences. The first step is to separate the two datasets:

``````# get Boolean counts for Monster and Mythology episodes
xfcountTFMonster <- filter(xfmergedTF, Mythology == FALSE)
xfcountTFMyth <- filter(xfmergedTF, Mythology == TRUE)``````

Our graph will be based on the most prevalent terms from both of theses lists. We therefore need to sort, rank and reconnect them. Since we need to repeat the sorting/ranking for both sets, it makes sense to construct a function. `prepCountList()` will prepare a ranked list our our terms, sorted by frequency.

``````prepCountList <- function(df, n) {
# function to create a ranked list of relative Boolean counts
#
# Args:
#  df: data frame based on the data format in xfmerged
#  n:  count of episodes
#
# Return:
#  df: a data frame of relative occurence (part),
#      term and rank

df <- df[ , -c(1:10)] %>%               # get rid of metadata
colSums() %>%                   # get overall term counts
sort(decreasing = TRUE) %>%     # sort high to low
`/`(n) %>%                      # divide by episode count
as.data.frame()                 # make data frame

df <- mutate(df, term = rownames(df))   # get row names as variable
df\$rank <- 1:nrow(df)                   # add ranks

colnames(df) <- c("part", "term", "rank") # rename col names

df
}``````

Now we’ll apply the function to the two data frames:

``````# get number of episodes
nmyth = nrow(xfcountTFMyth)
nmon = nrow(xfcountTFMonster)

nmyth``````
``## [1] 60``
``nmon``
``## [1] 141``
``````# create ranked lists for Monster and Mythology episodes
xfcountTFMonster <- prepCountList(xfcountTFMonster, nmon)
xfcountTFMyth <- prepCountList(xfcountTFMyth, nmyth)

``````##        part   term rank
## 1 0.9858156 sculli    1
## 2 0.9432624   find    2
## 3 0.8723404  agent    3
## 4 0.8439716 mulder    4
## 5 0.7801418   kill    5
## 6 0.7446809  later    6``````
``head(xfcountTFMyth)``
``````##        part     term rank
## 1 1.0000000   mulder    1
## 2 1.0000000   sculli    2
## 3 0.9166667     find    3
## 4 0.8666667 meanwhil    4
## 5 0.8666667     tell    5
## 6 0.7833333    agent    6``````

We join the two sets based on the term, so we’ll have both ranked lists in one data frame. We take the top 30 terms from both lists and throw out all terms where the difference in episode occurence is 10% or less. What remains is a list of the terms with the biggest difference between “Monster” and “Mythology” episodes.

``````# join the two ranked lists
xfcountmerged <- full_join(xfcountTFMyth, xfcountTFMonster, by = "term")

str(xfcountmerged)``````
``````## 'data.frame':    6152 obs. of  5 variables:
##  \$ part.x: num  1 1 0.917 0.867 0.867 ...
##  \$ term  : chr  "mulder" "sculli" "find" "meanwhil" ...
##  \$ rank.x: int  1 2 3 4 5 6 7 8 9 10 ...
##  \$ part.y: num  0.844 0.986 0.943 0.674 0.745 ...
##  \$ rank.y: int  4 1 2 9 7 3 5 14 184 8 ...``````
``````# create data for slopegraph (top 30 terms for both episode types)
# exclude difference below 10%
xfcountmerged <- xfcountmerged %>%
filter(rank.x <= 35 | rank.y <= 35) %>%
select(myth = part.x, monster = part.y, term) %>%
filter(abs(myth-monster) > .1) %>%
mutate(myth = round(myth*100,0), monster = round(monster*100,0))``````

To construct the slopegraph, we’ll use a slightly adapted version of Nathan Yau’s plotting function. For space reasons, the code is sourced from a separate file in my GitHub repo.

``````# create slopegraph
source("slopegraph.r")

with(xfcountmerged, slopegraph(myth, monster, term))``````

``## [1] "Plot generated: "``

The “Mythology” episodes have more aliens and abductions in them; Agent Mulder is mentioned in all of these episodes, and Agent Skinner takes an important role in them too. Protagonists seem to move around more and meet more people (“visit”, “arrive”, “leave”, “return”), and more information is exchanged between people (“tell”, “reveal”, “inform”).

“Monster” episodes on the other hand feature more bodies, attacks< and murders, and the situations are described as “cases”. Names and homes are more important.

The differences in the other terms (“meanwhile”, “tri/try”) are difficult to interpret and probably have to do with narrative structures in the episode summaries.

Finally, some cleanup:

``````# cleanup
rm(xfcountmerged, xfcountTFMonster, xfcountTFMyth, xfmergedTF,
nmon, nmyth, nxf, xfcountAll, numToBool, prepCountList, slopegraph)``````

This concludes part 3 of our analysis. In part 4, we will look at characters and their occurence over seasons. Stay tuned!