The Truth Is In There – an X-Files episode analysis with R (Part 3)

[This article was first published on R – Magic with Lots of Numbers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Cornfield_Texas_Mulder_Scully_Fight_the_Future

Part 3: Monster vs. Mythology word use

Mulder: We’ve both lost so much… but I believe that what we’re looking for is in the X-Files. I’m more certain than ever that the truth is in there.

Scully: I’ve heard the truth, Mulder. Now what I want are the answers.

This is part 3 of a natural language processing analysis of X-Files episode summaries. In parts 1 and 2 I explained how to get the data from Wikipedia and pre-process it for analysis. Now we can start working with the data.

Our data is stored in a neat document term matrix (DTM), along with meta-data. That means, every row of our data frame contains one of the 201 X-Files episodes. Our first 10 columns contain information about the episode, while the other 6000+ columns contain an alphabetized list of words and how often these words appear in each episode.

Let’s look at what the most prominent words are in the collection of episode summaries. Remember, our corpus cleanup got rid of English stop words (“the”, “and”, etc.) as well as the names of actors (i.e. “Duchovny” won’t appear in the list – but “Mulder” will).

The total count of the words could be misleading though – episode summary length or the way summaries are written skew the ratio. To “normalize” these effects, we will concentrate on if a word appears in an episode or not. For this, we need to transform the word count in our DTM to a Boolean value – TRUE if the count is 1 or greater, FALSE if the count is zero. For this, we define a simple function numToBool

numToBool <- function(x) {
        # function to set counts to boolean values
        #
        # Args:
        #  x: an integer
        #
        # Return:
        #  a Boolean value (>=1 TRUE, <1 FALSE) 
        
        ifelse(is.numeric(x), x > 0, NA)
}

We apply this function to all of the DTM cells.

xfmergedTF <- xfmerged

# change word counts to Boolean values ()
xfmergedTF[ , -c(1:10)] <- as.data.frame(lapply(
        xfmergedTF[ , -c(1:10)],FUN = function(x) {
                sapply(x, FUN=numToBool)
                }
        ))

str(xfmergedTF[ , -c(1:10)], list.len = 5)
## 'data.frame':    201 obs. of  6152 variables:
##  $ aaa            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ aaronson       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ aback          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ abandon        : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ abduct         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##   [list output truncated]

Now we get the sums of all DTM columns (i.e. the number of episodes each word appears in) and divide it by the number of episodes to get a relative value.

nxf <- nrow(xfmergedTF)

# get total Boolean counts (all episodes)
xfcountAll <- xfmergedTF[ , -c(1:10)] %>% 
        colSums() %>% 
        sort(decreasing = TRUE) %>% 
        `/`(nxf)  #divide by number of episodes

Let’s look at the first 50 terms:

# show terms with largest counts
head(xfcountAll, 50)
##    sculli      find    mulder     agent      kill      tell     later 
## 0.9900498 0.9353234 0.8905473 0.8457711 0.7810945 0.7810945 0.7462687 
##  meanwhil    believ     arriv       man       one      leav      meet 
## 0.7313433 0.7213930 0.6616915 0.6517413 0.6368159 0.6169154 0.6119403 
##       see      bodi  investig    reveal     howev       two       fbi 
## 0.6069652 0.5970149 0.5970149 0.5920398 0.5820896 0.5721393 0.5572139 
##    discov    return      back      take     claim     visit       die 
## 0.5522388 0.5373134 0.5323383 0.5273632 0.5174129 0.5024876 0.4825871 
##       tri      also      head      home       car      case    realiz 
## 0.4776119 0.4726368 0.4577114 0.4527363 0.4427861 0.4427861 0.4427861 
##      dead     found    appear    attack     insid      show      work 
## 0.4378109 0.4378109 0.4328358 0.4278607 0.4278607 0.4278607 0.4228856 
##   attempt    murder     escap      name      room     anoth     death 
## 0.4179104 0.4179104 0.4129353 0.4129353 0.4129353 0.4079602 0.4079602 
##       ask 
## 0.4029851

Pretty grim terminology, isn’t it? Unsurprisingly, we can see our two main characters’ names taking top spots in our list, along with procedural terms such as “case”, “fbi” and “investigate”. Most of the terms seem to be verbs, the most prominent one being “kill”. Words like “body”, “die”, “dead” and “murder” point to a rather dangerous world the protagonists live in.

What we want to find out now is how the terminology for the “Monster of the Week” and the “Mythology Arc” differ. For this, we will construct a slopegraph of top terms with strong differences. The first step is to separate the two datasets:

# get Boolean counts for Monster and Mythology episodes
xfcountTFMonster <- filter(xfmergedTF, Mythology == FALSE)
xfcountTFMyth <- filter(xfmergedTF, Mythology == TRUE)

Our graph will be based on the most prevalent terms from both of theses lists. We therefore need to sort, rank and reconnect them. Since we need to repeat the sorting/ranking for both sets, it makes sense to construct a function. prepCountList() will prepare a ranked list our our terms, sorted by frequency.

prepCountList <- function(df, n) {
        # function to create a ranked list of relative Boolean counts
        #
        # Args:
        #  df: data frame based on the data format in xfmerged
        #  n:  count of episodes
        #
        # Return:
        #  df: a data frame of relative occurence (part),
        #      term and rank
        
        df <- df[ , -c(1:10)] %>%               # get rid of metadata
                colSums() %>%                   # get overall term counts
                sort(decreasing = TRUE) %>%     # sort high to low
                `/`(n) %>%                      # divide by episode count
                as.data.frame()                 # make data frame
        
        df <- mutate(df, term = rownames(df))   # get row names as variable
        df$rank <- 1:nrow(df)                   # add ranks
        
        colnames(df) <- c("part", "term", "rank") # rename col names
        
        df
}

Now we’ll apply the function to the two data frames:

# get number of episodes
nmyth = nrow(xfcountTFMyth)
nmon = nrow(xfcountTFMonster)

nmyth
## [1] 60
nmon
## [1] 141
# create ranked lists for Monster and Mythology episodes
xfcountTFMonster <- prepCountList(xfcountTFMonster, nmon)
xfcountTFMyth <- prepCountList(xfcountTFMyth, nmyth)

head(xfcountTFMonster)
##        part   term rank
## 1 0.9858156 sculli    1
## 2 0.9432624   find    2
## 3 0.8723404  agent    3
## 4 0.8439716 mulder    4
## 5 0.7801418   kill    5
## 6 0.7446809  later    6
head(xfcountTFMyth)
##        part     term rank
## 1 1.0000000   mulder    1
## 2 1.0000000   sculli    2
## 3 0.9166667     find    3
## 4 0.8666667 meanwhil    4
## 5 0.8666667     tell    5
## 6 0.7833333    agent    6

We join the two sets based on the term, so we’ll have both ranked lists in one data frame. We take the top 30 terms from both lists and throw out all terms where the difference in episode occurence is 10% or less. What remains is a list of the terms with the biggest difference between “Monster” and “Mythology” episodes.

# join the two ranked lists
xfcountmerged <- full_join(xfcountTFMyth, xfcountTFMonster, by = "term")

str(xfcountmerged)
## 'data.frame':    6152 obs. of  5 variables:
##  $ part.x: num  1 1 0.917 0.867 0.867 ...
##  $ term  : chr  "mulder" "sculli" "find" "meanwhil" ...
##  $ rank.x: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ part.y: num  0.844 0.986 0.943 0.674 0.745 ...
##  $ rank.y: int  4 1 2 9 7 3 5 14 184 8 ...
# create data for slopegraph (top 30 terms for both episode types)
# exclude difference below 10%
xfcountmerged <- xfcountmerged %>% 
        filter(rank.x <= 35 | rank.y <= 35) %>% 
        select(myth = part.x, monster = part.y, term) %>%
        filter(abs(myth-monster) > .1) %>% 
        mutate(myth = round(myth*100,0), monster = round(monster*100,0))

To construct the slopegraph, we’ll use a slightly adapted version of Nathan Yau’s plotting function. For space reasons, the code is sourced from a separate file in my GitHub repo.

# create slopegraph
source("slopegraph.r")

with(xfcountmerged, slopegraph(myth, monster, term))

unnamed-chunk-29-1

## [1] "Plot generated: "

The “Mythology” episodes have more aliens and abductions in them; Agent Mulder is mentioned in all of these episodes, and Agent Skinner takes an important role in them too. Protagonists seem to move around more and meet more people (“visit”, “arrive”, “leave”, “return”), and more information is exchanged between people (“tell”, “reveal”, “inform”).

“Monster” episodes on the other hand feature more bodies, attacks< and murders, and the situations are described as “cases”. Names and homes are more important.

The differences in the other terms (“meanwhile”, “tri/try”) are difficult to interpret and probably have to do with narrative structures in the episode summaries.

Finally, some cleanup:

# cleanup
rm(xfcountmerged, xfcountTFMonster, xfcountTFMyth, xfmergedTF,
   nmon, nmyth, nxf, xfcountAll, numToBool, prepCountList, slopegraph)

This concludes part 3 of our analysis. In part 4, we will look at characters and their occurence over seasons. Stay tuned!

To leave a comment for the author, please follow the link and comment on their blog: R – Magic with Lots of Numbers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)