Arthur blinked, Ford shrugs, but Zaphod leapt; text as graph

[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Can we make the computer say something about characters in a book? In this piece I will search for the names of characters and the words around those names in books. What can we learn about a character from text analysis? Of course it’s also just another excuse for me to read the Hitchhikers series! I will break down the text into chunks of two words, extract the word pairs that matter and visualize the results. Come an play along with your favorite book.

Inspiration

The Hitchhiker’s Guide to the Galaxy (HHGTTG for short) is a series of scifi novels1 about the last surviving man, Arthur Dent, after the destruction of earth. And I love it.

Arthur is an anti-hero, he’s not brave, he’s not smart, he sort of stumbles through life and things happen to him, he has almost no agency. That is something you realize while you are reading the book, but can we find that in the text as well? What can we find out about the actions of people in a book by looking at words around subjects in a book?

I’m inspired by the amazing text work of Julia Silge. For instance on “She giggles, he gallops”2 (link at the bottom of page, really check it out!) Julia and her co-writers scanned through 2000 movie manuscripts and extracted words around gender pronouns. They show that women will more likely snuggle, giggle, squeal, and sob, relative to men. Conversely, men are more likely to strap, gallop, shoot, howl, and kill. See more in she giggles, he gallops.

So what can we do with these books? I will walk through all the main characters (and Marvin) and at the end I also look at gender pronouns (he vs she). There is already a problem: there is only one woman in the cast. This could be a problem.

Many of the steps I take here are copied from the tidytextmining book.

Analyses

Approach

  • I used the pubcrawl package to load in every book.
  • use the packages tidygraph, tidyverse, tidytext, ggraph
  • create bigrams of text, (go through and extract every two words)
  • take more frequent subset of that
  • select characters or actions and display them.

leading to:

Arthur Dent is confused (words connected to Arthur)

Arthur Dent is confused (words connected to Arthur)

Data

The books are of course copyrighted so I cannot share the files with you, that would be piracy. books loaded, cannot do it per chapter, because the chapters are messed up in some of the books. I do it per book. that’s life.

packages

  • I use pubcrawl to read in the files see previous post
  • tidytext for the tokenization, part of speech tagging
  • tidygraph not that much, but for the conversion to graph (could do it with igraph too)
  • ggraph to plot the images
  • tidyverse (mostly dplyr, and ggplot, a bit of stringr ) for everything else

Loading the data

HHGTTG <- readr::read_rds("link_to_your_book")

Loading the packages

# I know, a lot! 
library(tidytext)
library(tidygraph)
library(ggraph)
library(tidyverse)
library(scales)

create bigrams of text, (go through and extract every two words)

unnested_hhgttg <- 
    HHGTTG %>% 
    unnest_tokens(output = "bigram", content, token = "ngrams", n=2) %>% 
    group_by(book) %>% 
    count(bigram,sort = TRUE)
unnested_hhgttg %>% head()
## # A tibble: 6 x 3
## # Groups:   book [4]
##   book                                    bigram     n
##   <chr>                                   <chr>  <int>
## 1 5 Mostly Harmless                       of the   421
## 2 3 Life, the Universe and Everything     of the   385
## 3 2 Restaurant at the End of the Universe of the   377
## 4 1 Hitchhiker's Guide to the Galaxy      of the   301
## 5 5 Mostly Harmless                       it was   296
## 6 5 Mostly Harmless                       in the   250

Oh no, too many stop words! We have to remove them with the stop words data set from tidytext.

unnested_hhgttg <- 
    unnested_hhgttg %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word)
unnested_hhgttg %>% head()
## # A tibble: 6 x 4
## # Groups:   book [4]
##   book                                    word1     word2          n
##   <chr>                                   <chr>     <chr>      <int>
## 1 2 Restaurant at the End of the Universe ford      prefect       53
## 2 2 Restaurant at the End of the Universe zaphod    beeblebrox    42
## 3 3 Life, the Universe and Everything     arthur    dent          33
## 4 5 Mostly Harmless                       perfectly normal        33
## 5 1 Hitchhiker's Guide to the Galaxy      ford      prefect       28
## 6 2 Restaurant at the End of the Universe arthur    dent          26

Hmm these people are called by their complete name a lot.

Take only the most frequently used bigrams

I am no longer really interested in the book I found words in.

person_actions <- 
    unnested_hhgttg %>% 
    filter(n >1) %>% 
    filter(word2 != "er") %>%  # don't know what it is, but I don't want it
    ungroup() %>% 
    filter(word1 %in% c("ford", "zaphod", "arthur", "tricia", "trillian")) %>% 
    group_by(word1, word2) %>% 
    summarise(n = sum(n))
head(person_actions,4)
## # A tibble: 4 x 3
## # Groups:   word1 [1]
##   word1  word2        n
##   <chr>  <chr>    <int>
## 1 arthur blinked      4
## 2 arthur couldn’t     2
## 3 arthur dent        93
## 4 arthur dent's       4

Create a cool plotting theme

I will plot several of the same sort of network plots and I don’t want to set the theme every time. You could use: theme_set() but I created a function here. the function takes data, and you can set a title and subtitle, no need for NSE here.

I use a black background, because, you know, space! it’s big! And empty, and empty is black, because absence of light is dark… Or something. And there is a lot of black because the Guide says:

Space is big. You just won’t believe how vastly, hugely, mind- bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.

So.. black background and other colors, I’ve tried out several color schemes from colorbrewer and this sort of looked nice.

create_hhgttg_plot <- function(data, title = "what an amazing title!", subtitle = "donkeyballs"){
  # setting some colors
  c3_purple <- "#7570b3"
  c3_orange <- "#d95f02"
  # making a igraph object  
  data %>% 
        as_tbl_graph() %>%
        ggraph(layout = "auto") + 
        geom_edge_link(aes(edge_width = sqrt(n)),colour = c3_purple, show.legend = FALSE) +
        geom_node_point(color = "white", size = 5,alpha = 4/5 ) +
        geom_node_label(aes(label = name),color = c3_orange, repel = TRUE,size = 5) +
        labs(
            title = title,
            subtitle = subtitle,
            caption = "Roel M. Hogervorst: https://blog.rmhogervorst.nl\nExtracted from the first 5 books from the HHGTTG-trilogy"
        )+
        theme_void()+
        theme(
            plot.background = element_rect(fill =  "black"),
            text = element_text(colour = c3_orange),
            plot.margin = unit(c(.3,.5,.1,.5), "cm")
        )
}
set.seed(42) # if I redo everything looks the same

A plot per character

The main character Arthur Dent:

person_actions %>% 
    filter(word1 == "arthur") %>%
    filter(word2 != "dent") %>% 
    filter(n >2) %>% 
    create_hhgttg_plot("Arthur Dent is perpetually confused", "He blinks, stares, glances and looks")
## Using `nicely` as default layout

What about the Hitchhikers’ guide writer Ford Prefect?

person_actions %>% 
    filter(word1 == "ford") %>%
    filter(!str_detect(word2, "prefect")) %>% 
    filter(n>2) %>% 
    create_hhgttg_plot("Ford is a man of action and disinterest", "He moves, he's hurled, he looks, frowns and shrugs")
## Using `nicely` as default layout

Than there is the the President of the Galaxy: Zaphod Beeblebrox.

person_actions %>% 
    filter(word1 == "zaphod") %>%
    filter(!str_detect(word2, "beeblebrox")) %>% 
    create_hhgttg_plot("Zaphod is more emotional", "He stares, is angry, bitterly and shrugs")
## Using `nicely` as default layout

Tricia McMillan, the smartest person in the book3 (after Marvin, the android with a brain the size of a galaxy; and should we count him as a person?), mostly just stands and watches what the others are doing. She has two names, and the two names do not appear to have overlapping words. Tricia is most commonly referred to simply as “Trillian”, a modification of her birth name, which she adopted because it sounded more “space-like”.

person_actions %>% 
    filter(word1 %in% c("tricia", "trillian")) %>% 
    filter(word2 != "mcmillan") %>% 
    create_hhgttg_plot("Tricia sighes", "quickly and quietly")
## Using `nicely` as default layout

What’s up with Marvin?

unnested_hhgttg %>% 
  filter(word1 == "marvin") %>%   # oops forgat to include him (it? not sure...)
  ungroup() %>% 
  inner_join(  parts_of_speech %>% 
                 filter(str_detect(pos, "Verb")) %>% 
                 group_by(word) %>%  
                 distinct(pos),
               by = c("word2"="word")) %>% 
  group_by(word1,word2) %>% 
  summarise(n = sum(n)) %>% 
  create_hhgttg_plot("Marvin mostly moves, but in a depressing way", "Marvin: 'Life, don't talk to me about life' ")
## Using `nicely` as default layout

looking at it from the word perspective

Using the parts of speech data frame in tidytext I can select all verbs and select verbs I’m interested in.

action_words <- 
    unnested_hhgttg %>% 
    ungroup() %>% 
    group_by(word1, word2) %>% 
    summarise(n = sum(n)) %>% 
    inner_join(parts_of_speech %>% filter(str_detect(pos, "Verb") ),by = c("word2"="word"))


# action_words %>% pull(word2) %>% unique() # choose some words
action_words %>% filter(n>3) %>% pull(word2) %>% unique()
##   [1] "bob"         "express"     "dent"        "ford"        "found"      
##   [6] "glanced"     "lay"         "nodded"      "noticed"     "realized"   
##  [11] "sat"         "shook"       "stared"      "heart"       "lot"        
##  [16] "fish"        "grill"       "machine"     "house"       "beep"       
##  [21] "pad"         "people"      "guide"       "hole"        "ship"       
##  [26] "whore"       "park"        "sight"       "mission"     "bit"        
##  [31] "click"       "hum"         "round"       "bank"        "program"    
##  [36] "panel"       "card"        "ground"      "leather"     "green"      
##  [41] "space"       "notice"      "worry"       "slid"        "gown"       
##  [46] "cloud"       "net"         "cake"        "floor"       "message"    
##  [51] "bowl"        "deck"        "swish"       "continued"   "leapt"      
##  [56] "moved"       "shrugged"    "mouse"       "scout"       "fire"       
##  [61] "gargle"      "howl"        "silver"      "bail"        "blur"       
##  [66] "shore"       "swung"       "sandwich"    "wall"        "laugh"      
##  [71] "yellow"      "bypass"      "cream"       "drive"       "field"      
##  [76] "cruise"      "spirit"      "mail"        "war"         "forward"    
##  [81] "foot"        "hand"        "head"        "supervising" "form"       
##  [86] "support"     "cricket"     "level"       "block"       "dot"        
##  [91] "air"         "mash"        "matter"      "sky"         "oil"        
##  [96] "call"        "earth"       "cup"         "voice"       "skin"       
## [101] "ball"        "plain"       "ring"        "class"       "band"       
## [106] "proof"       "bag"         "cough"       "travel"      "lined"      
## [111] "stomp"       "throb"       "occurred"    "realised"    "bomb"       
## [116] "flop"        "light"       "tape"        "lipped"      "journey"    
## [121] "answer"      "question"    "arm"         "minor"       "screen"     
## [126] "captain"     "guard"       "weave"       "glass"       "pillar"     
## [131] "gun"
action_words %>% 
    filter(word2 == "realised") %>% 
    create_hhgttg_plot("Tricia, Arthur and Ford realise", 
                       "")
## Using `nicely` as default layout

action_words %>% 
    filter(word2 == "noticed") %>% 
    create_hhgttg_plot("Arthur and Ford notice", "Arthur noticed a lot")
## Using `nicely` as default layout

action_words %>% 
    filter(word2 == "leapt") %>% 
    create_hhgttg_plot("Not only Zaphod and Arthur leap", "civilization, beer, gold, rocket and flames leap too")
## Using `nicely` as default layout

action_words %>% 
    filter(word2 == "shrugged") %>% 
    create_hhgttg_plot("Who shrugs?", "Mostly Ford")
## Using `nicely` as default layout

action_words %>% 
    filter(word2 == "moved") %>% 
    filter(n>1) %>% 
    create_hhgttg_plot("Who moved?", subtitle = "only the male main characters")
## Using `nicely` as default layout

He vs She

Let’s try again with the basics, he vs she. She is used a lot less. he is used 1413 times, and she 373 times. And disproportionately in the last book ‘mostly harmless’. I recall that there is a new character in the final book, Random, the daughter of Arthur.

she_he <- HHGTTG %>% 
    unnest_tokens(output = "bigram", content, token = "ngrams", n=2) %>% 
    group_by(book) %>% 
    count(bigram,sort = TRUE) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(word1 %in% c("he", "she") )

she_he %>% 
  filter(word1 == "she") %>% 
  count(book)
## # A tibble: 5 x 2
## # Groups:   book [5]
##   book                                       nn
##   <chr>                                   <int>
## 1 1 Hitchhiker's Guide to the Galaxy         27
## 2 2 Restaurant at the End of the Universe    16
## 3 3 Life, the Universe and Everything        43
## 4 4 So Long, and Thanks for All the Fish     97
## 5 5 Mostly Harmless                         190

What verbs are more he-like, and she-like?

I threw away the he and she when I did a anti_join of stop words, so let’s start over again.

I’m following parts of the procedures from here: https://www.tidytextmining.com/twitter.html#word-frequencies-1

To account for the difference in numbers I calculate the frequency per he and per she and divide those 2.

logratio_she_he <- 
  she_he %>% 
  group_by(word1,word2) %>% 
  summarise(n = sum(n)) %>%  #stopped caring about the book
  group_by(word1) %>% 
  mutate(
    total_count = sum(n),
    freq_type = n/total_count
  ) %>% 
  group_by(word2) %>% 
  mutate(total_count_w = sum(n)) %>% 
  filter(total_count_w >3) %>% 
  inner_join(  parts_of_speech %>% 
                 filter(str_detect(pos, "Verb")) %>% 
                 group_by(word) %>%  
                 distinct(word),
               by = c("word2"="word")) %>% # only keep verbs
  select(word1, word2, freq_type, total_count_w) %>% 
  spread(word1, freq_type) %>% 
  filter(!is.na(he) & !is.na(she)) %>% 
  ungroup() %>% 
  mutate(
    likelihood = he/she,
    logratio = log(he / she),
    label = reorder(word2, logratio)
    )
logratio_she_he %>%  
  filter(total_count_w > 25) %>% # select only more occuring words
  ggplot(aes(label, logratio, size = total_count_w)) +
  geom_point(aes(color = logratio < 0),show.legend = FALSE)+
  geom_hline(yintercept = 0, color = "#d95f02")+
  coord_flip()+
  labs(
    title = "Log odds ratio he/she ",
    subtitle = "Left from the line is more typical for he, right for she",
    caption = "Roel M. Hogervorst: https://blog.rmhogervorst.nl\nExtracted from the first 5 books from the HHGTTG-trilogy",
    x = "", y = "log ratio (left [green] he, right [yellow] she )"
  )+
    scale_color_manual(name = "", values = c("yellow", "green") )+ # define ugly colors. 
  theme_dark() +
  theme(
            plot.background = element_rect(fill =  "black"),
            panel.background = element_rect(fill =  "black"),
            text = element_text(colour = "#d95f02"),
            plot.margin = unit(c(.3,.5,.1,.5), "cm"),
            axis.text = element_text(colour = "#d95f02")
        )

Final thoughts

I think it’s a nice way to visualize information from a book. Some characters are more active and others more emotional.

References

State of the machine

At the moment of creation (when I knitted this document ) this was the state of my machine: click here to expand

sessioninfo::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 16.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US                       
##  collate  en_US.UTF-8                 
##  tz       Europe/Amsterdam            
##  date     2018-07-24                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version    date       source                           
##  assertthat    0.2.0      2017-04-11 CRAN (R 3.5.0)                   
##  backports     1.1.2      2017-12-13 CRAN (R 3.5.0)                   
##  bindr         0.1.1      2018-03-13 CRAN (R 3.5.0)                   
##  bindrcpp    * 0.2.2      2018-03-29 CRAN (R 3.5.0)                   
##  blogdown      0.8        2018-07-15 CRAN (R 3.5.1)                   
##  bookdown      0.7        2018-02-18 CRAN (R 3.5.0)                   
##  broom         0.4.5      2018-07-03 CRAN (R 3.5.1)                   
##  cellranger    1.1.0      2016-07-27 CRAN (R 3.5.0)                   
##  cli           1.0.0      2017-11-05 CRAN (R 3.5.0)                   
##  clisymbols    1.2.0      2017-05-21 CRAN (R 3.5.0)                   
##  colorspace    1.3-2      2016-12-14 CRAN (R 3.5.0)                   
##  crayon        1.3.4      2017-09-16 CRAN (R 3.5.0)                   
##  digest        0.6.15     2018-01-28 CRAN (R 3.5.0)                   
##  dplyr       * 0.7.6      2018-06-29 CRAN (R 3.5.1)                   
##  evaluate      0.10.1     2017-06-24 CRAN (R 3.5.0)                   
##  fansi         0.2.3      2018-05-06 CRAN (R 3.5.1)                   
##  farver        1.0        2018-07-16 Github (thomasp85/farver@d29b48c)
##  forcats     * 0.3.0      2018-02-19 CRAN (R 3.5.0)                   
##  foreign       0.8-70     2018-04-23 CRAN (R 3.5.0)                   
##  ggforce       0.1.3      2018-07-07 CRAN (R 3.5.1)                   
##  ggplot2     * 3.0.0      2018-07-03 cran (@3.0.0)                    
##  ggraph      * 1.0.2      2018-07-07 CRAN (R 3.5.1)                   
##  ggrepel       0.8.0      2018-05-09 CRAN (R 3.5.0)                   
##  glue          1.3.0      2018-07-18 Github (tidyverse/glue@66de125)  
##  gridExtra     2.3        2017-09-09 CRAN (R 3.5.0)                   
##  gtable        0.2.0      2016-02-26 CRAN (R 3.5.0)                   
##  haven         1.1.2      2018-06-27 CRAN (R 3.5.1)                   
##  hms           0.4.2      2018-03-10 CRAN (R 3.5.0)                   
##  htmltools     0.3.6      2017-04-28 CRAN (R 3.5.0)                   
##  httr          1.3.1      2017-08-20 CRAN (R 3.5.0)                   
##  igraph        1.2.1      2018-03-10 CRAN (R 3.5.0)                   
##  janeaustenr   0.1.5      2017-06-10 CRAN (R 3.5.0)                   
##  jsonlite      1.5        2017-06-01 CRAN (R 3.5.0)                   
##  knitr         1.20       2018-02-20 CRAN (R 3.5.0)                   
##  labeling      0.3        2014-08-23 CRAN (R 3.5.0)                   
##  lattice       0.20-35    2017-03-25 CRAN (R 3.5.0)                   
##  lazyeval      0.2.1      2017-10-29 CRAN (R 3.5.0)                   
##  lubridate     1.7.4      2018-04-11 CRAN (R 3.5.0)                   
##  magrittr      1.5        2014-11-22 CRAN (R 3.5.0)                   
##  MASS          7.3-50     2018-04-30 CRAN (R 3.5.0)                   
##  Matrix        1.2-14     2018-04-09 CRAN (R 3.5.0)                   
##  mnormt        1.5-5      2016-10-15 CRAN (R 3.5.0)                   
##  modelr        0.1.2      2018-05-11 CRAN (R 3.5.0)                   
##  munsell       0.5.0      2018-06-12 CRAN (R 3.5.0)                   
##  nlme          3.1-137    2018-04-07 CRAN (R 3.5.0)                   
##  pillar        1.3.0      2018-07-14 CRAN (R 3.5.1)                   
##  pkgconfig     2.0.1      2017-03-21 CRAN (R 3.5.0)                   
##  plyr          1.8.4      2016-06-08 CRAN (R 3.5.0)                   
##  psych         1.8.4      2018-05-06 CRAN (R 3.5.0)                   
##  purrr       * 0.2.5      2018-05-29 CRAN (R 3.5.0)                   
##  R6            2.2.2      2017-06-17 CRAN (R 3.5.0)                   
##  Rcpp          0.12.18    2018-07-23 cran (@0.12.18)                  
##  readr       * 1.1.1      2017-05-16 CRAN (R 3.5.0)                   
##  readxl        1.1.0      2018-04-20 CRAN (R 3.5.0)                   
##  reshape2      1.4.3      2017-12-11 CRAN (R 3.5.0)                   
##  rlang         0.2.1      2018-05-30 CRAN (R 3.5.0)                   
##  rmarkdown     1.10       2018-06-11 CRAN (R 3.5.0)                   
##  rprojroot     1.3-2      2018-01-03 CRAN (R 3.5.0)                   
##  rstudioapi    0.7        2017-09-07 CRAN (R 3.5.0)                   
##  rvest         0.3.2      2016-06-17 CRAN (R 3.5.0)                   
##  scales      * 0.5.0      2017-08-24 CRAN (R 3.5.0)                   
##  sessioninfo   1.0.0      2017-06-21 CRAN (R 3.5.1)                   
##  SnowballC     0.5.1      2014-08-09 CRAN (R 3.5.0)                   
##  stringi       1.2.4      2018-07-20 cran (@1.2.4)                    
##  stringr     * 1.3.1      2018-05-10 CRAN (R 3.5.0)                   
##  tibble      * 1.4.2      2018-01-22 CRAN (R 3.5.0)                   
##  tidygraph   * 1.1.0      2018-02-10 CRAN (R 3.5.0)                   
##  tidyr       * 0.8.1      2018-05-18 CRAN (R 3.5.0)                   
##  tidyselect    0.2.4      2018-02-26 CRAN (R 3.5.0)                   
##  tidytext    * 0.1.9      2018-05-29 CRAN (R 3.5.0)                   
##  tidyverse   * 1.2.1      2017-11-14 CRAN (R 3.5.0)                   
##  tokenizers    0.2.1      2018-03-29 CRAN (R 3.5.0)                   
##  tweenr        0.1.5.9999 2018-07-16 Github (thomasp85/tweenr@4d4f8d1)
##  units         0.6-0      2018-06-09 CRAN (R 3.5.0)                   
##  utf8          1.1.4      2018-05-24 CRAN (R 3.5.0)                   
##  viridis       0.5.1      2018-03-29 CRAN (R 3.5.0)                   
##  viridisLite   0.3.0      2018-02-01 CRAN (R 3.5.0)                   
##  withr         2.1.2      2018-03-15 CRAN (R 3.5.0)                   
##  xfun          0.3        2018-07-06 CRAN (R 3.5.1)                   
##  xml2          1.2.0      2018-01-24 CRAN (R 3.5.0)                   
##  yaml          2.1.19     2018-05-01 CRAN (R 3.5.0)


  1. and radio plays and a TV-show and a movie, but we don’t talk about the movie and TV-series, it’s like jar-jar binks. The radio plays are amazing though.

  2. find her at https://juliasilge.com and twitter

  3. a brilliant mathematician and astrophysicist

To leave a comment for the author, please follow the link and comment on their blog: Category R on Roel's R-tefacts.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)