Arthur blinked, Ford shrugs, but Zaphod leapt; text as graph
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Can we make the computer say something about characters in a book? In this piece I will search for the names of characters and the words around those names in books. What can we learn about a character from text analysis? Of course it’s also just another excuse for me to read the Hitchhikers series! I will break down the text into chunks of two words, extract the word pairs that matter and visualize the results. Come an play along with your favorite book.
Inspiration
The Hitchhiker’s Guide to the Galaxy (HHGTTG for short) is a series of scifi novels1 about the last surviving man, Arthur Dent, after the destruction of earth. And I love it.
Arthur is an anti-hero, he’s not brave, he’s not smart, he sort of stumbles through life and things happen to him, he has almost no agency. That is something you realize while you are reading the book, but can we find that in the text as well? What can we find out about the actions of people in a book by looking at words around subjects in a book?
I’m inspired by the amazing text work of Julia Silge. For instance on “She giggles, he gallops”2 (link at the bottom of page, really check it out!) Julia and her co-writers scanned through 2000 movie manuscripts and extracted words around gender pronouns. They show that women will more likely snuggle, giggle, squeal, and sob, relative to men. Conversely, men are more likely to strap, gallop, shoot, howl, and kill. See more in she giggles, he gallops.
So what can we do with these books? I will walk through all the main characters (and Marvin) and at the end I also look at gender pronouns (he vs she). There is already a problem: there is only one woman in the cast. This could be a problem.
Many of the steps I take here are copied from the tidytextmining book.
Analyses
Approach
- I used the pubcrawl package to load in every book.
- use the packages tidygraph, tidyverse, tidytext, ggraph
- create bigrams of text, (go through and extract every two words)
- take more frequent subset of that
- select characters or actions and display them.
leading to:
Data
The books are of course copyrighted so I cannot share the files with you, that would be piracy. books loaded, cannot do it per chapter, because the chapters are messed up in some of the books. I do it per book. that’s life.
packages
- I use pubcrawl to read in the files see previous post
- tidytext for the tokenization, part of speech tagging
- tidygraph not that much, but for the conversion to graph (could do it with igraph too)
- ggraph to plot the images
- tidyverse (mostly dplyr, and ggplot, a bit of stringr ) for everything else
Loading the data
HHGTTG <- readr::read_rds("link_to_your_book")
Loading the packages
# I know, a lot! library(tidytext) library(tidygraph) library(ggraph) library(tidyverse) library(scales)
create bigrams of text, (go through and extract every two words)
unnested_hhgttg <- HHGTTG %>% unnest_tokens(output = "bigram", content, token = "ngrams", n=2) %>% group_by(book) %>% count(bigram,sort = TRUE) unnested_hhgttg %>% head() ## # A tibble: 6 x 3 ## # Groups: book [4] ## book bigram n ## <chr> <chr> <int> ## 1 5 Mostly Harmless of the 421 ## 2 3 Life, the Universe and Everything of the 385 ## 3 2 Restaurant at the End of the Universe of the 377 ## 4 1 Hitchhiker's Guide to the Galaxy of the 301 ## 5 5 Mostly Harmless it was 296 ## 6 5 Mostly Harmless in the 250
Oh no, too many stop words! We have to remove them with the stop words data set from tidytext.
unnested_hhgttg <- unnested_hhgttg %>% separate(bigram, c("word1", "word2"), sep = " ") %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) unnested_hhgttg %>% head() ## # A tibble: 6 x 4 ## # Groups: book [4] ## book word1 word2 n ## <chr> <chr> <chr> <int> ## 1 2 Restaurant at the End of the Universe ford prefect 53 ## 2 2 Restaurant at the End of the Universe zaphod beeblebrox 42 ## 3 3 Life, the Universe and Everything arthur dent 33 ## 4 5 Mostly Harmless perfectly normal 33 ## 5 1 Hitchhiker's Guide to the Galaxy ford prefect 28 ## 6 2 Restaurant at the End of the Universe arthur dent 26
Hmm these people are called by their complete name a lot.
Take only the most frequently used bigrams
I am no longer really interested in the book I found words in.
person_actions <- unnested_hhgttg %>% filter(n >1) %>% filter(word2 != "er") %>% # don't know what it is, but I don't want it ungroup() %>% filter(word1 %in% c("ford", "zaphod", "arthur", "tricia", "trillian")) %>% group_by(word1, word2) %>% summarise(n = sum(n)) head(person_actions,4) ## # A tibble: 4 x 3 ## # Groups: word1 [1] ## word1 word2 n ## <chr> <chr> <int> ## 1 arthur blinked 4 ## 2 arthur couldn’t 2 ## 3 arthur dent 93 ## 4 arthur dent's 4
Create a cool plotting theme
I will plot several of the same sort of network plots and I don’t want to set the theme every time. You could use: theme_set()
but I created a function here. the function takes data, and you can set a title and subtitle, no need for NSE here.
I use a black background, because, you know, space! it’s big! And empty, and empty is black, because absence of light is dark… Or something. And there is a lot of black because the Guide says:
Space is big. You just won’t believe how vastly, hugely, mind- bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.
So.. black background and other colors, I’ve tried out several color schemes from colorbrewer and this sort of looked nice.
create_hhgttg_plot <- function(data, title = "what an amazing title!", subtitle = "donkeyballs"){ # setting some colors c3_purple <- "#7570b3" c3_orange <- "#d95f02" # making a igraph object data %>% as_tbl_graph() %>% ggraph(layout = "auto") + geom_edge_link(aes(edge_width = sqrt(n)),colour = c3_purple, show.legend = FALSE) + geom_node_point(color = "white", size = 5,alpha = 4/5 ) + geom_node_label(aes(label = name),color = c3_orange, repel = TRUE,size = 5) + labs( title = title, subtitle = subtitle, caption = "Roel M. Hogervorst: https://blog.rmhogervorst.nl\nExtracted from the first 5 books from the HHGTTG-trilogy" )+ theme_void()+ theme( plot.background = element_rect(fill = "black"), text = element_text(colour = c3_orange), plot.margin = unit(c(.3,.5,.1,.5), "cm") ) } set.seed(42) # if I redo everything looks the same
A plot per character
The main character Arthur Dent:
person_actions %>% filter(word1 == "arthur") %>% filter(word2 != "dent") %>% filter(n >2) %>% create_hhgttg_plot("Arthur Dent is perpetually confused", "He blinks, stares, glances and looks") ## Using `nicely` as default layout
What about the Hitchhikers’ guide writer Ford Prefect?
person_actions %>% filter(word1 == "ford") %>% filter(!str_detect(word2, "prefect")) %>% filter(n>2) %>% create_hhgttg_plot("Ford is a man of action and disinterest", "He moves, he's hurled, he looks, frowns and shrugs") ## Using `nicely` as default layout
Than there is the the President of the Galaxy: Zaphod Beeblebrox.
person_actions %>% filter(word1 == "zaphod") %>% filter(!str_detect(word2, "beeblebrox")) %>% create_hhgttg_plot("Zaphod is more emotional", "He stares, is angry, bitterly and shrugs") ## Using `nicely` as default layout
Tricia McMillan, the smartest person in the book3 (after Marvin, the android with a brain the size of a galaxy; and should we count him as a person?), mostly just stands and watches what the others are doing. She has two names, and the two names do not appear to have overlapping words. Tricia is most commonly referred to simply as “Trillian”, a modification of her birth name, which she adopted because it sounded more “space-like”.
person_actions %>% filter(word1 %in% c("tricia", "trillian")) %>% filter(word2 != "mcmillan") %>% create_hhgttg_plot("Tricia sighes", "quickly and quietly") ## Using `nicely` as default layout
What’s up with Marvin?
unnested_hhgttg %>% filter(word1 == "marvin") %>% # oops forgat to include him (it? not sure...) ungroup() %>% inner_join( parts_of_speech %>% filter(str_detect(pos, "Verb")) %>% group_by(word) %>% distinct(pos), by = c("word2"="word")) %>% group_by(word1,word2) %>% summarise(n = sum(n)) %>% create_hhgttg_plot("Marvin mostly moves, but in a depressing way", "Marvin: 'Life, don't talk to me about life' ") ## Using `nicely` as default layout
looking at it from the word perspective
Using the parts of speech data frame in tidytext I can select all verbs and select verbs I’m interested in.
action_words <- unnested_hhgttg %>% ungroup() %>% group_by(word1, word2) %>% summarise(n = sum(n)) %>% inner_join(parts_of_speech %>% filter(str_detect(pos, "Verb") ),by = c("word2"="word")) # action_words %>% pull(word2) %>% unique() # choose some words action_words %>% filter(n>3) %>% pull(word2) %>% unique() ## [1] "bob" "express" "dent" "ford" "found" ## [6] "glanced" "lay" "nodded" "noticed" "realized" ## [11] "sat" "shook" "stared" "heart" "lot" ## [16] "fish" "grill" "machine" "house" "beep" ## [21] "pad" "people" "guide" "hole" "ship" ## [26] "whore" "park" "sight" "mission" "bit" ## [31] "click" "hum" "round" "bank" "program" ## [36] "panel" "card" "ground" "leather" "green" ## [41] "space" "notice" "worry" "slid" "gown" ## [46] "cloud" "net" "cake" "floor" "message" ## [51] "bowl" "deck" "swish" "continued" "leapt" ## [56] "moved" "shrugged" "mouse" "scout" "fire" ## [61] "gargle" "howl" "silver" "bail" "blur" ## [66] "shore" "swung" "sandwich" "wall" "laugh" ## [71] "yellow" "bypass" "cream" "drive" "field" ## [76] "cruise" "spirit" "mail" "war" "forward" ## [81] "foot" "hand" "head" "supervising" "form" ## [86] "support" "cricket" "level" "block" "dot" ## [91] "air" "mash" "matter" "sky" "oil" ## [96] "call" "earth" "cup" "voice" "skin" ## [101] "ball" "plain" "ring" "class" "band" ## [106] "proof" "bag" "cough" "travel" "lined" ## [111] "stomp" "throb" "occurred" "realised" "bomb" ## [116] "flop" "light" "tape" "lipped" "journey" ## [121] "answer" "question" "arm" "minor" "screen" ## [126] "captain" "guard" "weave" "glass" "pillar" ## [131] "gun" action_words %>% filter(word2 == "realised") %>% create_hhgttg_plot("Tricia, Arthur and Ford realise", "") ## Using `nicely` as default layout
action_words %>% filter(word2 == "noticed") %>% create_hhgttg_plot("Arthur and Ford notice", "Arthur noticed a lot") ## Using `nicely` as default layout
action_words %>% filter(word2 == "leapt") %>% create_hhgttg_plot("Not only Zaphod and Arthur leap", "civilization, beer, gold, rocket and flames leap too") ## Using `nicely` as default layout
action_words %>% filter(word2 == "shrugged") %>% create_hhgttg_plot("Who shrugs?", "Mostly Ford") ## Using `nicely` as default layout
action_words %>% filter(word2 == "moved") %>% filter(n>1) %>% create_hhgttg_plot("Who moved?", subtitle = "only the male main characters") ## Using `nicely` as default layout
He vs She
Let’s try again with the basics, he vs she. She is used a lot less. he is used 1413 times, and she 373 times. And disproportionately in the last book ‘mostly harmless’. I recall that there is a new character in the final book, Random, the daughter of Arthur.
she_he <- HHGTTG %>% unnest_tokens(output = "bigram", content, token = "ngrams", n=2) %>% group_by(book) %>% count(bigram,sort = TRUE) %>% separate(bigram, c("word1", "word2"), sep = " ") %>% filter(word1 %in% c("he", "she") ) she_he %>% filter(word1 == "she") %>% count(book) ## # A tibble: 5 x 2 ## # Groups: book [5] ## book nn ## <chr> <int> ## 1 1 Hitchhiker's Guide to the Galaxy 27 ## 2 2 Restaurant at the End of the Universe 16 ## 3 3 Life, the Universe and Everything 43 ## 4 4 So Long, and Thanks for All the Fish 97 ## 5 5 Mostly Harmless 190
What verbs are more he-like, and she-like?
I threw away the he and she when I did a anti_join of stop words, so let’s start over again.
I’m following parts of the procedures from here: https://www.tidytextmining.com/twitter.html#word-frequencies-1
To account for the difference in numbers I calculate the frequency per he and per she and divide those 2.
logratio_she_he <- she_he %>% group_by(word1,word2) %>% summarise(n = sum(n)) %>% #stopped caring about the book group_by(word1) %>% mutate( total_count = sum(n), freq_type = n/total_count ) %>% group_by(word2) %>% mutate(total_count_w = sum(n)) %>% filter(total_count_w >3) %>% inner_join( parts_of_speech %>% filter(str_detect(pos, "Verb")) %>% group_by(word) %>% distinct(word), by = c("word2"="word")) %>% # only keep verbs select(word1, word2, freq_type, total_count_w) %>% spread(word1, freq_type) %>% filter(!is.na(he) & !is.na(she)) %>% ungroup() %>% mutate( likelihood = he/she, logratio = log(he / she), label = reorder(word2, logratio) ) logratio_she_he %>% filter(total_count_w > 25) %>% # select only more occuring words ggplot(aes(label, logratio, size = total_count_w)) + geom_point(aes(color = logratio < 0),show.legend = FALSE)+ geom_hline(yintercept = 0, color = "#d95f02")+ coord_flip()+ labs( title = "Log odds ratio he/she ", subtitle = "Left from the line is more typical for he, right for she", caption = "Roel M. Hogervorst: https://blog.rmhogervorst.nl\nExtracted from the first 5 books from the HHGTTG-trilogy", x = "", y = "log ratio (left [green] he, right [yellow] she )" )+ scale_color_manual(name = "", values = c("yellow", "green") )+ # define ugly colors. theme_dark() + theme( plot.background = element_rect(fill = "black"), panel.background = element_rect(fill = "black"), text = element_text(colour = "#d95f02"), plot.margin = unit(c(.3,.5,.1,.5), "cm"), axis.text = element_text(colour = "#d95f02") )
Final thoughts
I think it’s a nice way to visualize information from a book. Some characters are more active and others more emotional.
References
love the amount of detail in the wikipedia pages about HHGTTG
if you want to go crazy on tagging parts of speech (POS), you can use the udpipe package, it has POS annotations for many languages, automatic keyword extraction, topic modeling, lemmatization and dependency parsing.
State of the machine
At the moment of creation (when I knitted this document ) this was the state of my machine: click here to expand
sessioninfo::session_info() ## ─ Session info ────────────────────────────────────────────────────────── ## setting value ## version R version 3.5.1 (2018-07-02) ## os Ubuntu 16.04.5 LTS ## system x86_64, linux-gnu ## ui X11 ## language en_US ## collate en_US.UTF-8 ## tz Europe/Amsterdam ## date 2018-07-24 ## ## ─ Packages ────────────────────────────────────────────────────────────── ## package * version date source ## assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0) ## backports 1.1.2 2017-12-13 CRAN (R 3.5.0) ## bindr 0.1.1 2018-03-13 CRAN (R 3.5.0) ## bindrcpp * 0.2.2 2018-03-29 CRAN (R 3.5.0) ## blogdown 0.8 2018-07-15 CRAN (R 3.5.1) ## bookdown 0.7 2018-02-18 CRAN (R 3.5.0) ## broom 0.4.5 2018-07-03 CRAN (R 3.5.1) ## cellranger 1.1.0 2016-07-27 CRAN (R 3.5.0) ## cli 1.0.0 2017-11-05 CRAN (R 3.5.0) ## clisymbols 1.2.0 2017-05-21 CRAN (R 3.5.0) ## colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0) ## crayon 1.3.4 2017-09-16 CRAN (R 3.5.0) ## digest 0.6.15 2018-01-28 CRAN (R 3.5.0) ## dplyr * 0.7.6 2018-06-29 CRAN (R 3.5.1) ## evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0) ## fansi 0.2.3 2018-05-06 CRAN (R 3.5.1) ## farver 1.0 2018-07-16 Github (thomasp85/farver@d29b48c) ## forcats * 0.3.0 2018-02-19 CRAN (R 3.5.0) ## foreign 0.8-70 2018-04-23 CRAN (R 3.5.0) ## ggforce 0.1.3 2018-07-07 CRAN (R 3.5.1) ## ggplot2 * 3.0.0 2018-07-03 cran (@3.0.0) ## ggraph * 1.0.2 2018-07-07 CRAN (R 3.5.1) ## ggrepel 0.8.0 2018-05-09 CRAN (R 3.5.0) ## glue 1.3.0 2018-07-18 Github (tidyverse/glue@66de125) ## gridExtra 2.3 2017-09-09 CRAN (R 3.5.0) ## gtable 0.2.0 2016-02-26 CRAN (R 3.5.0) ## haven 1.1.2 2018-06-27 CRAN (R 3.5.1) ## hms 0.4.2 2018-03-10 CRAN (R 3.5.0) ## htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0) ## httr 1.3.1 2017-08-20 CRAN (R 3.5.0) ## igraph 1.2.1 2018-03-10 CRAN (R 3.5.0) ## janeaustenr 0.1.5 2017-06-10 CRAN (R 3.5.0) ## jsonlite 1.5 2017-06-01 CRAN (R 3.5.0) ## knitr 1.20 2018-02-20 CRAN (R 3.5.0) ## labeling 0.3 2014-08-23 CRAN (R 3.5.0) ## lattice 0.20-35 2017-03-25 CRAN (R 3.5.0) ## lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0) ## lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0) ## magrittr 1.5 2014-11-22 CRAN (R 3.5.0) ## MASS 7.3-50 2018-04-30 CRAN (R 3.5.0) ## Matrix 1.2-14 2018-04-09 CRAN (R 3.5.0) ## mnormt 1.5-5 2016-10-15 CRAN (R 3.5.0) ## modelr 0.1.2 2018-05-11 CRAN (R 3.5.0) ## munsell 0.5.0 2018-06-12 CRAN (R 3.5.0) ## nlme 3.1-137 2018-04-07 CRAN (R 3.5.0) ## pillar 1.3.0 2018-07-14 CRAN (R 3.5.1) ## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0) ## plyr 1.8.4 2016-06-08 CRAN (R 3.5.0) ## psych 1.8.4 2018-05-06 CRAN (R 3.5.0) ## purrr * 0.2.5 2018-05-29 CRAN (R 3.5.0) ## R6 2.2.2 2017-06-17 CRAN (R 3.5.0) ## Rcpp 0.12.18 2018-07-23 cran (@0.12.18) ## readr * 1.1.1 2017-05-16 CRAN (R 3.5.0) ## readxl 1.1.0 2018-04-20 CRAN (R 3.5.0) ## reshape2 1.4.3 2017-12-11 CRAN (R 3.5.0) ## rlang 0.2.1 2018-05-30 CRAN (R 3.5.0) ## rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0) ## rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0) ## rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0) ## rvest 0.3.2 2016-06-17 CRAN (R 3.5.0) ## scales * 0.5.0 2017-08-24 CRAN (R 3.5.0) ## sessioninfo 1.0.0 2017-06-21 CRAN (R 3.5.1) ## SnowballC 0.5.1 2014-08-09 CRAN (R 3.5.0) ## stringi 1.2.4 2018-07-20 cran (@1.2.4) ## stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0) ## tibble * 1.4.2 2018-01-22 CRAN (R 3.5.0) ## tidygraph * 1.1.0 2018-02-10 CRAN (R 3.5.0) ## tidyr * 0.8.1 2018-05-18 CRAN (R 3.5.0) ## tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0) ## tidytext * 0.1.9 2018-05-29 CRAN (R 3.5.0) ## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.5.0) ## tokenizers 0.2.1 2018-03-29 CRAN (R 3.5.0) ## tweenr 0.1.5.9999 2018-07-16 Github (thomasp85/tweenr@4d4f8d1) ## units 0.6-0 2018-06-09 CRAN (R 3.5.0) ## utf8 1.1.4 2018-05-24 CRAN (R 3.5.0) ## viridis 0.5.1 2018-03-29 CRAN (R 3.5.0) ## viridisLite 0.3.0 2018-02-01 CRAN (R 3.5.0) ## withr 2.1.2 2018-03-15 CRAN (R 3.5.0) ## xfun 0.3 2018-07-06 CRAN (R 3.5.1) ## xml2 1.2.0 2018-01-24 CRAN (R 3.5.0) ## yaml 2.1.19 2018-05-01 CRAN (R 3.5.0)
and radio plays and a TV-show and a movie, but we don’t talk about the movie and TV-series, it’s like jar-jar binks. The radio plays are amazing though.↩
find her at https://juliasilge.com and twitter↩
a brilliant mathematician and astrophysicist↩
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.