Relationship Extraction with Spacyr


This is the continuation of the previous project, where we scraped the Cooper Mind website with the rvest package. Please refer to that post for the steps needed to obtain the verified character names.

As a reminder, this project was inspired by the work of Thu Vu, where she created a network map of the characters in the Witcher series. I thought it would be interesting to recreate this project in R, using the Stormlight Archive book series.

For those unfamiliar with the series, it is an epic fantasy story sprawling over four main books at the time this post was published. Sanderson is a fantastic author and I feel that the Stormlight Archive is his best work.

Introduction

In the previous post, we created a list of characters, which will represent the nodes in our network graph. The next step in the project is to create the edges, which represent the relationships between characters. In our graph, the edges will represent the strength of the relationship between characters. To determine these edge values, we will need to perform relationship extraction on the text with the spacyr package.

The spacyr package is a wrapper for the Python spaCy library, with the following functionality:

  • tokenization
  • lemmatizing tokens
  • parsing dependencies (to determine grammatical structure)
  • extracting named entities

It uses the reticulate package to create the Python environment. I have previously written a post about using the reticulate package to run Python code in RMarkdown.
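If you want to confirm which Python installation reticulate will bind to, a quick optional check is the following:

library(reticulate)
# show the Python interpreter and environment reticulate is bound to
py_config()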

Initialization

We start with the loading of the necessary libraries to complete the project.

library(spacyr)
library(tidyverse)
library(data.table)
#necessary to create a corpus object
library(readtext)
library(quanteda)
library(rainette)

If you already have a Python environment with a version of spaCy, you can pass its location to the spacy_initialize function. If not, you need to use the spacy_install function to create a Conda environment that will also include the spaCy package. For this project, I let spacyr create the Conda environment for me. This process did take a while for me, so don’t be surprised if it’s the same for you.

spacy_install()
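If you want to reuse an existing environment instead, something along these lines should work; note that "spacy_condaenv" is just a placeholder name here and the available arguments may differ between spacyr versions:

# point spacyr at an existing Conda environment instead of installing a new one
spacy_initialize(model = "en_core_web_sm", condaenv = "spacy_condaenv")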

I have the name list from the web scraping post saved as an RDS file. RDS files hold a single compressed, serialized R object, so they load quicker and take up much less space than a csv file.

names <- read_rds("data/names.RDS")
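For reference, the file can be written the same way it is read; this is roughly how the names were saved at the end of the scraping post (a sketch, the object name there may have differed):

# save the character vector of verified names for reuse in this post
write_rds(names, "data/names.RDS")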

Text Reading

The first step is to read all the text files into the system. I found an interesting little snippet of code that creates a list of all the text files in a specific folder. For this project, all the books were stored in a single data folder.

list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.txt$",
                            full.names = TRUE)

With the list of files, we can use the map_df function from the purrr package. The purrr package is part of the tidyverse, so we don’t need to load it separately. The map family of functions lets us pass a vector of values and a function; each value is then passed to that function. The _df suffix simply means that the results are combined into a single dataframe.

The same task can be completed with a for loop, but the map function is much faster as it takes advantage of vectorization. Vectorization is the strategy of performing multiple operations at once rather than one at a time. I am not very familiar with the purrr package, so I plan to write a new article on the topic in the near future.
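To make the comparison concrete, here is a small sketch of both approaches; the intermediate object name books is just for illustration, and the map_df version is the one used in the corpus pipeline below.

# for loop version: read each file and grow the result one step at a time
books <- data.frame()
for (f in list_of_files) {
  books <- rbind(books, readtext(f, encoding = 'utf-8'))
}

# map_df version: iteration and row-binding are handled for us
books <- map_df(list_of_files, readtext, encoding = 'utf-8')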

After all the books are read into memory, we need to create a corpus. A corpus is a large body of text, much like a library for the sorting and organization of books. It is created with the corpus function from the quanteda package. This corpus structure is necessary to use the functions from the spacyr package.

This organisational structure in the corpus is why I needed to load the books with the readtext function from the readtext package. I tried many different methods to read the text (readLines, read_lines, readfile), but none of them produced the right input for the corpus function. After plenty of issues and hours of difficulty, I ended up on the quanteda package website. There I learnt about the readtext function, and it worked flawlessly on the first try. Well, I did find an issue with the default encoding not interpreting characters correctly, but that was easily corrected.

When the time came for modeling, issues arose with the size of the corpus. spaCy has a limitation: by default it will only parse texts up to 1,000,000 characters long, and I think each book was a little over twice that size. So I needed to batch the process by breaking the corpus up into smaller sections. This was done with the split_segments function from the rainette package. The function only splits on a number of words per segment, so I arrived at a value of 100,000 words per document.

corpus <- list_of_files %>%
  map_df(readtext, encoding = 'utf-8') %>%
  corpus() %>%
  rainette::split_segments(segment_size = 100000)
## Splitting...
## Done.

With the books read in, the corpus created, and the corpus split into sections, we now have 18 documents. We can proceed to entity modeling with the spaCy functions.
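As an optional sanity check on the corpus object created above, we can confirm the document count and that every segment is under spaCy's character limit:

# number of documents after splitting
ndoc(corpus)

# longest segment in characters, which should be below spaCy's limit
max(nchar(as.character(corpus)))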

Unfortunately, we still have size issues, as passing the entire corpus to be parsed at once is unaffected by the number of documents. So I needed to create a simple for loop to analyze each document one at a time and bind the results to a data table. Data tables are like dataframes, but they have some unique notation and increased performance.

The corpus is parsed with the spacy_parse function. Setting pos and lemma to FALSE should reduce processing time, as the function doesn’t need to return the POS tags or the lemmatized tokens. The POS tag refers to the type of word, such as noun, while the lemmatized token is the base form of a word; for example, the token “am” lemmatizes to the token “be”. The parsing of the corpus takes a very long time.
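To see what those fields would have contained, here is a tiny illustration on a single made-up sentence; the exact output depends on the language model.

# parse one short sentence with POS tags and lemmas switched on
spacy_parse("I am Kaladin.", pos = TRUE, lemma = TRUE)
# returns one row per token with doc_id, sentence_id, token_id,
# token, lemma, pos and entity columns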

df <- corpus[[1]] %>%
  spacy_parse(pos = FALSE, lemma = FALSE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 3.1.3, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

for (i in 2:length(corpus)){
  temp <- corpus[[i]] %>%
    spacy_parse(pos = FALSE, lemma = FALSE)
  df <- rbind(df, temp)
}
rm(temp)

The parsing creates an object that acts very similarly to a data table, with an entry for each word, which is more than what is required for this project. The original data table is preserved, in case we would like to reference a sentence in the corpus, and we create a filtered data table that keeps only the tokens found in the names list whose identified entity starts with PERSON.

dfclean <- df %>%
  filter(token %in% names,
         str_starts(entity, "PERSON"))

Relationship modelling

The final step is to create a model that will connect people in the data table. I have decided to use a sentence window that creates a connection whenever two names are mentioned within that window.

This is another very time consuming task that requires two for loops. The first loop goes through all 17747 rows and takes each row's sentence id. A second for loop, which skips the rows already covered by the first loop, compares a second sentence id. If the difference between the sentence ids is less than the window size, the tokens for these two rows are added to the (initially empty) data table. If the difference is greater than the window size, we break out of the second for loop, since the sentence ids are incremental. It is not a very clear or smooth method, but it works.

window_size <- 5
related <- data.table("Person1" = character(), "Person2" = character())

for (i in 1:(nrow(dfclean) - 1)){
  for (j in (i + 1):nrow(dfclean)){
    if ((dfclean$sentence_id[j] - dfclean$sentence_id[i]) < window_size){
      related <- rbindlist(list(related, list(dfclean$token[i], dfclean$token[j])))
    } else {
      break
    }
  }
}

The following is a sample of the data table we have created to build the relationships.

related %>% head()
## Person1 Person2
## 1: Jezrien Jezrien
## 2: Jezrien Jezrien
## 3: Jezrien Jezrien
## 4: Jezrien Jezrien
## 5: Jezrien Kalak
## 6: Jezrien Kalak

We can identify two issues with this sample. The first is when two mentions of the same name land within the same window; we will have to filter out rows where ‘Person1’ is equal to ‘Person2’. The second is that we would actually like to aggregate the data: a count of how often two different names appear in the same window. Both of these tasks are easy enough to solve using the built-in data table notation. For more information on data tables, please refer to my previous post on the topic.

relatedagg <- related[Person1 != Person2,.N,by = c("Person1", "Person2")]
relatedagg %>% head()
## Person1 Person2 N
## 1: Jezrien Kalak 10
## 2: Kalak Jezrien 9
## 3: Cenn Stormfather 3
## 4: Stormfather Cenn 20
## 5: Kaladin Cenn 76
## 6: Cenn Kaladin 64

The final issue is that pairs of ‘Person1’ and ‘Person2’ with their places switched are counted separately, but that will be dealt with in the next post.
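For anyone who wants a head start, one possible approach (just a sketch, not necessarily what the next post will do) is to put each pair into a canonical order before aggregating again:

# order each pair alphabetically so (A, B) and (B, A) are counted together
relatedsym <- relatedagg[, .(N = sum(N)),
                         by = .(Person1 = pmin(Person1, Person2),
                                Person2 = pmax(Person1, Person2))]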

Conclusion

With some hard work, we were able to create an organized corpus of all four current Stormlight Archive books. We split this corpus into smaller documents, making them easier to manage. The spacyr library was then used to model entities within the corpus, identifying the tokens that represent people. The next step was to clean up the results, keeping only the verified character names as tokens. We then used a model to develop relationships using a window: a relationship was created whenever two character names were mentioned in the same window. We then filtered out characters’ relationships to themselves and aggregated the data. The clear next step is to actually build the graph with the characters as nodes and their relationships as edges. But that is a post for another day.
