by Wee Hyong Tok, Senior Data Scientist Manager at Microsoft

How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials and tribulations, as an exciting story unfolds from chapter to chapter?

I remember what it is like to start a novel, get intrigued by the story, and be unable to wait for the last chapter. I also recall many conversations with friends about interesting novels that I read a while back, where I had only a vague recollection of what happened in a specific chapter. In this post, I’ll work through how we can use R to analyze the English translation of War and Peace.

War and Peace is a novel by Leo Tolstoy that captures the salient points of Russian history from 1805 to 1812. It tells the stories of five families and follows the trials and tribulations of various characters (e.g. Natasha and Andrey). At about 1400 pages, it is one of the longest novels ever written.

We hypothesize that a dashboard like the one shown below will give us insight into the emotional journey undertaken by the characters in War and Peace.

For example, using the dashboard below and correlating it with the War and Peace story, we can see how Andrey falling in love with Natasha in Book 6 creates a green (positive) highlight in his life. We can also see how the bankruptcy of Natasha’s family in Book 6 overshadows her budding romance with Andrey. In addition, we can observe how Pierre’s emotional journey stays red (negative), especially in Books 11-14, as he almost kills himself and finally ends up in prison.

## Building the War and Peace Visualization

In this post, I’ll show you how we use the d3heatmap package to build the War and Peace visualization. We start with a dataset produced by analyzing the English translation of War and Peace with Azure Data Lake Analytics and its U-SQL extensions.

### Let’s Get Started

The dataset (a 23 MB CSV file) consists of 6 columns. V1, V2, and V3 capture the year, book, and chapter respectively. V4 is a line of text from the novel. V5 denotes the key phrase extracted from each sentence, and V6 denotes the sentiment associated with the key phrase.

To get started, we will first specify the R libraries that we will use to create the War and Peace heatmap visualization.

# Load libs
library(dplyr)
library(d3heatmap)
library(tidyr)
library(stringr)
library(reshape)


Next, we load the data that is stored in Azure Blob storage.

# Read the war and peace dataset, with key phrase and sentiment extracted
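
A minimal sketch of this step, assuming the CSV has no header row (which is how read.csv arrives at the default column names V1 to V6). The file below is a made-up two-row stand-in written to a temporary path, not the real dataset, and warpeaceSample is a hypothetical name; the real file would be read from its Azure Blob storage location instead.

```r
# Hypothetical stand-in for the real dataset: two made-up rows, no header row
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  '1805,1,1,"Pierre smiled at the guests.",Pierre,0.8',
  '1805,1,2,"Natasha wept in the garden.",Natasha,-0.6'
), csvPath)

# header = FALSE makes read.csv name the columns V1 through V6;
# stringsAsFactors = FALSE keeps the text columns as plain character vectors
warpeaceSample <- read.csv(csvPath, header = FALSE, stringsAsFactors = FALSE)
str(warpeaceSample)
```

Keeping V5 as character rather than factor matters later, because the character matching relies on pattern searches over that column.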


We focus on some of the main characters that we want to analyze by specifying them, as follows:

# Specify the characters in War and Peace
# (each name is listed once, so no row is matched and counted twice)
bookCharacters <- c("Cyril", "Pierre", "Nicholas",
                    "Rostóva", "Natalie", "Natasha", "Catiche", "Ilyá",
                    "Nikolenka", "Pétya", "Véra", "Sónya",
                    "Alpatych", "Vasíli", "Anatole", "Leyla", "Borís", "Mitenka",
                    "Berg", "Bourienne", "Lorrain", "Michael Ivánovich",
                    "Timókhin", "Kozlovski", "Nesvítski", "Kirsten", "Bilibin", "Bagration",
                    "Murat", "Tushin")


Next, we find the rows in the dataset where each of the characters appeared.

# Create an initial data frame
wpBase <- warpeace[grep("Mack", warpeace$V5), ]
wpBase$character <- "Mack"

# For each character in the book, find the rows with the specific book character
for (person in bookCharacters) {
  wpNew <- warpeace[grep(person, warpeace$V5), ]
  if (nrow(wpNew) > 0) {
    wpNew$character <- person
    wpBase <- rbind(wpBase, wpNew)
  }
}
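
To see what this matching does on a small scale, here is a sketch on made-up data. It uses lapply with a single rbind instead of growing the data frame inside a loop, which is an alternative I am assuming is acceptable here, not the approach used for the real dataset. The names toy, toyCharacters, and toyTagged are hypothetical.

```r
# Toy data: V5 holds the key phrase, V6 the sentiment (all values are made up)
toy <- data.frame(
  V2 = c(1, 1, 2, 2),
  V5 = c("Pierre", "Prince Andrew", "Natasha and Pierre", "the ballroom"),
  V6 = c(0.5, -0.2, 0.9, 0.1),
  stringsAsFactors = FALSE
)
toyCharacters <- c("Pierre", "Natasha", "Kutuzov")

# Build one data frame per character, tagging each matching row with the name
matches <- lapply(toyCharacters, function(person) {
  hits <- toy[grepl(person, toy$V5, fixed = TRUE), ]
  if (nrow(hits) > 0) hits$character <- person
  hits
})

# Drop characters with no matching rows, then stack the rest in one rbind
matches <- Filter(function(d) nrow(d) > 0, matches)
toyTagged <- do.call(rbind, matches)
print(toyTagged)
```

Note that the key phrase "Natasha and Pierre" is tagged once for each character, so its sentiment counts toward both of them.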


In the book, a character might appear under different names. For example, Natalie and Natasha refer to the same person (depending on whether the French or the Russian form of the name is used). We attempt to resolve these references to a single name as follows:

# Resolve to same-name references
wpBase$character[wpBase$character == "Natásha"] <- "Natasha"
wpBase$character[wpBase$character == "Natalie"] <- "Natasha"

wpBase$character[wpBase$character == "Monsieur Pierre"] <- "Pierre"
wpBase$character[wpBase$character == "Count Bezúkhov"] <- "Pierre"

wpBase$character[wpBase$character=="Prince Andrew"] <- "Andrey"
wpBase$character[wpBase$character=="Andrew"] <- "Andrey"
wpBase$character[wpBase$character=="Andrey"] <- "Andrey"
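
As the list of aliases grows, one assignment per alias gets repetitive. A minimal sketch of an alternative, assuming a named lookup vector is acceptable: the names are the variants and the values are the canonical names. The vector names aliases and charLabels are hypothetical, and the sample labels are made up.

```r
# Hypothetical alias table: names are the variants, values the canonical names
aliases <- c(
  "Natalie"         = "Natasha",
  "Monsieur Pierre" = "Pierre",
  "Prince Andrew"   = "Andrey",
  "Andrew"          = "Andrey"
)

# Made-up sample of character labels
charLabels <- c("Natalie", "Pierre", "Andrew", "Berg")

# Replace only the labels that have an alias entry; leave the rest untouched
known <- charLabels %in% names(aliases)
charLabels[known] <- unname(aliases[charLabels[known]])
print(charLabels)
```

The same two lines would apply to a data frame column by substituting it for charLabels.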


Finally, we aggregate the sentiment of each character by the book they appear in, and then format the result so that it can be used as input to d3heatmap. We use several helpful functions along the way, including melt and cast from the reshape package.

# Aggregate by character, and then order by book
warpeaceG <- wpBase %>% group_by(V2,character) %>% summarize(sentimentOverall=sum(V6))
warpeaceOrderByBook <- warpeaceG[order(warpeaceG$V2), ]

md <- melt(warpeaceOrderByBook, id = (c("V2", "character")))
castd <- cast(md, character ~ V2)

# Replace all NA with 0
castd[is.na(castd)] <- 0


This reshapes the data from a long format, with one row per book and character, into a wide format with one row per character and one column per book.
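
To make the long-to-wide step concrete, here is a small sketch on made-up numbers. It uses base R's tapply in place of melt and cast, purely so the example runs without extra packages; longSample and wideMatrix are hypothetical names.

```r
# Made-up long-format data: one row per (book, character) pair
longSample <- data.frame(
  V2 = c(1, 1, 2, 3),
  character = c("Pierre", "Natasha", "Pierre", "Natasha"),
  sentimentOverall = c(1.5, -0.5, 2.0, 0.75),
  stringsAsFactors = FALSE
)

# Rows become characters, columns become books, cells hold summed sentiment
wideMatrix <- tapply(
  longSample$sentimentOverall,
  list(longSample$character, longSample$V2),
  sum
)

# Character/book combinations that never occur come back as NA; set them to 0
wideMatrix[is.na(wideMatrix)] <- 0
print(wideMatrix)
```

Replacing NA with 0 treats a book in which a character does not appear as neutral, which is the same choice made in the code above.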

The result is then passed to d3heatmap to produce the War and Peace character sentiment visualization.

d3heatmap(castd,scale = "column", colors = "Spectral",
dendrogram = "none", Rowv = FALSE, Colv = FALSE)


This produces the following heatmap.

I can now use the same approach to gain insights into any book, and understand which chapters of a novel I want to read!

## Resources

• Find out how to use U-SQL in Azure Data Lake Analytics to detect key phrases.