by Wee Hyong Tok, Senior Data Scientist Manager at Microsoft
How do you read a novel in record time and still gain insight into the emotional journey of its main characters, as they go through trials and tribulations and the story unfolds from chapter to chapter?
I remember what it is like to start reading a novel, get intrigued by the story, and be unable to wait to get to the last chapter. I also recall many conversations with friends about interesting novels that I read a while back, where I had only a vague recollection of what happened in a specific chapter. In this post, I’ll work through how we can use R to analyze the English translation of War and Peace.
War and Peace is a novel by Leo Tolstoy that captures the salient points of Russian history from 1805 to 1812. The novel tells the stories of five families and follows the trials and tribulations of various characters (e.g. Natasha and Andrey). It runs to about 1,400 pages and is one of the longest novels ever written.
We hypothesize that a dashboard (shown below) will let us gain insights into the emotional journey undertaken by the characters in War and Peace.
For example, using the dashboard below and correlating it with the War and Peace story, we can see how Andrey falling in love with Natasha in Book 6 creates a green (positive) highlight in his life. We can also see how the bankruptcy of Natasha’s family in Book 6 overshadows her budding romance with Andrey. In addition, we can observe how Pierre’s emotional journey stays red (negative), especially in Books 11-14, as he almost gets himself killed and finally ends up in prison.
Building the War and Peace Visualization
In this post, I’ll show you how we use the d3heatmap package to build the War and Peace visualization. Let’s start with a dataset that we produced by using Azure Data Lake Analytics and its U-SQL extensions to analyze the English translation of War and Peace.
Let’s Get Started
The dataset (a 23 MB CSV file) consists of six columns. V1, V2, and V3 capture the year, book, and chapter respectively. V4 is a line of text from the novel. V5 denotes the key phrase extracted from each sentence, and V6 denotes the sentiment associated with the key phrase.
To get started, we will first specify the R libraries that we will use to create the War and Peace heatmap visualization.
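As a minimal sketch, the two packages named in this post can be loaded as shown here. The exact set of libraries used for the original visualization is an assumption, and the availability check is added only so that the snippet runs even when a package is not installed.

```r
# Packages named in this post; the full list used originally is an assumption.
pkgs <- c("d3heatmap", "reshape")

# Check which of them are not installed before trying to attach anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install these packages first: ", paste(not_installed, collapse = ", "))
} else {
  library(d3heatmap)
  library(reshape)
}
```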
Next, we load the data that is stored in Azure Blob storage.
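The loading step might look like the sketch below. The Blob storage URL is not given in this post, so the snippet reads two made-up rows from an inline string instead. The row contents and the numeric sentiment values are illustrative assumptions, not taken from the real dataset.

```r
# Made-up stand-in for the real CSV (year, book, chapter, text, key phrase, sentiment).
csv_text <- '1805,1,1,"Pierre entered the room.",Pierre,0.62
1805,1,2,"The guests talked of war.",war,0.31'

# The file has no header row, so read.csv names the columns V1..V6.
wp <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# With the real file, pass its URL or path instead of `text`, for example:
# wp <- read.csv(data_url, header = FALSE, stringsAsFactors = FALSE)  # data_url is hypothetical
str(wp)
```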
We focus on some of the main characters that we want to analyze by specifying them, as follows:
Next, we find the rows in the dataset where each of the characters appeared.
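A minimal sketch of these two steps follows, using a small made-up data frame in place of the real dataset. It assumes that a character "appears" in a row when the name occurs in the text column V4. Matching on the key phrase column V5 instead would be an equally plausible reading.

```r
# Made-up rows standing in for the real dataset (V2 = book, V4 = text, V6 = sentiment).
wp <- data.frame(
  V2 = c(1, 1, 6, 6),
  V4 = c("Pierre entered the room.", "The guests talked of war.",
         "Natasha sang for Andrey.", "Andrey listened to Natasha."),
  V6 = c(0.4, 0.5, 0.9, 0.8),
  stringsAsFactors = FALSE
)

# The characters we want to follow.
characters <- c("Pierre", "Andrey", "Natasha")

# For one character, keep the book and sentiment of every row that mentions the name.
find_rows <- function(name) {
  hit <- grepl(name, wp$V4, fixed = TRUE)
  data.frame(character = rep(name, sum(hit)),
             book = wp$V2[hit],
             sentiment = wp$V6[hit],
             stringsAsFactors = FALSE)
}

# Stack the per-character results into one long data frame.
char_rows <- do.call(rbind, lapply(characters, find_rows))
print(char_rows)
```

A row that mentions two characters is counted once for each of them, which is what we want when every character gets a separate row in the heatmap.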
In the book, a character might appear under different names. For example, Natalie and Natasha refer to the same person (depending on whether she is referred to in Russian or French). We attempt to resolve these alternate names to a single character as follows:
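One simple way to do this is a lookup table from alternate names to a canonical name, sketched here. The table holds only the Natalie example from the text. The variable names are hypothetical, and a real run would need more entries.

```r
# Alias table: names are the alternate spellings, values are the canonical names.
aliases <- c(Natalie = "Natasha")

# Example mentions as they might be found in the text.
mentions <- c("Natalie", "Natasha", "Pierre")

# Replace a mention by its canonical name when it is in the table, else keep it.
canonical <- ifelse(mentions %in% names(aliases), aliases[mentions], mentions)
canonical <- unname(canonical)
print(canonical)
```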
Finally, we aggregate the sentiments of the characters by the book they appear in, and then format the result so that it can be used as input to d3heatmap. We used several useful functions, including melt and cast from the reshape package.
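The sketch below illustrates the same long-to-wide idea with base R, so that it runs without extra packages. It swaps in aggregate and tapply for melt and cast. Taking the mean sentiment per character and book is an assumption, since the aggregation function is not stated here, and the input values are made up.

```r
# Made-up long data: one row per mention of a character in a book.
char_rows <- data.frame(
  character = c("Pierre", "Pierre", "Natasha", "Natasha", "Andrey"),
  book      = c(1, 1, 1, 6, 6),
  sentiment = c(0.2, 0.4, 0.7, 0.9, 0.8),
  stringsAsFactors = FALSE
)

# Long format: one aggregated row per character/book pair (mean is an assumption).
long <- aggregate(sentiment ~ character + book, data = char_rows, FUN = mean)

# Wide format: characters as rows, books as columns, NA where a pair never occurs.
wide <- tapply(char_rows$sentiment,
               list(char_rows$character, char_rows$book),
               mean)
print(long)
print(wide)
```

The wide matrix is the shape a heatmap function expects: one row per character and one column per book.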
This reshapes the data from
to produce the following:
This is then used as input to d3heatmap to produce the War and Peace character sentiment visualization.
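A hedged sketch of the plotting call is shown here with a small made-up matrix. The options are assumptions rather than the settings used for the original dashboard: the dendrogram is turned off so that books stay in order, and the "RdYlGn" palette is chosen so that low values show red and high values show green.

```r
# Made-up sentiment matrix: characters as rows, books as columns.
sentiment <- matrix(
  c(0.3, 0.7, 0.5, 0.9, 0.2, 0.6),
  nrow = 3,
  dimnames = list(c("Andrey", "Natasha", "Pierre"), c("Book 1", "Book 6"))
)

widget <- NULL
if (requireNamespace("d3heatmap", quietly = TRUE)) {
  # No clustering, so rows and columns keep their original order.
  widget <- d3heatmap::d3heatmap(sentiment, dendrogram = "none", colors = "RdYlGn")
} else {
  message("d3heatmap is not installed; skipping the plot")
}
```

Typing `widget` at an interactive R console renders the heatmap in the viewer or browser.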
This produces the following heatmap.
I can now use the same approach to gain insights into any book, and understand which chapters of a novel I want to read!
- Find out how to use U-SQL in Azure Data Lake Analytics to detect key phrases.
- Download the dataset here.
- You can see this in action in Joseph Sirosh's keynote presentation at the Data Science Summit, where Matt Winkler demonstrated how you can use Azure Data Lake Analytics to perform key phrase extraction and sentiment analysis at scale, over the entire collection of War and Peace books.