Pride and Prejudice and Z-scores

April 20, 2016

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text the 6 Austen novels, including Pride and Prejudice and Sense and Sensibility.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:


There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:


An apparent Austen afficionada, Julia interprets the analysis:

This is super interesting to me. Emma and Northanger Abbey have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. Mansfield Park and Persuasion also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. Persuasion also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)