The Life-Changing Magic of Tidying Text
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When I went to the rOpenSci unconference about a month ago, I started work with Dave Robinson on a package for text mining using tidy data principles. What is this tidy data you keep hearing so much about? As described by Hadley Wickham, tidy data has a specific structure:
- each variable is a column
- each observation is a row
- each type of observational unit is a table
This means we end up with a data set that is in a long, skinny format instead of a wide format. Tidy data sets are easier to work with, and this is no less true when one starts to work with text. Most of the tooling and infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. Our goal in writing the tidytext package is to provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. We got a great start on our work when we were at the unconference and recently finished getting the first release ready; this week, tidytext was released on CRAN!
Jane Austen’s Novels Can Be So Tidy
One of the important functions we knew we needed was something to unnest text by some chosen token. In my previous blog posts, I did this with a for
loop first and then with a function that involved several dplyr bits. In the tidytext package, there is a function unnest_tokens
which has this functionality; it restructures text into a one-token-per-row format. This function is a way to convert a dataframe with a text column to be one-token-per-row. Let’s look at an example using Jane Austen’s novels.
(What?! Surely you’re not tired of them yet?)
The janeaustenr package has a function austen_books
that returns a tidy dataframe of all of the novels. Let’s use that, annotate a linenumber
quantity to keep track of lines in the original format, use a regex to find where all the chapters are, and then unnest_tokens
.
This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
Now that the data is in one-word-per-row format, the TIDY DATA MAGIC can happen and we can manipulate it with tidy tools like dplyr. For example, we can remove stop words (kept in the tidytext dataset stop_words
) with an anti_join
.
Then we can use count
to find the most common words in all of Jane Austen’s novels as a whole.
Sentiment analysis can be done as an inner join. Three sentiment lexicons are in the tidytext package in the sentiment
dataset. Let’s examine how sentiment changes changes during each novel. Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.
Now we can plot these sentiment scores across the plot trajectory of each novel.
This is similar to some of the plots I have made in previous posts, but the effort and time required to make it is drastically less. More importantly, the thinking required to make it comes much more easily because it all falls so naturally out of joins and other dplyr
verbs.
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a negative sentence, not a positive one, because of negation. The Stanford CoreNLP tools and the sentimentr R package (currently available on Github but not CRAN) are examples of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences.
Let’s look at just one.
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII.
Near the beginning of this vignette, we used a regex to find where all the chapters were in Austen’s novels. We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a dataframe of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. Which chapter has the highest proportion of negative words?
These are the chapters with the most negative words in each book, normalized for number of words in the chapter. What is happening in these chapters? In Chapter 29 of Sense and Sensibility Marianne finds out what an awful jerk Willoughby is by letter, and in Chapter 34 of Pride and Prejudice Mr. Darcy proposes for the first time (so badly!). Chapter 45 of Mansfield Park is almost the end, when Tom is sick with consumption and Mary is revealed as all greedy and a gold-digger, Chapter 15 of Emma is when horrifying Mr. Elton proposes, and Chapter 27 of Northanger Abbey is a short chapter where Catherine gets a terrible letter from her inconstant friend Isabella. Chapter 21 of Persuasion is when Anne’s friend tells her all about Mr. Elliott’s immoral past.
Interestingly, many of those chapters are very close to the ends of the novels; things tend to get really bad for Jane Austen’s characters before their happy endings, it seems. Also, these chapters largely involve terrible revelations about characters through letters or conversations about past events, rather than some action happening directly in the plot. All that, just with dplyr
verbs, because the data is tidy.
Networks of Words
Another function in tidytext is pair_count
, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of Pride and Prejudice.
This can be useful, for example, to plot a network of co-occuring words with the igraph and ggraph packages.
Ten/five/whatever thousand pounds a year!
Let’s do another one!
Lots of proper nouns are showing up in these network plots (Box Hill, Frank Churchill, Lady Catherine de Bourgh, etc.), and it is easy to pick out the main characters (Elizabeth, Emma). This type of network analysis is mainly showing us the important people and places in a text, and how they are related.
Word Frequencies
A common task in text mining is to look at word frequencies and to compare frequencies across different texts. We can do this using tidy data principles pretty smoothly. We already have Jane Austen’s works; let’s get two more sets of texts to compare to. Dave has just put together a new package to search and download books from Project Gutenberg through R; we’re going to use that because this is a better way to follow Project Gutenberg’s rules for robot access. And it is SO nice to use! First, let’s look at some science fiction and fantasy novels by H.G. Wells, who lived in the late 19th and early 20th centuries. Let’s get The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau.
Just for kicks, what are the most common words in these novels of H.G. Wells?
Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a bit of a different style. Let’s get Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, and Agnes Grey.
What are the most common words in these novels of the Brontë sisters?
Well, Jane Austen is not going around talking about people’s HEARTS this much; I can tell you that right now. Those Brontë sisters, SO DRAMATIC. Interesting that “time” and “door” are in the top 10 for both H.G. Wells and the Brontë sisters. “Door”?!
Anyway, let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells.
I’m using str_extract
here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (you know, like italics). The tokenizer treated these as words but I don’t want to count “_any_” separately from “any”. Now let’s plot.
Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “lady”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “mind”, “brother” at the high frequency end). Words that are far from the line are words that are found more in one set of texts than another. For example, in the Austen-Brontë plot, words like “elizabeth”, “emma”, “captain”, and “bath” (all proper nouns) are found in Austen’s texts but not much in the Brontë texts, while words like “arthur”, “dark”, “dog”, and “doctor” are found in the Brontë texts but not the Austen texts. In comparing H.G. Wells with Jane Austen, Wells uses words like “beast”, “guns”, “brute”, and “animal” that Austen does not, while Austen uses words like “family”, “friend”, “letter”, and “agreeable” that Wells does not.
Overall, notice that the words in the Austen-Brontë plot are closer to the zero-slope line than in the Austen-Wells plot and also extend to lower frequencies; Austen and the Brontë sisters use more similar words than Austen and H.G. Wells. Also, you might notice the percent frequencies for individual words are different in one plot when compared to another because of the inner join; not all the words are found in all three sets of texts so the percent frequency is a different quantity.
Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?
The relationship between the word frequencies is different between these sets of texts, as it appears in the plots.
The End
There is another whole set of functions in tidytext for converting to and from document-term matrices. Many existing text mining data sets are in document-term matrices, or you might want such a matrix for a specific machine learning application. The tidytext package has tidy
functions for objects from the tm and quanteda packages so you can convert back and forth. (For more on the tidy
verb, see the broom package). This allows, for example, a workflow with easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications. For examples of working with objects from other text mining packages using tidy data principles, see the tidytext vignette on converting to and from document-term matrices.
Many thanks to rOpenSci for hosting the unconference where we started work on the tidytext package, and to Gabriela de Queiroz, who contributed to the package while we were at the unconference. I am super happy to have collaborated with Dave; it has been a delightful experience. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.