In learning more about text mining over the past several months, one aspect of text that I’ve been interested in is readability. A text’s readability measures how hard or easy it is for a reader to read and understand what a text is saying; it depends on how sentences are written, what words are chosen, and so forth. I first became really aware of readability scores of books through my kids’ reading tracking websites for school, but it turns out there are lots of frameworks for measuring readability.
One of the most commonly used ways to measure readability is a SMOG grade, which stands for “Simple Measure of Gobbledygook”. It may have a silly (SILLY WONDERFUL) name, but it is often considered the gold standard of readability formulas and performs well in many contexts. We calculate a SMOG score using the formula

$$\text{SMOG} = 1.0430 \sqrt{30 \times \frac{\text{number of polysyllables}}{\text{number of sentences}}} + 3.1291$$
where the number in the numerator measures the number of words with 3 or more syllables and the number in the denominator measures the number of sentences. You can see that SMOG is going to be higher for texts with a lot of words with many syllables in each sentence. These ratios are typically normalized to use a sample of 30 sentences, and then the SMOG grade is supposed to estimate the years of education needed to understand a text.
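To make the arithmetic concrete, here is a minimal sketch of the calculation as an R function. The constants 1.0430 and 3.1291 are the standard published SMOG coefficients; the function name `smog_grade` and the example counts are made up for illustration.

```r
# SMOG grade from counts of polysyllabic words and sentences.
# The factor of 30 rescales the ratio to a 30-sentence sample.
smog_grade <- function(n_polysyllables, n_sentences) {
  1.0430 * sqrt(30 * n_polysyllables / n_sentences) + 3.1291
}

# Hypothetical text: 45 words of 3+ syllables spread over 60 sentences
smog_grade(45, 60)
```

With no polysyllabic words at all, the grade bottoms out at the intercept, 3.1291.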
This seems like it is perfectly suited to an analysis using tidy data principles, so let’s use the tidytext package to compare the readability of several texts.
Getting some texts to analyze
Let’s use the gutenbergr package to obtain some book texts to compare. I want to compare:
- Anne of Green Gables by L. M. Montgomery
- Little Women by Louisa May Alcott
- Pride and Prejudice by Jane Austen (I mean, DUH)
- A Portrait of the Artist as a Young Man by James Joyce
- Les Misérables by Victor Hugo
I really wanted to throw some Ernest Hemingway in there, but none of his works are on Project Gutenberg; I guess they are not public domain.
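A sketch of how the download might look with gutenbergr. The Project Gutenberg IDs below are the ones I believe correspond to these five books, so treat them as assumptions and check them with `gutenberg_works()` before relying on them; `book_ids` and `fetch_books` are names made up for this example. The download is wrapped in a function so nothing touches the network until it is called.

```r
# Assumed Project Gutenberg IDs for the five books (verify before use)
book_ids <- c(
  "Anne of Green Gables" = 45,
  "Little Women" = 514,
  "Pride and Prejudice" = 1342,
  "A Portrait of the Artist as a Young Man" = 4217,
  "Les Misérables" = 135
)

# Download the texts, keeping the title as a column alongside each line
fetch_books <- function(ids = book_ids) {
  gutenbergr::gutenberg_download(unname(ids), meta_fields = "title")
}

# books <- fetch_books()  # requires gutenbergr and a network connection
```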
Tidying the text
Now we have our texts in hand, and we need to do some data wrangling to get them into the form that we need. We are interested in counting two things here:
- the number of sentences
- the number of words with 3 or more syllables
Let’s start by working with the sentences. The `unnest_tokens` function in tidytext has an option to tokenize by sentences, but it can have trouble with UTF-8 encoded text, lots of dialogue, etc. We need to use `iconv` first on the UTF-8 text from Project Gutenberg before trying to tokenize by sentences. Also, we have five different books in this dataframe, so we need to `map` so that we count sentences separately for each book; `unnest_tokens` will collapse all the text in a dataframe together before tokenizing by something like sentences, n-grams, etc.
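The nest-then-map pattern described above can be sketched on a toy data frame. Everything in `books` here is invented stand-in text, not the Gutenberg download. As far as I know, the default collapsing behavior of `unnest_tokens` has changed between tidytext versions, so this sketch pastes each book's lines together explicitly before tokenizing rather than relying on that default.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Toy stand-in for the downloaded books: one row per line of text
books <- tibble(
  title = c("Book A", "Book A", "Book B"),
  text = c("It was a dark night. The wind",
           "howled outside.",
           "She smiled. Then she left. Nobody spoke.")
)

tidybooks <- books %>%
  # Convert the encoding before sentence tokenizing
  mutate(text = iconv(text, to = "latin1")) %>%
  # One row per book, with that book's lines in a list-column called data
  nest(data = c(text)) %>%
  # Tokenize each book separately so sentences never span two books
  mutate(tidied = map(data, function(d) {
    tibble(text = paste(d$text, collapse = " ")) %>%
      unnest_tokens(sentence, text, token = "sentences")
  }))
```

Pasting the lines together first means a sentence that wraps across two lines of the source text is still counted once.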
It still takes me a bit of thinking and experimenting every time I need to `map`, but what a great way to do what I need! How did this work out?
The `data` column contains the original untidied text and the `tidied` column contains the tidied text, organized with each sentence on its own row; both are list-columns. Now let’s unnest this so we get rid of the list-columns and have sentences in their own rows.
How did the sentence tokenizing do?
Pretty well! Especially considering the whole thing errors out without the `iconv` conversion first.
Now we know how to count the number of sentences in each book.
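A minimal sketch of the unnest-and-count step, using a hand-built nested tibble in place of the real tokenized books. The sentences and the name `sentence_counts` are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in: one row per book, sentences held in a list-column
tidybooks <- tibble(
  title = c("Book A", "Book B"),
  tidied = list(
    tibble(sentence = c("it was a dark night.", "the wind howled outside.")),
    tibble(sentence = c("she smiled.", "then she left.", "nobody spoke."))
  )
)

sentence_counts <- tidybooks %>%
  unnest(tidied) %>%                              # one row per sentence
  group_by(title) %>%
  summarise(n_sentences = n_distinct(sentence))   # distinct sentences per book
```

Note that `n_distinct()` counts a sentence that appears verbatim more than once only a single time, which is one reason the result is an estimate.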
There we go! An estimate of the number of sentences in each book.
The next thing we need to do here is count the syllables in each word so that we can find how many words in each book have 3 or more syllables. I did a bit of background checking on how this is done, and found this implementation of syllable counting by Tyler Kendall at the University of Oregon. It is actually an implementation in R of an algorithm originally written in PHP by Greg Fast, and it seems like a standard way people do this. It is estimated to have an error rate of ~15%, and is usually off by only one syllable when it is wrong.
I’m including this function in a code chunk with `echo = FALSE` because it is really long and I didn’t write it, but you can check out the R Markdown file that made this blog post to see the details.
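For anyone who wants something runnable without digging out the full function, here is a much cruder stand-in. To be clear, this is not Kendall's or Fast's algorithm: it is a simple vowel-group heuristic that is only assumed to be good enough for illustration, and `count_syllables_simple` is a hypothetical name.

```r
# Crude syllable estimate: count runs of vowels, with one silent-e rule.
# This is NOT the algorithm used in the post, just an illustrative stand-in.
count_syllables_simple <- function(words) {
  vapply(words, function(word) {
    word <- tolower(gsub("[^A-Za-z]", "", word))
    if (nchar(word) == 0) return(0L)
    # Treat a final "e" as silent unless it follows a vowel or "l" ("table")
    if (nchar(word) > 2 && grepl("[^aeiouyl]e$", word)) {
      word <- sub("e$", "", word)
    }
    # Each run of consecutive vowels counts as one syllable
    runs <- gregexpr("[aeiouy]+", word)[[1]]
    n <- if (runs[1] == -1) 0L else length(runs)
    max(n, 1L)
  }, integer(1), USE.NAMES = FALSE)
}

count_syllables_simple(c("cat", "table", "readability", "gobbledygook"))
```

A heuristic this simple will miss plenty of cases (it strips accented letters, for one), so expect a higher error rate than the function described above.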
Let’s check out how it works!
Well, my last name is actually two syllables, but most human beings get that wrong too, so there we go.
Now let’s start counting the syllables in all the words in our books. Let’s use `unnest_tokens` again to extract all the single words from the sentences; this time we will set `drop = FALSE` so we keep the sentences for counting purposes. Let’s add a new column that will count the syllables for each word. (This takes a bit to run on my fairly speedy/new desktop; that function for counting syllables is not built for speed.)
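Here is a small sketch of that word-level step on invented sentences. The helper `vowel_groups` is a placeholder assumption that just counts runs of vowels; it stands in for the real syllable-counting function, which is not reproduced here.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the sentence-level data
sentences <- tibble(
  title = c("Book A", "Book B"),
  sentence = c("the gobbledygook was remarkable", "she smiled")
)

# Placeholder syllable estimate (hypothetical): number of vowel runs per word
vowel_groups <- function(word) {
  lengths(regmatches(word, gregexpr("[aeiouy]+", word)))
}

words <- sentences %>%
  # drop = FALSE keeps the sentence column next to each extracted word
  unnest_tokens(word, sentence, drop = FALSE) %>%
  mutate(syllables = vowel_groups(word))
```

Keeping the `sentence` column is what lets the same data frame supply both counts later: distinct sentences and polysyllabic words.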
Let’s check out the distributions of syllables for the five titles.
These distributions are pretty similar, but there are some moderate differences. Little Women and Les Misérables have the highest proportion of words with only one syllable, while Pride and Prejudice has the lowest proportion. This makes some sense, since Louisa May Alcott was writing for young readers while Jane Austen was not. Les Misérables was originally written in French and we are analyzing a translation here, so that is a complicating factor. James Joyce, with his moocows or whatever, is in the middle here.
Now we know both the number of sentences and the number of syllables in these books, so we can calculate… the gobbledygook! This will just end up being a bunch of dplyr operations.
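Those dplyr operations might look roughly like the following sketch. The `words` data frame is toy data with made-up sentence labels and syllable counts, and the column names are assumptions; the SMOG constants are the standard published ones.

```r
library(dplyr)

# Toy stand-in: one row per word, with its sentence and syllable count
words <- tibble(
  title = c(rep("Book A", 5), rep("Book B", 3)),
  sentence = c("s1", "s1", "s1", "s2", "s2", "s1", "s1", "s1"),
  syllables = c(1, 3, 4, 1, 3, 1, 2, 1)
)

results <- words %>%
  group_by(title) %>%
  summarise(
    n_sentences = n_distinct(sentence),     # sentences per book
    n_polysyllables = sum(syllables >= 3)   # words with 3 or more syllables
  ) %>%
  mutate(smog = 1.0430 * sqrt(30 * n_polysyllables / n_sentences) + 3.1291)
```

Using `sum(syllables >= 3)` rather than filtering first keeps a book with no polysyllabic words in the results instead of silently dropping it.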
L.M. Montgomery, writing here for an audience of young girls, has the lowest SMOG grade at around 9 (i.e., approximately beginning 9th grade level). Pride and Prejudice has the highest SMOG grade at 11.2, more than two years of education higher. I will say that throwing A Portrait of the Artist as a Young Man in here turned out to be an interesting choice; in reality, I find it to be practically unreadable but it has a readability score close to the same as Little Women. This measure of prose readability based only on number of sentences and number of words with lots of syllables doesn’t measure what we might expect when applied to extremely stylized text.
Let’s visualize the readability scores for these five novels.
I would like to thank Ben Heubl, a data journalist at The Economist, for interesting discussions that motivated this blog post. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!