Statistics Sunday: Using Text Analysis to Become a Better Writer

August 19, 2018

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

We all have words we love to use, and perhaps use too much. For example, I tend to lean on the same transitional words, to the point that, before I submit a manuscript, I run a find-all to see how many times I’ve used some of my favorites, e.g., additionally, though, and so on.

I’m sure we all have our own words we use way too often.

Text analysis can be used to discover patterns in writing and, for a writer, may help reveal when we depend too much on certain words and phrases. For today’s demonstration, I read in my (still in-progress) novel – a murder mystery called Killing Mr. Johnson – and did the same type of text analysis I’ve been demonstrating in recent posts.

To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.

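# Point R at the folder that holds the manuscript text file
setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)

# Read the novel, one element per line of the file
KMJ_text <- read_lines('KMJ_full.txt')

# One row per line, with a line number for later reference
KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())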

I kept my line numbers, which I could use in some future analysis. For now, I’m going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)

# One row per word, minus common stop words
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)
## Joining, by = "word"

# Plot every word used more than 75 times
KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
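
If you'd like to try the tokenize-and-count steps without a manuscript on hand, here is a minimal sketch on a few made-up lines. The sample text and the names sample_text, sample_words, and sample_counts are my own placeholders, not part of the novel or the analysis above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up lines standing in for the manuscript (hypothetical text)
sample_text <- c(
  "Emily walked quickly to the door.",
  "The door was locked, so Emily knocked.",
  "Finally, someone opened the door."
)

sample_words <- tibble(text = sample_text) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common stop words

# Count what is left, most frequent first
sample_counts <- sample_words %>%
  count(word, sort = TRUE)

sample_counts
```

Passing by = "word" to anti_join is optional; it just makes the join column explicit, so the "Joining" message isn't printed.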

Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It’s a murder mystery, so I’m not too surprised that words like “body” and “death” are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the “-ly adverbs.”

# ".ly" matches any character followed by "ly", anywhere in the word
ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, her name was picked up by str_detect. A few other words that aren’t actually -ly adverbs also show up in the list. I’ll filter those out, then take a look at what I have left.

# Words containing "ly" that aren't -ly adverbs
filter_out <- c("emily", "emily's", "emily’s", "family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)

# Plot the -ly words used more than 10 times
ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
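
One possible refinement, which I did not use above: anchoring the pattern to the end of the word. The pattern ".ly" matches "ly" anywhere in a word, so possessives like "emily's" sneak in; "ly$" only matches words that end in "ly". The sketch below runs on a hypothetical vector of words (candidates) rather than the novel, and it still needs a manual exclusion list, because words like "family" and "holy" end in "ly" without being adverbs.

```r
library(stringr)

# Hypothetical words, including the false positives mentioned above
candidates <- c("emily", "emily's", "family", "reply", "holy",
                "finally", "quickly", "suddenly", "quietly")

# ".ly" matches any character followed by "ly", anywhere in the word
loose <- candidates[str_detect(candidates, ".ly")]

# "ly$" only matches words that end in "ly"
anchored <- candidates[str_detect(candidates, "ly$")]

# Words that end in "ly" but aren't adverbs still need a manual list
not_adverbs <- c("emily", "family", "reply", "holy")
ly_adverbs <- setdiff(anchored, not_adverbs)

ly_adverbs
```

The anchored pattern shortens the exclusion list but doesn't replace it; telling adverbs apart from everything else that ends in "ly" would take part-of-speech tagging, which is beyond this sketch.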

I use “finally”, “quickly”, and “suddenly” far too often. “Quietly” is also up there. I think the reason so many writers hate on adverbs is that they can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone “move quickly” or did they do something else – run, sprint, dash?
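
To see which words those adverbs are propping up, one option is to tokenize into two-word sequences (bigrams) and keep the pairs that end in an adverb of interest. This is a rough sketch on made-up sentences, not something I ran on the novel; sample_text, target_adverbs, and adverb_pairs are placeholder names.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(tidytext)

# Made-up sentences standing in for the manuscript (hypothetical text)
sample_text <- c(
  "She said quietly that the door was locked.",
  "He moved quickly toward the stairs.",
  "Emily said quietly that she knew."
)

target_adverbs <- c("quietly", "quickly")

adverb_pairs <- tibble(text = sample_text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # two-word sequences
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word2 %in% target_adverbs) %>%                      # pairs ending in an adverb
  count(word1, word2, sort = TRUE)

adverb_pairs
```

A pair like "said quietly" near the top of the list is a good candidate for a stronger verb. Note that bigrams are built within each row, so a pair split across two lines of the text file would be missed.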

At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, my writing might become tedious if I keep leaning on the same words, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.
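
This is one place where keeping line numbers could pay off while editing: filter the tokenized words down to the overused ones and list the lines they came from. The sketch below uses made-up lines, and the names sample_text, overused, and hits are placeholders; it assumes the same one-row-per-line setup used earlier.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up lines standing in for the manuscript (hypothetical text)
sample_text <- c(
  "Emily finally opened the letter.",
  "She read it twice.",
  "Suddenly the lights went out.",
  "Finally, the lights came back."
)

overused <- c("finally", "suddenly")

hits <- tibble(text = sample_text) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text, drop = FALSE) %>%  # keep the original line for context
  filter(word %in% overused) %>%
  select(linenumber, word, text)

hits
```

Each row of hits pairs an overused word with the line it appears on, which makes for a handy revision checklist.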

I still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!
