[This article was first published on R – Bora Beran, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Presidential election campaigns are heating up and you all know what that means.

Word clouds

I still remember visualizations summarizing Romney-Obama debates in 2012 but one thing that slightly bothered me back then was that almost everything I saw were just counting single words so even the one of the most memorable phrases in Obama’s campaign such as “Four more years” were lost.

Surely, there will be some interesting phrases candidates will use this year. How can you make sure that your analysis doesn’t miss them?

I am planning on doing a series of posts on text analytics but tokenization is an important component in text analysis so let’s start with answering this question first.

This analysis relies on two R packages.

library(tm)
library(RWeka)

I wrote a number of helper functions to make analysis steps easier to follow.

# Helper functions
removeStopWords <- function(string, words) {
stopifnot(is.character(string), is.character(words))
splt <- strsplit(string, " ", fixed = TRUE)
vapply(splt, function(x) paste(x[!tolower(x) %in% words], collapse = " "),character(1))
}

countWords <- function(y) { sapply(gregexpr(" ", y), function(x) { 1+sum(x>=0) } ) }

# Find N-grams within certain edit distance to avoid multiple subsets of same phrase
setToClosestMatch<-function(text) {sapply(seq_along(names(text)),function(x){
ll <- agrep(pattern=names(text)[x],
names(text)[-x],
value=T,max=list(ins=1,del=1,sub=1))
ifelse(!is.na(ll[which.max(nchar(names(text)))]),ll[which.max(nchar(names(text)))],names(text)[x])})}

# To remove an, the, is etc. after finding n-grams
reduceToRealNGrams <- function(ngrams,n){
ngrams[countWords(removeStopWords(ngrams, append(tm::stopwords("en"),c('going','can','will'))))>=n]}

# Tokenize 2 to 4 grams
NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4,delimiters = " \\r\\n\\t.,;:\"()?!"))


Let’s start the analysis. I had one file for each candidate, containing transcripts for 5 speeches they gave between June 2015 and April 2016.

# Load the documents from specified folder
docs <- Corpus(DirSource("C:/Users/bberan/Downloads/Transcripts/"))
# Convert to lower case
docs <- tm_map(docs, tolower)
# Remove common words : locations, candidates’ references to each other etc.
docs <- tm_map(docs, removeWords, c("des moines","new york","new hampshire","south carolina","united states","twin cities","san bernardino","hillary clinton","donald trump"))


This is an optional step that allows you to find phrases that differentiate presidential candidates by putting small weights on phrases used by both candidates.

# Create term document matrix with tf-idf weighting
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTokenizer,weighting = weightTfIdf))

m <- as.matrix(tdm)

# separate Hillary and Trump content and sort by frequency
hillary <- sort(m[,1], decreasing=TRUE)
trump <- sort(m[,2], decreasing=TRUE)

# Get top 250 most common N-grams for each candidate
hillaryTopN<-hillary[order(-nchar(names(hillary[1:250])), hillary[1:250])]
trumpTopN<-trump[order(-nchar(names(trump[1:250])), trump[1:250])]



Since we are looking for 2 to 4 grams, R will find “Make America”, “Make America great”, “Make America great again” as separate n-grams. This step consolidates all of them to Make America great again. It also gets rid of N-grams if after removal of stop words e.g. (a, and, the, or..) they become smaller than a 2-gram. For example a 3-gram like “this or that” would be dropped a part of this step. Removing stop words this late makes sure our phrase structure is not broken e.g. “to keep us safe” does not become “keep safe”.

# get rid of junk and overlapping n-grams
hillaryTopNConsolidated<- reduceToRealNGrams(unique(cbind(hillaryTopN,setToClosestMatch(hillaryTopN))[,2]),2)
trumpTopNConsolidated<- reduceToRealNGrams(unique(cbind(trumpTopN,setToClosestMatch(trumpTopN))[,2]),2)



Now we can write this data to CSV and read in Tableau to build a word cloud like this.

It gives a much better idea than individual words but can we build something that gives even more context?

Of course by that I don’t mean a bar chart version of this. Despite all the criticism in the visualization community, word clouds do a decent job of conveying the information contained in a table with words and their respective counts for a number of reasons:

• Counts in word clouds are not exact or reliable quantities. A speaker can repeat a word several times in a row trying to remember what to say next or when they are interrupted by the audience. Even if that doesn’t happen, whether someone said something 10 vs. 11 times hardly tells you anything meaningful. So when reading a word cloud, what people look for is whether things are roughly the same or vastly different. To use an analogy, for a thermometer reading, there is no reason to display the temperature to 6th decimal place if thermometer is only accurate to one decimal place. 6 decimal places give a false sense of accuracy and takes up unnecessary space. Large numbers with one decimal place is much more useful in conveying the information.
• Applying transformations like TF-IDF can change the word count values by orders of magnitude which makes accuracy of bar chart even less useful.
• If corpus is large enough, word frequencies follow a power law pattern. Power law distributions apply to many aspects of human systems. The best known example is economist Vilfredo Pareto’s observation  that wealth follows a “predictable imbalance”, with 20% of the population holding 80% of the wealth.The linguist George Zipf observed that word frequency also falls in a power law pattern, with a small number of high frequency words, a moderate number of common words and a very large number of low frequency words. Later Jacob Nielsen observed power law distributions also in web site page views which is why word clouds often work well highlighting the popular content on news aggregators or forums.

Maybe a good way of providing more context is finding the sentences that contains these common phrases that distinguish one candidate from another.

First I broke the text into sentences in R.

library(openNLP)
library(NLP)
library(tm)
splitToSentences <- function(text) {
sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
text <- as.String(text)
sentence.boundaries <- annotate(text, sentence_token_annotator)
sentences <- text[sentence.boundaries]
return(sentences)
}


Docs[[2]] in my script holds Trump’s speeches so to get the sentences for it:

splitToSentences(docs[[2]])

Then I wrote the results to CSV, imported the CSV into Tableau and wrote some custom SQL to do a join between sentences and n-grams using CONTAINS as join criteria.

A little formatting to give it newspaper styling and here is the final result.

You can click on the image to see it live on Tableau Public and download the workbook if you like. Selecting topics from the right will bring up all the sentences containing that phrase.

To leave a comment for the author, please follow the link and comment on their blog: R – Bora Beran.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)