Statistics Sunday: Creating Wordclouds

Posted on June 10, 2018 in R bloggers

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]

Cloudy with a Chance of Words

Lots of fun projects in the works, so today’s post will be short: a demonstration of how to create wordclouds, both with and without sentiment analysis results. While I could use song lyrics again, I decided to use a different dataset that comes with the quanteda package: all 58 Inaugural Addresses, from Washington’s first speech in 1789 to Trump’s in 2017.

library(quanteda) #install with install.packages("quanteda") if needed

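data(data_corpus_inaugural)
# Written against quanteda 1.x. In quanteda 2.0 and later the corpus internals
# changed; convert(data_corpus_inaugural, to = "data.frame") is the documented
# route there, and it names the text column "text" rather than "texts".
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL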

This dataset stores each Inaugural Address in a column called "texts," with the year and the President's name as additional variables. To analyze the words in the speeches and generate a wordcloud, we'll want to unnest the words in the texts column, so that each row holds a single word.

library(tidytext)
library(tidyverse)

speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>%
  anti_join(stop_words)

## Joining, by = "word"
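
To make the unnesting step concrete, here is a minimal sketch on a toy data frame. The toy object and its two sentences are made up for illustration; only unnest_tokens(), anti_join(), and the stop_words dataset are the same pieces used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row stand-in for the speeches data frame (invented text)
toy <- data.frame(
  President = c("A", "B"),
  texts = c("We the people", "The people and the government"),
  stringsAsFactors = FALSE
)

toy_tidy <- toy %>%
  unnest_tokens(word, texts) %>%       # one lowercased word per row; texts is dropped
  anti_join(stop_words, by = "word")   # remove common words such as "the" and "and"

toy_tidy
```

Each remaining row keeps its President value, which is what makes it possible to filter or count words by speaker later on.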

For our first wordcloud, let's see what the most common words are across all speeches.

library(wordcloud) #install.packages("wordcloud") if needed

speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))

While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of "president" and likely the "duty" that role entails. The word "peace" is only slightly larger than "war," and one could probably map out which speeches were given during wartime and which weren't.
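
One rough way to start on that war/peace mapping is to count both words per speech. The sketch below uses a tiny made-up data frame in place of speeches_tidy; the Year column name is an assumption about how the corpus labels its speeches, and the counts are invented.

```r
library(dplyr)

# Hypothetical stand-in for speeches_tidy: one row per word, invented values
toy_tidy <- data.frame(
  Year = c(1801, 1801, 1801, 1865, 1865, 1865, 1865),
  word = c("peace", "peace", "war", "war", "war", "war", "peace"),
  stringsAsFactors = FALSE
)

war_peace <- toy_tidy %>%
  filter(word %in% c("war", "peace")) %>%  # keep only the two words of interest
  count(Year, word) %>%                    # occurrences of each word per speech
  arrange(Year, desc(n))                   # more frequent word listed first per year

war_peace
```

Swapping speeches_tidy in for toy_tidy should give the same table for the real addresses, provided the year variable is indeed called Year.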

We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function from the wordcloud package; we'll also need the reshape2 package to reshape the word counts into a matrix.

library(reshape2) #install.packages("reshape2") if needed

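# Word counts across Obama's two inaugural addresses
obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)

# Keep only words the NRC lexicon tags as positive or negative
# (newer tidytext versions may ask to download NRC via the textdata package)
obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  filter(n > 1) %>%                                       # drop words used only once
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  # word-by-sentiment matrix
  comparison.cloud(colors = c("red","blue"))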

## Joining, by = "word"
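
The acast step is the least obvious part of that pipeline, so here is a small sketch of what it does. The words, sentiment labels, and counts below are invented; they stand in for the result of joining obama_words to the NRC lexicon.

```r
library(reshape2)

# Hypothetical word counts already joined to a sentiment label (invented values)
toy_counts <- data.frame(
  word = c("hope", "fear", "freedom", "crisis"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Rows become words and columns become sentiments;
# fill = 0 is used where a word does not carry that sentiment
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

toy_matrix
```

comparison.cloud expects exactly this shape: one row per word and one column per group. Since the columns come out as negative then positive, the colors c("red","blue") should map to negative and positive words respectively.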