Statistics Sunday: Creating Wordclouds
library(quanteda) #install with install.packages("quanteda") if needed
data(data_corpus_inaugural) #built-in corpus of US Presidential Inaugural Addresses
speeches <- data_corpus_inaugural$documents #document-level data frame (quanteda v1.x)
row.names(speeches) <- NULL #drop the document-name row labels
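One caveat: in quanteda version 2 and later, the $documents slot was removed, so the extraction line above will return NULL. As a sketch for newer installations, convert() produces an equivalent data frame (there the text column is named "text" rather than "texts", so adjust unnest_tokens below accordingly):

speeches <- convert(data_corpus_inaugural, to = "data.frame") #quanteda >= 2.0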
As you can see, this dataset stores each Inaugural Address in a column called "texts," with Year and President as additional variables. To analyze the words in the speeches and generate a wordcloud, we'll want to unnest the words in the texts column.
library(tidytext) #install with install.packages("tidytext") if needed
library(tidyverse)
speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>% #one row per word per speech
  anti_join(stop_words) #drop common function words
## Joining, by = "word"
For our first wordcloud, let's see which words are most common across all speeches.
library(wordcloud) #install with install.packages("wordcloud") if needed
speeches_tidy %>%
  count(word, sort = TRUE) %>% #word frequencies across all speeches
  with(wordcloud(word, n, max.words = 50)) #plot the 50 most frequent words
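By default, the cloud is black on white and word placement is random. As an optional tweak (a sketch using the RColorBrewer package, which installs as a dependency of wordcloud), you can color the words by frequency and place the most common ones in the center:

library(RColorBrewer) #installed as a dependency of wordcloud
speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50,
                 random.order = FALSE, #most frequent words plotted first, in the center
                 colors = brewer.pal(8, "Dark2"))) #colored from least to most frequent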
We could just as easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function; we'll also need the reshape2 library.
library(reshape2) #install with install.packages("reshape2") if needed
obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>% #both of Obama's addresses
  count(word, sort = TRUE)
obama_words %>%
  inner_join(get_sentiments("nrc") %>% #NRC sentiment lexicon, via tidytext
               filter(sentiment %in% c("positive", "negative"))) %>%
  filter(n > 1) %>% #keep words used more than once
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% #words as rows, sentiments as columns
  comparison.cloud(colors = c("red", "blue")) #negative in red, positive in blue
## Joining, by = "word"
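The same comparison.cloud trick works for contrasting speakers instead of sentiments. As a quick sketch, assuming we pick two Presidents from the corpus's President variable who each contribute two addresses, we can cast words against President rather than sentiment:

president_words <- speeches_tidy %>%
  filter(President %in% c("Reagan", "Obama")) %>% #two Presidents, two addresses each
  count(President, word) %>%
  acast(word ~ President, value.var = "n", fill = 0) #words as rows, one column per President
comparison.cloud(president_words, colors = c("blue", "red"), max.words = 75)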

