I like word clouds because they are visually appealing and pack a lot of information into a small space. Ever since I saw Drew Conway’s post (LINK) I have been looking for ways to improve word clouds. One of the nice features of Drew’s post was that he colored the words according to a gradient. Unfortunately, Drew’s cloud lacks some of the aesthetic wow factor that Ian Fellows’s wordcloud package is known for.

This post shows how to color words with a gradient based on how heavily each of two individuals uses them. Writing it will also help me learn the following things:

1. How to use knitr + markdown to make a blog post (I’ve been using knitr for reproducible LaTeX/beamer reports).
2. How to use gradients in base (i.e. outside of ggplot2, which I’ve come to depend on).
3. How to make a gradient color bar in base.

First you’ll need some packages to get started. I’m using my own beta package qdap plus Fellows’s wordcloud package. If you install qdap, wordcloud is installed along with it. For the legend we’ll be using the plotrix package.

library(qdap)
library(wordcloud)
library(plotrix)


Now we’ll need some data. I happen to have presidential debate data (debate # 1) left over that we can still mine.

# download transcript of the debate to working directory
url_dl(pres.deb1.docx)

# read the transcript into a data frame with read.transcript
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)

left.just(htruncdf(dat1, 10, 45))

## Setting Up the Data

1. Make a word frequency matrix
2. Remove Lehrer’s words
3. Scale the word usage
4. Create a binned fill variable

word.freq <- with(dat1, wfdf(dialogue, person))[, -2]
csums <- colSums(word.freq[, -1])
conv.fact <- csums[2]/csums[1]
word.freq$ROMNEY2 <- word.freq[, "ROMNEY"] * conv.fact
#colSums(word.freq[, -1])
word.freq[, "total"] <- rowSums(word.freq[, -1])
word.freq$continum <- with(word.freq, ROMNEY2-OBAMA)
word.freq <- word.freq[word.freq$total != 0,] #remove Lehrer-only words
MAX <- max(word.freq$continum[!is.infinite(word.freq$continum)])
word.freq$continum <- ifelse(is.infinite(word.freq$continum), MAX, word.freq$continum)
min.freq = 1, ordered.colors = TRUE, random.order = FALSE, rot.per=0,
scale = c(5, .7))
COLS <- colfunc(length(levels(word.freq$fill.var)))
color.legend(.025, .025, .25, .04, qcv(Romney,Obama), COLS)
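
The binning and gradient steps are not fully shown in the code above, so here is a minimal sketch of one way they could be done in base R. The toy data, the break points, and the direction of the gradient are assumptions made for illustration, not the values used for the debate cloud; `colfunc`, `COLS`, and `fill.var` simply mirror the names used above.

```r
# Toy data: made-up words and usage differences (positive = more Romney,
# negative = more Obama). These are NOT values from the debate transcript.
toy <- data.frame(
  Words     = c("economy", "jobs", "tax", "health", "energy", "teachers"),
  continuum = c(-6.2, 4.8, 0.1, -2.4, 2.9, -0.7),
  stringsAsFactors = FALSE
)

# Bin the continuous difference into an ordered set of intervals.
# The break points are arbitrary and chosen only for this example.
breaks <- c(-Inf, -5, -2, -0.5, 0.5, 2, 5, Inf)
toy$fill.var <- cut(toy$continuum, breaks = breaks)

# Base R gradient: colorRampPalette() returns a function that
# interpolates n colors between the given endpoints.
colfunc <- colorRampPalette(c("blue", "red"))
COLS <- colfunc(nlevels(toy$fill.var))

# Recode each bin to its color by indexing with the factor's integer codes.
toy$colors <- COLS[as.integer(toy$fill.var)]

print(toy)
```

A per-word color vector like `toy$colors` is the kind of input wordcloud expects in its `colors` argument when `ordered.colors = TRUE`, since in that mode each word takes the color at the same position.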


Note: If you plot to the console graphics device you can’t get a large enough size to plot all the words comfortably. I achieved the above results by plotting externally to a png at 1000 x 1000 pixels (w x h).
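
Plotting externally can be sketched with base R’s `png()` device. The file name and the placeholder plot below are assumptions for illustration; in practice the word cloud and legend calls would go between `png()` and `dev.off()`.

```r
# Write the plot to a PNG file instead of the console graphics device.
# The file name is a placeholder; a temporary directory is used here.
out <- file.path(tempdir(), "gradient_cloud.png")

png(out, width = 1000, height = 1000)  # size in pixels (w x h)

# Any plotting code goes here; a simple base plot stands in
# for the word cloud and its legend.
plot(1:10, main = "placeholder for the word cloud")

dev.off()  # close the device so the file is actually written

file.exists(out)
```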

## Concluding Thoughts

Alright, this is my first knitr-generated blog post. It was very easy, and I regret not having tried it earlier.

I accomplished my goal of making a gradient word cloud and a gradient legend. The actual word cloud really isn’t that informative because there are too many words and too little variation in word choice and colors. In some situations this approach may be useful, but in this one I don’t like it. Secondly, I used the blue-to-red theme because it plays to the political parties, but in this visualization more strongly contrasting colors would be more appropriate. Overall I don’t feel I was successful in presenting information better than Drew Conway’s post.

## What the Reader Can Take Away from the Post

1. Using wordcloud’s user-defined color feature
2. Using qdap’s lookup to recode
3. Creating gradients in base (easy)
4. Creating the accompanying gradient legend
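
For the last item, a gradient bar can also be drawn with base graphics alone. This is only a rough sketch and not the plotrix approach used in the post: the red-to-blue direction, the number of color steps, and the label placement are assumptions, while the bar coordinates echo the `color.legend()` call.

```r
# Gradient colors for the bar (direction is an assumption: red on the
# Romney side, blue on the Obama side)
COLS <- colorRampPalette(c("red", "blue"))(20)
n <- length(COLS)

# Empty canvas in unit coordinates; the cloud would normally be drawn first
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))

# Horizontal bar built from n adjacent rectangles, one per color
xl <- 0.025; yb <- 0.025; xr <- 0.25; yt <- 0.04
edges <- seq(xl, xr, length.out = n + 1)
rect(edges[-(n + 1)], yb, edges[-1], yt, col = COLS, border = NA)

# End labels just above the bar
text(xl, yt, "Romney", pos = 3, cex = 0.8)
text(xr, yt, "Obama", pos = 3, cex = 0.8)
```

Using more color steps makes the bar look smoother, at the cost of drawing more rectangles.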

If the reader has improvements to the scaling, visualization parameters, etc., please share these and other comments below.