Building Wordclouds in R

August 28, 2015
By

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In this article, I will show you how to use text data to build word clouds in R. We will use a dataset containing around 200k Jeopardy questions. The dataset can be downloaded here (thanks to reddit user trexmatt for providing the dataset).

We will require three packages for this: tm, SnowballC, and wordcloud.

First, let’s load the required libraries and read in the data.

library(tm)
library(SnowballC)
library(wordcloud)

jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)

The actual questions are available in the Question column.

Now, we will perform a series of operations on the text data to simplify it.
First, we need to create a corpus.

jeopCorpus <- Corpus(VectorSource(jeopQ$Question))

Next, we will convert the corpus to a plain text document.

jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)

Then, we will remove all punctuation and stopwords. Stopwords are commonly used words in the English language such as I, me, my, etc. You can see the full list of stopwords using stopwords('english').

jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))

Next, we will perform stemming. This means that all the words are converted to their stem (Ex: learning -> learn, walked -> walk, etc.). This will ensure that different forms of the word are converted to the same form and plotted only once in the wordcloud.

jeopCorpus <- tm_map(jeopCorpus, stemDocument)

Now, we will plot the wordcloud.

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

This will produce the following wordcloud:
Screen Shot 2015-08-28 at 11.58.29 PM

If you want to remove the words ‘the’ and ‘this’, you can include them in the removeWords function as follows:

jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))

There are a few ways to customize it.

  • scale: This is used to indicate the range of sizes of the words.
  • max.words and min.freq: These parameters are used to limit the number of words plotted. max.words will plot the specified number of words and discard least frequent terms, whereas, min.freq will discard all terms whose frequency is below the specified value.
  • random.order: By setting this to FALSE, we make it so that the words with the highest frequency are plotted first. If we don’t set this, it will plot the words in a random order, and the highest frequency words may not necessarily appear in the center.
  • rot.per: This value determines the fraction of words that are plotted vertically.
  • colors: The default value is black. If you want to use different colors based on frequency, you can specify a vector of colors, or use one of the pre-defined color palettes. You can find a list here.

That brings us to the end of this article. I hope you enjoyed it! As always, if you have questions, feel free to leave a comment or reach out to me on Twitter.

Note: I learnt this technique in The Analytics Edge course offered by MIT on edX. It is a great course and I highly recommend that you take it if you are interested in Data Science!

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)