Text analysis made too easy with the tm package
Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.
There are actually quite a few steps in this process, though it is made easier with reference to the tm vignette, but you would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer: this code has lots of dependencies.
Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?)
To leave a comment
for the author, please follow the link and comment on his blog: is.R()
offers daily e-mail updates
news and tutorials
on topics such as: visualization (ggplot2
), programming (RStudio
, Web Scraping
) statistics (regression
, time series
) and more...
If you got this far, why not subscribe for updates
from the site? Choose your flavor: e-mail
, or facebook