Text analysis made too easy with the tm package

December 15, 2012

(This article was first published on is.R(), and kindly contributed to R-bloggers)

Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.

There are actually quite a few steps in this process, though the tm vignette makes them easier to follow. Before running the Gist, you would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer: this code has lots of dependencies.

Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?).
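For readers who want to see the idea without the R dependencies, the steps described above (paragraphs, then a document-term matrix, then a raw word count) can be sketched in plain Python. This is not the Gist's code and does not use the tm package: the sample paragraphs are made-up stand-ins for the transcript, and the helper names (`tokenize`, `document_term_matrix`, `term_totals`) are hypothetical. It also skips steps such as stop-word removal and stemming that a fuller pipeline would likely include.

```python
import re
from collections import Counter

# Hypothetical stand-in text; the real input would be the debate transcript,
# split into paragraphs.
paragraphs = [
    "Jobs and the economy matter to the middle class.",
    "The economy needs jobs, and jobs need small business.",
    "Taxes on small business affect the economy.",
]


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(docs):
    """Return (terms, rows): the sorted vocabulary and one count row per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows


def term_totals(terms, rows):
    """Sum each column of the matrix to get corpus-wide raw word counts."""
    return {t: sum(col) for t, col in zip(terms, zip(*rows))}


terms, rows = document_term_matrix(paragraphs)
totals = term_totals(terms, rows)

# Print the five most frequent terms, breaking ties alphabetically.
for term, n in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:5]:
    print(term, n)
```

Each row of `rows` corresponds to one paragraph and each column to one term in `terms`, so summing down the columns gives the raw word count. Words like "the" dominate such a count, which is one reason a raw frequency table is only a starting point.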
