Painting a picture of statistical packages

April 4, 2011
(This article was first published on eKonometrics, and kindly contributed to R-bloggers)

Imagine you have to analyze a text of 18,000 words. You must identify the most commonly cited ideas or words in the text and then present the analysis graphically. Sophisticated tools exist for this task, but the deadline is tight: you have fewer than five minutes to get it done.

Generating a word cloud from the text is one option. It is fast, and the output is both appealing and informative. See the word cloud below, which I generated from the descriptions of the 2,948 R packages listed at http://cran.r-project.org/web/packages. The one-line descriptions of these packages run to more than 18,000 words. Using the free word cloud tool Wordle (http://www.wordle.net/), I completed the task in less than two minutes.

[Image: word cloud of the R package descriptions]

Based on the cloud, we can see that the most frequently recurring themes in R packages are data, functions, models, estimation, regression, and Bayesian.
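A word cloud sizes each word by how often it occurs, so the ranking read off the cloud can also be approximated with a plain frequency count. The following is a minimal sketch in Python, not the method Wordle uses internally; the function name `word_frequencies` and the sample sentence are made up for illustration and are not the actual CRAN descriptions.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word appears in the text (case-sensitive)."""
    # Assumption: a "word" is any run of ASCII letters; punctuation is dropped.
    # Common words such as "for" are not filtered out in this sketch.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(words)

# Hypothetical sample text, standing in for the package descriptions
sample = (
    "Tools for data analysis. Bayesian regression models for data. "
    "Functions for estimation."
)

freq = word_frequencies(sample)
print(freq.most_common(2))  # [('for', 3), ('data', 2)]
```

On the real text, the input would be the 18,000-plus words of package descriptions rather than the sample string.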

Wordle offers some control over the output. The cloud above was generated from the 150 most common words in the text. I removed ‘Analysis’ from the text because it was the most frequently repeated word. Later, I restricted the cloud to the 100 most repeated words, lifted the restriction on the word ‘Analysis’, and randomly generated a new word cloud. See the output below.

[Image: word cloud of the 100 most repeated words, including ‘Analysis’]
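The two settings used for these clouds, a cap on the number of words and an optional excluded word, amount to a filter on the frequency table. A rough sketch in Python follows; the helper `top_words` and the counts are hypothetical and do not reflect the real figures from the package descriptions.

```python
from collections import Counter

def top_words(counts, limit, exclude=()):
    """Return the `limit` most frequent words, skipping any in `exclude`."""
    # Compare in lower case so 'Analysis' and 'analysis' are both excluded
    excluded = {word.lower() for word in exclude}
    kept = Counter({w: n for w, n in counts.items() if w.lower() not in excluded})
    return kept.most_common(limit)

# Hypothetical counts, for illustration only
counts = Counter({
    "Analysis": 40, "data": 35, "models": 20, "regression": 12, "Bayesian": 9,
})

# First variant: drop 'Analysis', keep a larger number of words
print(top_words(counts, limit=3, exclude=["Analysis"]))
# [('data', 35), ('models', 20), ('regression', 12)]

# Second variant: keep 'Analysis', but show fewer words
print(top_words(counts, limit=2))
# [('Analysis', 40), ('data', 35)]
```

In the article's setting, `limit` would be 150 for the first cloud and 100 for the second.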

Notice the two variants of the word ‘data’ in the cloud. Wordle lets the user remove any word from the generated cloud with a click of the mouse and keep the cleaned version of the cloud.
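Duplicate entries like this can also be merged before the cloud is drawn instead of being removed afterwards. The sketch below assumes the two variants differ only in capitalization (for example ‘data’ and ‘Data’), which the cloud itself does not confirm; the function name `merge_case_variants` and the counts are hypothetical.

```python
from collections import Counter

def merge_case_variants(counts):
    """Combine counts of words that differ only in capitalization."""
    merged = Counter()
    for word, n in counts.items():
        # Assumption: lower-casing is enough to identify variants
        merged[word.lower()] += n
    return merged

# Hypothetical counts, for illustration only
counts = Counter({"data": 30, "Data": 12, "models": 20})

merged = merge_case_variants(counts)
print(merged["data"])    # 42
print(merged["models"])  # 20
```

If the variants instead differ in spelling or number (say, a singular and a plural form), lower-casing alone would not merge them, and a stemming step would be needed.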

Also, don’t miss Drew Conway’s blog post on building more intelligent word clouds at http://www.drewconway.com/zia/?p=2624.
