Using wordcloud on search terms & phrases

March 28, 2012

(This article was first published on The Schmitt-R, and kindly contributed to R-bloggers)

The wordcloud package for R is great, but all the examples I found used the tm package to process a large amount of textual data (web pages, text files, Google Docs, etc.).

But what if you have normalized data, where you already have each word and its frequency? Or what if you want whole phrases in a wordcloud? One example is the terms that users have entered into a web search.

I happen to pull the data from a source via PHP and then write it out in CSV format, in descending order of frequency.

The relevant part of the PHP script (after populating the array $terms):

$cwd = getcwd();
$local_path = $cwd.'/csv/';
$filename = $local_path.'searchterms.csv';
$fp = fopen($filename, 'w');
fputcsv($fp, array('term','freq')); // header row
arsort($terms); // sort by value (frequency), descending, keeping the keys
$max_terms = 100;
// $min_freq is assumed to be set earlier in the script
$i = 0;
foreach ($terms as $q => $v) {
    $i++;
    if ($v > $min_freq) fputcsv($fp, array($q,$v));
    if ($i >= $max_terms) break; // stop after $max_terms entries
}
fclose($fp);

Here is the sample data:

term,freq
"target black friday",8239
"walmart layaway",6502
"america idol",1777
"american idol episodes",1741
"mexican train domino game",1585
"jc penny outlet store",1159
"the chicago code",1130
...
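
To check how a file like this parses in R without running the PHP step, a minimal sketch can read a few of the rows above from an inline string. The variable name csv_text is made up for this illustration; the point is only that each quoted phrase stays together as a single term.

```r
# Illustrative only: the first three sample rows, inlined as a string.
csv_text <- 'term,freq
"target black friday",8239
"walmart layaway",6502
"america idol",1777'

# Same column classes as the script that reads the real file:
# terms stay character strings, frequencies become numeric.
datain <- read.csv(text = csv_text, colClasses = c("character", "numeric"))

# Each phrase is one element of datain$term, not split into words.
str(datain)
```

Because the phrases are already whole strings with their own counts, they can be handed to wordcloud directly, with no tm preprocessing.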

The R script:

library(wordcloud)
library(RColorBrewer)
# Read the term/frequency pairs; terms stay character strings, not factors
datain <- read.csv("csv/searchterms.csv", colClasses=c("character", "numeric"))
pal2 <- brewer.pal(8, "Dark2")
# Use a large canvas so that long phrases have room to fit
png("wordcloud.png", width=1000, height=1000)
wordcloud(datain$term, datain$freq, scale=c(8,.4), min.freq=1, max.words=Inf,
          random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()

One consideration: if a search phrase is too long to fit, wordcloud produces a warning and omits the phrase from the resulting image, so you need to compensate by increasing the image dimensions. It may be possible to scale the image dynamically based on the string length of the highest-frequency result.
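
As a rough sketch of that idea, the helper below picks a square canvas size from the character count of the highest-frequency term. The function name canvas_size and the constants (60 pixels per character, a 1000 to 4000 pixel range) are assumptions made for illustration, not tuned values, and would need adjusting against the scale argument actually used.

```r
# Hypothetical helper: choose a square canvas size (in pixels) from the
# length of the highest-frequency term. All constants are illustrative guesses.
canvas_size <- function(terms, freqs, px_per_char = 60,
                        min_px = 1000, max_px = 4000) {
  stopifnot(length(terms) == length(freqs), length(terms) > 0)
  top_term <- terms[which.max(freqs)]    # the term drawn in the largest font
  size <- nchar(top_term) * px_per_char  # widen the canvas for longer phrases
  max(min_px, min(max_px, size))         # clamp to a sensible range
}

terms <- c("target black friday", "walmart layaway", "america idol")
freqs <- c(8239, 6502, 1777)
side <- canvas_size(terms, freqs)

# The result would then replace the fixed dimensions, for example:
# png("wordcloud.png", width = side, height = side)
```

This only looks at the single largest term; a lower-frequency but much longer phrase could still be dropped, so checking the warnings after plotting remains worthwhile.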

Here is the resulting wordcloud:
