Using wordcloud on search terms & phrases

March 28, 2012

(This article was first published on The Schmitt-R, and kindly contributed to R-bloggers)

The wordcloud package for R is great, but all the examples I found used the tm package to process a large amount of textual data (web pages, text files, Google Docs, etc.).

But what if you have normalized data, where you already have each word and its frequency? Or what if you want whole phrases in a wordcloud? One example is the terms that users have entered into a web search.

I happen to be pulling from a data source via PHP, and I write the data to a CSV file in descending order of frequency.

The relevant part of the PHP script (after populating the array $terms):

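$cwd = getcwd();
$local_path = $cwd.'/csv/';
$filename = $local_path.'searchterms.csv';
$fp = fopen($filename, 'w');
fputcsv($fp, array('term', 'freq')); // header row
arsort($terms); // sort by value, descending, keeping the term => frequency keys
$max_terms = 100;
$i = 0;
// $min_freq is assumed to be set earlier in the script
foreach ($terms as $q => $v) {
    if ($v <= $min_freq) break; // sorted descending, so nothing later qualifies
    fputcsv($fp, array($q, $v));
    $i++;
    if ($i >= $max_terms) break; // stop after $max_terms rows
}
fclose($fp);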
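
If the raw search terms are already available in R, the same count, filter, sort, and cap steps could be done without the PHP stage. The following is a minimal sketch in base R, not the method used for this post: the `queries` vector is made-up input, the threshold value is an assumption, and the file is written to a temporary directory rather than `csv/`.

```r
# Hypothetical raw input: one element per search a user ran.
queries <- c("walmart layaway", "target black friday", "walmart layaway",
             "the chicago code", "target black friday", "target black friday")

min_freq  <- 1    # assumed threshold: keep phrases seen more than this many times
max_terms <- 100  # cap on the number of rows written

counts <- table(queries)  # frequency of each distinct phrase
terms  <- data.frame(term = names(counts),
                     freq = as.vector(counts),
                     stringsAsFactors = FALSE)

terms <- terms[terms$freq > min_freq, ]                 # drop rare phrases
terms <- terms[order(terms$freq, decreasing = TRUE), ]  # descending by frequency
terms <- head(terms, max_terms)                         # keep at most max_terms rows

out_file <- file.path(tempdir(), "searchterms.csv")  # temporary path for the sketch
write.csv(terms, out_file, row.names = FALSE)
```

Because the whole phrase is the value being counted, multi-word searches stay intact as single terms.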

Here is the sample data:

term,freq
"target black friday",8239
"walmart layaway",6502
"america idol",1777
"american idol episodes",1741
"mexican train domino game",1585
"jc penny outlet store",1159
"the chicago code",1130
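
A quick way to confirm that each quoted phrase is read as one term, rather than being split into words, is to parse the sample inline. This is a minimal sketch in base R; the `csv_text` variable is only for illustration and stands in for the file on disk.

```r
# The sample rows from above, held in a string instead of a file.
csv_text <- 'term,freq
"target black friday",8239
"walmart layaway",6502
"america idol",1777
"american idol episodes",1741
"mexican train domino game",1585
"jc penny outlet store",1159
"the chicago code",1130'

# Same column classes as when reading the real CSV file.
datain <- read.csv(text = csv_text, colClasses = c("character", "numeric"))

# One row per phrase; term is character, freq is numeric.
str(datain)
```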

The R script:

library(wordcloud)
library(RColorBrewer)
# read the term/frequency pairs written by the PHP script
datain <- read.csv("csv/searchterms.csv", colClasses=c("character", "numeric"))
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud.png", width=1000, height=1000)
# each phrase is passed as a single "word", weighted by its frequency
wordcloud(datain$term, datain$freq, scale=c(8,.4), min.freq=1, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()

One consideration: if a search phrase is too long to fit, R will produce a warning and omit the phrase from the resulting wordcloud, so you need to compensate by increasing the image dimensions. It may be possible to scale the image dynamically, based on the string length of the highest-frequency result.
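
A minimal sketch of that scaling idea, in base R: estimate the pixel width of the largest phrase from its character count and size the canvas to match. The function name `estimate_canvas_px` and the constants in it (the 0.6 average glyph width ratio, the 1.25 padding factor, the 1000-pixel floor) are assumptions for illustration, not values taken from wordcloud, so expect to tune them for your font. The 12-point size and 72 ppi are the `png()` defaults, and 8 is the upper value of `scale` used in the script.

```r
# Heuristic estimate of the canvas side (in pixels) needed for the largest phrase.
# char_ratio, padding and min_px are assumed tuning values, not wordcloud settings.
estimate_canvas_px <- function(term, max_cex = 8, pointsize = 12, res = 72,
                               char_ratio = 0.6, padding = 1.25,
                               min_px = 1000) {
  # Font height in pixels: a point is 1/72 inch, res is pixels per inch.
  font_px <- max_cex * pointsize * res / 72
  # Approximate rendered width: character count times average glyph width.
  text_px <- nchar(term) * font_px * char_ratio
  # Add padding, and never go below the minimum canvas size.
  max(min_px, ceiling(text_px * padding))
}

# A few rows of the sample data, for illustration.
datain <- data.frame(
  term = c("target black friday", "walmart layaway", "america idol"),
  freq = c(8239, 6502, 1777),
  stringsAsFactors = FALSE
)

# The highest-frequency phrase is drawn largest, so it drives the estimate.
top_term <- datain$term[which.max(datain$freq)]
side <- estimate_canvas_px(top_term)
```

The result could then replace the fixed 1000 in the script, for example `png("wordcloud.png", width=side, height=side)`. This only accounts for the top phrase; a long phrase with a lower frequency is drawn smaller, but could still fail to fit once the canvas fills up.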

Here is the resulting wordcloud:
