WordPress WordCloud with R

August 3, 2011

(This article was first published on binfalse » R, and kindly contributed to R-bloggers)

These days one frequently reads about wordclouds created with R, a trend set off by Ian Fellows' release of the wordcloud package on July 23rd. So here I am to put in my two cents.

I thought about creating a wordcloud of a complete blog history, so I built a script that connects to a MySQL database and grabs all published posts and pages. All articles are combined into one huge text which, once purged of tags and special characters, is visualized as a wordcloud:

library(RMySQL)
library(wordcloud)
library(RColorBrewer)

# special chars we want to delete (regex metacharacters are escaped)
sent=c(",", "\\.", ";", "=", ":", "\\?", "!", "-", "\\(", "\\)", "\\*", "&", "%", "\\$", "\\+", "\"", "'", "<", ">", "\\[", "\\]", "\\{", "\\}", "\\/", "\\\\")
# wordpress bb-codes, also delete!
bbcd=c("\\[cc.+?/cci?\\]", "\\[latex.+?/latex\\]", "\\[caption.+?/caption\\]")
# and of course delete HTML tags
tags=c("a", "b", "abbr", "strong", "em", "i", "p", "more", "td", "table", "tr", "th", "script", "h1", "h2", "h3", "h4", "h5", "h6", "div", "span", "small","img")
tags=paste("</?", tags, "[^>]*>", sep="")
# combine all purge regexes
repl=c(tags, bbcd, sent)

# connect to your DB
con <- dbConnect(MySQL(), user="USER", password="PASSPHRASE", dbname="DB", host="HOST")
# select all published articles
res <- dbGetQuery(con, "SELECT post_content, post_title FROM wp_posts WHERE post_status='publish'")
# combine them into one text
text=paste(as.matrix(res), collapse=" ")
dbDisconnect(con)

# replace all unwanted stuff with a space, one pattern at a time
for (r in repl) text <- gsub(r, " ", text)
# here are our words:
words=table(strsplit(tolower(text), "\\s+"))

# remove words containing multi-byte (non-ASCII) characters
words=words[nchar(names(words), "chars")==nchar(names(words), "bytes")]
# remove words shorter than 4 chars
words=words[nchar(names(words), "chars")>3]
# remove words occurring fewer than 5 times
words=words[words>4]

# create the image
png("/tmp/cloud.png", width=580, height=580)
pal2 <- brewer.pal(8,"Set2")
wordcloud(names(words), words, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=FALSE, rot.per=.3, colors=pal2)
dev.off()
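
To experiment with the cleaning and counting steps without a database, here is a minimal sketch that runs on an in-memory string. It is a simplification, not the script above: count_words, its parameters, and the sample text are made-up names for illustration, and it uses one generic tag pattern plus [[:punct:]] instead of the explicit tag and character lists.

```r
# Count word frequencies in a chunk of HTML-ish text.
# count_words and its arguments are illustrative, not part of the original script.
count_words <- function(text, min_chars = 4, min_count = 1) {
  patterns <- c(
    "</?[a-zA-Z][^>]*>",  # any HTML tag (broader than an explicit tag list)
    "[[:punct:]]"         # punctuation and similar special characters
  )
  # replace every match with a space, one pattern at a time
  for (p in patterns) text <- gsub(p, " ", text)
  # split on whitespace and keep only sufficiently long tokens
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[nchar(tokens) >= min_chars]
  counts <- table(tokens)
  counts[counts >= min_count]
}

sample_text <- "<p>Clouds of <b>words</b>: words, words & more words!</p>"
freq <- count_words(sample_text)
print(freq)
```

The resulting table can be passed to wordcloud() the same way as in the script, with names(freq) as the words and freq as the frequencies.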

Enough code, here is the result for my small blog:

[Image: WordPress WordCloud with R]

Nice image, isn’t it? Unfortunately, it takes about 30 seconds to generate; otherwise it would be cool to create such a cloud live with, for example, rApache.
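
One possible way around the long run time, offered as an assumption rather than something the script does: store an expensive result in a file and reuse it while it is still fresh, so that a live request rarely pays the full cost. The helper cached, its arguments, and the stand-in slow_counts function are hypothetical names. Whether this helps depends on which step dominates the 30 seconds, which I have not measured here.

```r
# Hypothetical helper: return the cached value if the cache file is fresh,
# otherwise recompute it with compute() and store the result.
cached <- function(path, max_age_secs, compute) {
  if (file.exists(path)) {
    age <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "secs"))
    if (age < max_age_secs) return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Stand-in for the slow part (database query, cleaning, counting).
calls <- 0
slow_counts <- function() {
  calls <<- calls + 1
  table(c("wordcloud", "wordcloud", "mysql"))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, 3600, slow_counts)  # computes and stores
second <- cached(cache_file, 3600, slow_counts)  # served from the file
```

The same idea applies to the finished PNG: regenerate it only when the existing file is older than some threshold, and serve the stored image otherwise.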

Download:
R: wordpress-wordcloud.R
(Please take a look at the man-page. Browse bugs and feature requests.)
