WordPress WordCloud with R

[This article was first published on binfalse » R, and kindly contributed to R-bloggers].

These days one frequently reads about word clouds created with R, prompted by Ian Fellows' release of the wordcloud package on July 23rd. So here are my two cents.

I thought about creating a word cloud of a complete blog history, so I built a script that connects to the blog's MySQL database and grabs all published posts and pages. All articles are combined into one large text which, once purged of tags and special characters, is visualized as a word cloud:

library(RMySQL)
library(wordcloud)
library(RColorBrewer)

# special chars we want to delete (regex metacharacters are escaped)
sent <- c(",", "\\.", ";", "=", ":", "\\?", "!", "-", "\\(", "\\)", "\\*", "&", "%", "\\$", "\\+", "\"", "'", "<", ">", "\\[", "\\]", "\\{", "\\}", "/", "\\\\")
# wordpress bb-codes, also delete!
bbcd <- c("\\[cc.+?/cci?\\]", "\\[latex.+?/latex\\]", "\\[caption.+?/caption\\]")
# and of course delete HTML tags
tags <- c("a", "b", "abbr", "strong", "em", "i", "p", "more", "td", "table", "tr", "th", "script", "h1", "h2", "h3", "h4", "h5", "h6", "div", "span", "small", "img")
# one regex per tag, matching its opening and closing form including attributes
tags <- paste("</?", tags, "[^>]*>", sep="")
# combine all purge regexes
repl <- c(tags, bbcd, sent)

# connect to your DB
con <- dbConnect(MySQL(), user="USER", password="PASSPHRASE", dbname="DB", host="HOST")
# select all published articles
res <- dbGetQuery(con, "SELECT post_content, post_title FROM wp_posts WHERE post_status='publish'")
# combine them in a text
text <- paste(as.matrix(res), collapse=" ")
dbDisconnect(con)

# replace all unwanted stuff with a blank
for (r in repl) text <- gsub(r, " ", text)
# here are our words:
words <- table(strsplit(tolower(text), "\\s+"))

# remove words containing multi-byte characters (char count differs from byte count)
words <- words[nchar(names(words), "chars") == nchar(names(words), "bytes")]
# remove words shorter than 4 chars
words <- words[nchar(names(words), "chars") > 3]
# remove words occurring less than 5 times
words <- words[words > 4]

# create the image
png("/tmp/cloud.png", width=580, height=580)
pal2 <- brewer.pal(8, "Set2")
wordcloud(names(words), words, scale=c(9, .1), min.freq=3, max.words=Inf, random.order=FALSE, rot.per=.3, colors=pal2)
dev.off()
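
To try the cleaning and counting steps without a database, the same idea can be sketched on a plain string. This is only an illustration under some assumptions: `count_words` and `sample_post` are hypothetical names that do not appear in the script above, a single generic tag regex and the `[[:punct:]]` class stand in for the longer lists of patterns, and the plotting step is left out so that only base R is needed.

```r
# Hypothetical helper, not part of the original script: strip HTML tags and
# punctuation from a string, then count the remaining words.
count_words <- function(text, min_chars = 4, min_count = 1) {
  # replace anything that looks like an HTML tag with a blank
  text <- gsub("<[^>]*>", " ", text)
  # replace every punctuation character with a blank
  text <- gsub("[[:punct:]]", " ", text)
  # split on whitespace and tabulate the lower-cased tokens
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  words <- table(tokens)
  # keep words that are long enough and frequent enough
  words <- words[nchar(names(words), "chars") >= min_chars]
  words[words >= min_count]
}

# made-up sample text standing in for the combined blog posts
sample_post <- "<p>Wordcloud with <b>R</b>: a wordcloud of my blog!</p>"
counts <- count_words(sample_post)
print(counts)
```

For the sample string, "wordcloud" is counted twice while short tokens such as "r" and "my" are dropped. The resulting table could then be passed to `wordcloud(names(counts), counts)` in the same way as `words` in the script.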

Enough code; here is the result for my small blog:

[Image: WordPress WordCloud with R]

Smart image, isn't it? Unfortunately it takes about 30 seconds to generate; otherwise it would be cool to create such a cloud live, for example with rApache.

Download:
R: wordpress-wordcloud.R
(Please take a look at the man-page. Browse bugs and feature requests.)
