Detour in taste wordclouds

March 10, 2012
By

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I read Mining Twitter for consumer attitudes towards hotels in my feed of R-bloggers. That reminded me that I intended to look at generating wordclouds for salt and MSG at some point. Salt, or sodium is linked to hypertension, which is linked to some diseases http://en.wikipedia.org/wiki/Complications_of_hypertension. It is a topic within governments and health organizations, but I have the feeling it is not so much an issue in the public. MSG, or mono sodium glutamate, is not an issue for the governments of health organisations, but has a bad name and is for some linked to the chinese restaurant syndrom.  Luckily there was an nice post to follow: Generating Twitter Wordclouds in R.
Salt
Neither @Salt nor #Salt are good when interested in salt taste. Hence the search is for #sodium

sodium.tweets <- searchTwitter('#sodium',n=1500)
sodium.texts <- laply(sodium.tweets, function(x) x$getText())
head(sodium.texts)
[1] "#Citric Acid #Sodium Bicarbonate http://t.co/QgJxSlGT HealthAid Vitamin C 1000mg - Effervescent (Blackcurrant Flavour) - 20 Tablets"
[2] "I dnt understand metro I can go on Facebook an Twitter but I can't call or text anybody #sodium"                                    
[3] "Get the facts on #sodium:http://t.co/Djc9rTEl #BCHC @TheHSF"                                                                        
[4] "#Sodium: How to tame your salt habit now? http://t.co/eFTl8yI1"                                                                     
[5] "#lol #funny #insta #instafunny #haha #smile #meme #chemistry #joke #sodium  http://t.co/pX404RhQ"                                   
[6] "@Astroboii07 #sodium. Haha. Tas bisaya daw. i-sudyum. Hahaha.  @andiedote @krizhsexy @mjpatingo  #building"                         
At this point I found the blog twitter to wordcloud, so I restarted and used those functions. The original is from Using Text Mining to Find Out What @RDataMining Tweets are About. There was a small bit of editing. Require(tm) and require(wordcloud) within the functions did not work, so I called on the libraries directly. The clouds had some links in them, shown as 'httpt' with some more text added (link to a chemistry joke) a function to remove those is added too.
library(tm)
library(wordcloud)

RemoveAtPeople <- function(tweet) {
gsub("@\\w+", "", tweet)
}

RemoveHTTP <- function(tweet) {
gsub("http[[:alnum:][:punct:]]+", "", tweet)
}

generateCorpus= function(df,my.stopwords=c()){
#The following is cribbed and seems to do what it says on the can
tw.corpus= Corpus(VectorSource(df))
# remove punctuation
tw.corpus = tm_map(tw.corpus, removePunctuation)
#normalise case
tw.corpus = tm_map(tw.corpus, tolower)
# remove stopwords
tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
tw.corpus
}
wordcloud.generate=function(corpus,min.freq=3){
doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
dm = as.matrix(doc.m)
# calculate the frequency of words
v = sort(rowSums(dm), decreasing=TRUE)
d = data.frame(word=names(v), freq=v)
#Generate the wordcloud
wc=wordcloud(d$word, d$freq, min.freq=min.freq)
wc
}
tweets.grabber=function(searchTerm,num=500){
rdmTweets = searchTwitter(searchTerm, n=num,.encoding='UTF-8')
tw.df=twListToDF(rdmTweets)
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
as.vector(sapply(tweets,RemoveHTTP))
}



tweets=tweets.grabber('sodium',num=500)
tweets <- tweets[-308] # tweet in wrong locale
wordcloud.generate(generateCorpus(tweets,'sodium'),3)
The ugly line which removed tweed 308 is because this is in the wrong locale. It gave an error. This is an error which is not simple to resolve, so I removed the offending tweet: R tm package invalid input in 'utf8towcs'
Error in FUN(X[[308L]], ...) : 
  invalid input 'That was too much sodium

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.