# Detour in taste wordclouds

March 10, 2012

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I read Mining Twitter for consumer attitudes towards hotels in my feed of R-bloggers. That reminded me that I intended to look at generating wordclouds for salt and MSG at some point. Salt, or sodium, is linked to hypertension, which in turn is linked to a number of diseases (http://en.wikipedia.org/wiki/Complications_of_hypertension). It is a topic within governments and health organisations, but I have the feeling it is not so much of an issue for the public. MSG, or monosodium glutamate, is not an issue for governments or health organisations, but it has a bad name and is by some linked to Chinese restaurant syndrome. Luckily there was a nice post to follow: Generating Twitter Wordclouds in R.
## Salt
Neither @Salt nor #Salt is much good when one is interested in salt taste. Hence the search is for #sodium.

```r
sodium.texts <- laply(sodium.tweets, function(x) x$getText())
head(sodium.texts)
```

```
[1] "#Citric Acid #Sodium Bicarbonate http://t.co/QgJxSlGT HealthAid Vitamin C 1000mg - Effervescent (Blackcurrant Flavour) - 20 Tablets"
[2] "I dnt understand metro I can go on Facebook an Twitter but I can't call or text anybody #sodium"
[3] "Get the facts on #sodium:http://t.co/Djc9rTEl #BCHC @TheHSF"
[4] "#Sodium: How to tame your salt habit now? http://t.co/eFTl8yI1"
[5] "#lol #funny #insta #instafunny #haha #smile #meme #chemistry #joke #sodium http://t.co/pX404RhQ"
[6] "@Astroboii07 #sodium. Haha. Tas bisaya daw. i-sudyum. Hahaha. @andiedote @krizhsexy @mjpatingo #building"
```

At this point I found the blog post twitter to wordcloud, so I restarted and used those functions. The original is from Using Text Mining to Find Out What @RDataMining Tweets are About. A small bit of editing was needed: require(tm) and require(wordcloud) within the functions did not work, so I load the libraries directly. The clouds also contained some links, shown as 'httpt' with some more text appended (a link to a chemistry joke), so a function to remove those is added as well.
```r
library(tm)
library(wordcloud)

RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}

RemoveHTTP <- function(tweet) {
  gsub("http[[:alnum:][:punct:]]+", "", tweet)
}

generateCorpus <- function(df, my.stopwords=c()) {
  # The following is cribbed and seems to do what it says on the can
  tw.corpus <- Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus <- tm_map(tw.corpus, removePunctuation)
  # normalise case
  tw.corpus <- tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus <- tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus <- tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}

wordcloud.generate <- function(corpus, min.freq=3) {
  doc.m <- TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm <- as.matrix(doc.m)
  # calculate the frequency of words
  v <- sort(rowSums(dm), decreasing=TRUE)
  d <- data.frame(word=names(v), freq=v)
  # generate the wordcloud
  wc <- wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

tweets.grabber <- function(searchTerm, num=500) {
  rdmTweets <- searchTwitter(searchTerm, n=num, .encoding='UTF-8')
  tw.df <- twListToDF(rdmTweets)
  tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
  as.vector(sapply(tweets, RemoveHTTP))
}
```
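As a quick sanity check on the two cleaning helpers, here is what they do to a made-up tweet (the text below is illustrative, not from the actual search results):

```r
# hypothetical example tweet to illustrate the cleaning helpers above
tweet <- "@TheHSF Get the facts on #sodium: http://t.co/Djc9rTEl"
no.at  <- gsub("@\\w+", "", tweet)                       # drop @mentions
no.url <- gsub("http[[:alnum:][:punct:]]+", "", no.at)   # drop links
no.url
# [1] " Get the facts on #sodium: "
```

Hashtags are deliberately left in place, since they carry most of the topical words.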

```r
tweets <- tweets.grabber('sodium', num=500)
tweets <- tweets[-308] # tweet in wrong locale
wordcloud.generate(generateCorpus(tweets, 'sodium'), 3)
```
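Under the hood, wordcloud.generate does nothing more exotic than count how often each term occurs across all tweets and hand those counts to wordcloud. A minimal base-R illustration of that counting step, on toy strings (no tm involved):

```r
# toy 'corpus', already lower-cased and stripped of punctuation
docs  <- c("salt and sodium", "sodium in food", "too much sodium")
words <- unlist(strsplit(docs, " "))       # flatten into individual words
freq  <- sort(table(words), decreasing = TRUE)
freq[["sodium"]]                           # counted 3 times, once per toy document
```

In the real function the same counts come from rowSums on the term-document matrix.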
The ugly line which removes tweet 308 is there because that tweet is in the wrong locale and gave an error. This error is not simple to resolve (see R tm package invalid input in 'utf8towcs'), so I removed the offending tweet:
```
Error in FUN(X[[308L]], ...) :
  invalid input 'That was too much sodium
```
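An alternative workaround, which I did not use above, is to run the whole vector through iconv and replace any bytes that cannot be converted, instead of dropping the tweet by index. A sketch, with made-up strings (the second one contains a latin1-encoded 'é'):

```r
# replace unconvertible bytes with '' rather than erroring out later in tm
x <- c("plain ascii", "caf\xe9")
y <- iconv(x, from = "latin1", to = "ASCII", sub = "")
y
# [1] "plain ascii" "caf"
```

This mangles the affected words, but it is index-independent, so it keeps working when the search returns a different set of tweets.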