Fun with Twitter

January 28, 2013

(This article was first published on Frank Portman, and kindly contributed to R-bloggers)

I’ve been playing around with the ‘twitteR’ package for R ever since I heard of its existence. Twitter is great and easy to mine because the messages are all-text and most people’s profiles are public. This process is made even easier with the ‘twitteR’ package, which takes advantage of the Twitter API.

After exploring some of the package’s capabilities, I decided to conduct a pretty basic sentiment analysis on some tweets with various hashtags. Specifically, I analyzed the polarity of each tweet – whether the tweet is positive, negative, or neutral.

The hashtags I used were: #YOLO, #FML, #blessed, #bacon

The actual script is fairly simple and repetitive but does yield some interesting results:

library(twitteR)
library(sentiment)
library(ggplot2)
library(RJSONIO)
library(wordcloud)

# For each hashtag: fetch up to 1500 tweets, keep only the text, classify it,
# and pair each tweet's polarity label (column 4) with the hashtag name.
# cbind() sizes the matrix to the number of tweets actually returned,
# which can be fewer than the 1500 requested.
yolo_tweets <- searchTwitter("#yolo", n = 1500)
yolo_tweets <- twListToDF(yolo_tweets)
yolo_tweets <- yolo_tweets$text
yolo_emotions <- classify_emotion(yolo_tweets)
yolo_polarity <- classify_polarity(yolo_tweets)
yolo_polarity.new <- cbind(yolo_polarity[, 4], "yolo")

fml_tweets <- searchTwitter("#fml", n = 1500)
fml_tweets <- twListToDF(fml_tweets)
fml_tweets <- fml_tweets$text
fml_emotions <- classify_emotion(fml_tweets)
fml_polarity <- classify_polarity(fml_tweets)
fml_polarity.new <- cbind(fml_polarity[, 4], "fml")

blessed_tweets <- searchTwitter("#blessed", n = 1500)
blessed_tweets <- twListToDF(blessed_tweets)
blessed_tweets <- blessed_tweets$text
blessed_emotions <- classify_emotion(blessed_tweets)
blessed_polarity <- classify_polarity(blessed_tweets)
blessed_polarity.new <- cbind(blessed_polarity[, 4], "blessed")

bacon_tweets <- searchTwitter("#bacon", n = 1500)
bacon_tweets <- twListToDF(bacon_tweets)
bacon_tweets <- bacon_tweets$text
bacon_emotions <- classify_emotion(bacon_tweets)
bacon_polarity <- classify_polarity(bacon_tweets)
bacon_polarity.new <- cbind(bacon_polarity[, 4], "bacon")



polarities <- rbind(yolo_polarity.new, fml_polarity.new,
                    blessed_polarity.new, bacon_polarity.new)
colnames(polarities) <- c("Polarities", "Hashtag")

qplot(polarities[, 2], fill = polarities[, 1]) + xlab("Hashtags") +
      scale_fill_discrete(name = "Text Polarity") +
      ggtitle("Polarities of Different Hashtags on Twitter")

The resulting bar chart shows some peculiar results. For one, all of these hashtags appeared mostly alongside positive messages. I did not expect that for #fml, since the FMyLife site is a place for people to post negative things that have happened to them. Nevertheless, the other hashtags had more positive tweets, which was expected.
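
Reading shares off a stacked plot by eye is imprecise, so it can help to tabulate the proportion of each polarity within a hashtag. The following is a minimal sketch in base R; the `toy` matrix is made-up data that only mimics the two-column layout of the `polarities` matrix built above, and its numbers are not the actual results.

```r
# Made-up stand-in for the `polarities` matrix (NOT real results):
# column 1 holds the polarity label, column 2 the hashtag.
toy <- cbind(
  Polarities = c("positive", "positive", "negative",
                 "positive", "neutral", "negative"),
  Hashtag    = c("yolo", "yolo", "yolo", "fml", "fml", "fml")
)

# Cross-tabulate hashtag (rows) against polarity (columns).
counts <- table(Hashtag = toy[, "Hashtag"], Polarity = toy[, "Polarities"])

# Convert counts to within-hashtag proportions, so each row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Swapping `toy` for the real `polarities` matrix should give one row per hashtag, which makes it easier to compare hashtags that returned different numbers of tweets.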

Next, I decided to explore some of the functions of the ‘wordcloud’ package in R. In order to do so, I mined tweets that contained #rstats, and built a wordcloud that sized and placed words based on their frequencies.

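library(tm)  # Corpus, VectorSource, tm_map, TermDocumentMatrix, stopwords

rstats_tweets <- searchTwitter("#rstats", n = 1500)
rstats_tweets <- twListToDF(rstats_tweets)
rstats_tweets <- rstats_tweets$text

# Re-encode every tweet as UTF-8, replacing invalid bytes with hex escapes.
# content_transformer() is needed in newer versions of tm (0.6 and later)
# so that tm_map keeps returning a corpus of documents.
rstats_corpus <- Corpus(VectorSource(rstats_tweets))
rstats_corpus <- tm_map(rstats_corpus,
                        content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

# Build the term-document matrix, dropping punctuation, numbers,
# English stop words, and the search term itself.
tdm <- TermDocumentMatrix(rstats_corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = c("rstats",
                                                       stopwords("english")),
                                         removeNumbers = TRUE,
                                         tolower = TRUE))

m <- as.matrix(tdm)

# Total count of each word across all tweets, most frequent first
word_freqs <- sort(rowSums(m), decreasing = TRUE)

dm <- data.frame(word = names(word_freqs), freq = word_freqs)

wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))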

I had to use the tm_map function to ensure all tweets were encoded properly, before using TermDocumentMatrix. As we can see, ‘shiny’ is by far the most popular word tweeted with ‘#rstats’. This should come as no surprise – Shiny is RStudio’s new and exciting way to integrate R with web applications.
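
To make the counting step concrete, here is a rough base R sketch of what building the term-document matrix and summing its rows amounts to: lowercase the text, strip punctuation and numbers, drop stop words, then count. The three tweets and the tiny stop-word list are invented for illustration, and tm's own tokenizer and `stopwords("english")` list differ in detail, so treat this as an approximation rather than what tm does internally.

```r
# Invented example tweets (not real data).
tweets <- c("Shiny is great! #rstats",
            "Trying shiny with ggplot2 #rstats",
            "The new shiny app is live")

# Hypothetical, tiny stop-word list; tm's stopwords("english") is much longer.
stop_words <- c("rstats", "is", "the", "with")

clean <- tolower(tweets)
clean <- gsub("[[:punct:]]", "", clean)   # "#rstats" becomes "rstats"
clean <- gsub("[[:digit:]]", "", clean)   # "ggplot2" becomes "ggplot"

# Split on whitespace and pool the words from all tweets.
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Count occurrences and sort so the most frequent word comes first.
word_freqs <- sort(table(words), decreasing = TRUE)
print(word_freqs)
```

The names and counts of `word_freqs` play the same role as the `word` and `freq` columns passed to `wordcloud()` above.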
