June 24, 2012

(This article was first published on fibosworld » R, and kindly contributed to R-bloggers)

Find the HTML slides here, and the .Rmd file that was used to generate them here.

For how to work with .Rmd files, see here.

These are my first steps with the interface from R to Twitter, using the twitteR package.

We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, who is the spokesman of the German government, and run some analyses, such as:

• What devices does he use to tweet?
• What is the frequency of his tweets, and on which days was he most active?
• Is there any particular pattern about when he tweets?
• What is he tweeting about?

## text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher

## Which device does he use?

• Mr. Seibert seems to love his iPad: the vast majority of his tweets are sent from Twitter's iPad app.
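The device name comes out of the status source field, which at the time arrived as an HTML anchor. A minimal sketch of that extraction in base R, using made-up source strings (the URLs and the plain `web` entry are illustrative assumptions, not data from the actual timeline):

```r
# Hypothetical status-source strings in HTML-anchor form (made-up sample data)
sources <- c(
  '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
  '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
  '<a href="http://twitter.com/#!/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'web'
)

# Strip every HTML tag, leaving only the client name
clients <- gsub("<[^>]+>", "", sources)

# Count tweets per client, most used first
client_counts <- sort(table(clients), decreasing = TRUE)
print(client_counts)

# Share of tweets per client
client_share <- round(prop.table(client_counts), 2)
print(client_share)
```

A single regular expression replaces the `gsub`/`strsplit` pair from the vignette code; both approaches should give the same client names for sources of this shape.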

## Analysis of frequency

• The tweets are summarized by day, and the bars show the number of tweets per day
• There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day
• The normal “tweet rate” seems to be between 3 and 10 per day
• There are even some “idle” days on which he did not tweet at all (holidays?)
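The per-day counts, including the idle days, can also be computed directly rather than read off the chart. A minimal sketch in base R, with made-up timestamps standing in for the `created` column:

```r
# Hypothetical timestamps standing in for the `created` column (made-up data)
created <- as.POSIXct(
  c("2012-03-14 09:15:00", "2012-03-14 11:40:00", "2012-03-14 16:05:00",
    "2012-03-15 10:30:00", "2012-03-18 12:00:00"),
  tz = "UTC"
)

# Reduce each timestamp to its calendar day
day <- as.Date(created, tz = "UTC")

# Build the full range of days so that days without tweets show up as zeros
all_days <- seq(min(day), max(day), by = "day")
per_day <- table(factor(as.character(day), levels = as.character(all_days)))
print(per_day)

# Days with no tweets at all, and the day with the most tweets
idle_days <- names(per_day)[per_day == 0]
busiest_day <- names(per_day)[which.max(per_day)]
```

Using a factor with one level per calendar day is what keeps the empty days in the table; a plain `table(day)` would silently drop them.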

## At what time of day is he tweeting?

• The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
• The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then it is lunch break, and around 16:00 Mr Seibert gets up to speed again, before his tweeting activity decays. Only at 20:00 is there another small spike. But the afternoon activity is not as pronounced as the morning activity.
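How the hourly counts read depends on the time zone of the timestamps. The following is a minimal sketch with made-up timestamps. It assumes they are stored in UTC while the clock of interest is German local time (`Europe/Berlin`). Whether the real `created` column is in UTC is an assumption to check, not something the analysis above confirms.

```r
# Hypothetical UTC timestamps standing in for the `created` column (made-up data)
created <- as.POSIXct(
  c("2012-05-02 08:10:00", "2012-05-02 09:45:00", "2012-05-02 10:20:00",
    "2012-05-02 14:05:00", "2012-05-02 18:30:00"),
  tz = "UTC"
)

# Hour of day in UTC and in German local time (assumed zone: Europe/Berlin)
hour_utc   <- as.integer(format(created, "%H", tz = "UTC"))
hour_local <- as.integer(format(created, "%H", tz = "Europe/Berlin"))

# Tabulate over all 24 hours so that empty hours appear as zeros
per_hour <- table(factor(hour_local, levels = 0:23))
print(per_hour)
```

In summer the two clocks differ by two hours, which is enough to move a "lunch break" dip by two bars in the histogram.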

## What is he tweeting about?

## [1] "bpa" "bundesregierung" "deu"
## [4] "fragreg" "für" "kanzlerin"
## [7] "mehr" "merkel" "neue"
## [10] "über" "uhr"
## merkel kanzlerin
## 1.00 0.73
• The most frequent terms are shown above; each occurs at least 30 times.
• Interestingly, there is no mention of “financial crisis” or other political terms. Mr Seibert seems to focus on announcing “presentations” (präs), press conferences (bpa) or Q&A sessions (fragreg).
• We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…
• The dendrogram shows other associations: terms relating to press conferences are close (“fragreg”, “uhr” etc).
• Interestingly, the term “DEU” (= Germany) is associated with “mehr” (= more). So whatever is being said, it seems to come with “more” for Germany.
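The association score can be read as a correlation between term rows of the term-document matrix; as far as I understand it, that is what `findAssocs` reports. A minimal sketch in base R on a made-up matrix (the counts are invented and will not reproduce the 0.73 above):

```r
# Toy term-document matrix: rows are terms, columns are tweets (made-up counts)
tdm <- rbind(
  merkel    = c(1, 1, 0, 1, 0, 1),
  kanzlerin = c(1, 1, 0, 1, 0, 0),
  uhr       = c(0, 0, 1, 0, 1, 0)
)

# Correlate every term with "merkel" across tweets (Pearson correlation)
assoc <- cor(t(tdm))["merkel", ]
assoc <- round(sort(assoc[names(assoc) != "merkel"], decreasing = TRUE), 2)
print(assoc)

# Keep only terms at or above a correlation limit of 0.20
strong <- assoc[assoc >= 0.20]
print(strong)
```

Terms that tend to appear in the same tweets get a score near 1, and terms that avoid each other get a negative score and fall below the limit.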

——————————————————–
The Code below:

## @knitr setup
library(knitr)      # opts_chunk, opts_knit, imgur_upload
library(twitteR)    # userTimeline, twListToDF
library(ggplot2)
library(scales)
library(lubridate)
library(gridExtra)
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004...
#str(regsprecher.tweets.df)

## @knitr device
# Code from vignette of twitteR-package
sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)  # strip the closing anchor tag
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
pie(table(sources))

## @knitr freq
# one bar per day (binwidth is in seconds: 86400 s = 1 day)
# geom_histogram() is used because current ggplot2 no longer bins in geom_bar()
ggplot() +
  geom_histogram(aes(x = created), data = regsprecher.tweets.df, binwidth = 86400.0) +
  scale_y_continuous(name = 'Frequency, # tweets/day') +
  scale_x_datetime(name = 'Date', breaks = date_breaks(), labels = date_format(format = '%Y-%b'))

## @knitr time
# left: every tweet as a point (date vs. hour of day)
plot1 <- ggplot() +
  geom_point(aes(x = created, y = hour(created)), data = regsprecher.tweets.df, alpha = 0.5) +
  scale_y_continuous(name = 'Hour of day')
# right: number of tweets per hour of day
plot2 <- ggplot() +
  geom_histogram(aes(x = hour(created)), data = regsprecher.tweets.df, binwidth = 1.0) +
  scale_x_continuous(name = 'Hour of day', breaks = seq(0, 24, by = 2), limits = c(0, 24)) +
  scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol = 2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
library(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
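# make each letter lowercase
# (tm >= 0.6 needs the content_transformer() wrapper; older versions took tolower directly)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))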
# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
# (as.matrix() gives the dense matrix without printing it, unlike inspect())
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
# Ward clustering; R >= 3.1.0 names this method "ward.D" and accepts "ward" with a note
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
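
To see how the distance matrix turns into the groups visible in the dendrogram, here is a minimal sketch on made-up term counts. It skips the scaling step and uses average linkage instead of Ward, simply because that method name is the same across R versions; the counts and the choice of two clusters are assumptions for illustration.

```r
# Toy term-by-tweet counts (made-up data); rows are the terms to be clustered
terms <- rbind(
  fragreg   = c(1, 1, 0, 0, 1, 0),
  uhr       = c(1, 1, 0, 0, 1, 0),
  merkel    = c(0, 0, 1, 1, 0, 1),
  kanzlerin = c(0, 0, 1, 1, 0, 0)
)

# Euclidean distance between term rows: terms used in the same tweets are close
d_toy <- dist(terms, method = "euclidean")

# Agglomerative clustering (average linkage here, not the Ward method used above)
fit_toy <- hclust(d_toy, method = "average")

# Cut the tree into two groups and show which term landed where
groups <- cutree(fit_toy, k = 2)
print(groups)
```

Calling `plot(fit_toy)` draws the corresponding dendrogram, with the press-conference terms on one branch and the chancellor terms on the other.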
