Using twitteR to see what the German press secretary tweets about

June 24, 2012

(This article was first published on fibosworld » R, and kindly contributed to R-bloggers)

Find the HTML slides here, and the .Rmd file that was used to generate them here.

For how to deal with .Rmd files, see here.

What this is about

These are my first steps playing around with the interface from R to Twitter, using the twitteR package.

We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, who is the spokesman of the German government, and run some analyses; a minimal sketch of the data pull follows the list below:

  • What devices does he use to tweet?
  • What is the frequency of his tweets, and on which days was he most active?
  • Is there any particular pattern about when he tweets?
  • What is he tweeting about?
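The pull itself boils down to two calls from the twitteR package (a minimal sketch; the complete, runnable version is in the code section at the end of the post):

# install.packages("twitteR")  # if not installed yet
library(twitteR)
# fetch up to 1500 of the user's most recent tweets
tweets <- userTimeline("RegSprecher", n = 1500)
tweets.df <- twListToDF(tweets)  # list of status objects -> data frame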

Load data

##                                                                   text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher; the first roughly translates to "Was in any case a stimulating Sunday-morning read; but whether it is enough for the immediate political breakthrough?"

Which device does he use?

  • Mr. Seibert seems to love his iPad: the vast majority of tweets are issued from Twitter's iPad app (see the worked example below).
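How the device is pulled out of a tweet is shown in the code section; here is a worked example of the same cleaning steps (the source string below is made up, but has the typical shape of twitteR's statusSource field):

src <- '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>'
src <- gsub("</a>", "", src)       # drop the closing anchor tag
parts <- strsplit(src, ">")[[1]]   # split off the opening <a ...> tag
device <- ifelse(length(parts) > 1, parts[2], parts[1])
device                             # "Twitter for iPad"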

Analysis of frequency

  • The tweets are summarized by day, and the bars show the number of tweets per day.
  • There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day (the snippet after this list shows how to check this).
  • The normal tweet rate seems to be between 3 and 10 per day.
  • There are even some "idle" days on which he did not tweet at all – holidays?
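These claims are easy to check against the raw daily counts (a sketch; it assumes the regsprecher.tweets.df data frame built in the code section):

daily <- table(as.Date(regsprecher.tweets.df$created))  # tweets per calendar day
head(sort(daily, decreasing = TRUE))                    # the heaviest days
sum(daily >= 40)                                        # how many days had 40+ tweets
sum(daily == 0)                                         # idle days appear as gaps, not zeros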

At what time of day is he tweeting?

  • The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0:00 and 6:00.
  • The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then comes the lunch break, and around 16:00 Mr Seibert gets up to speed again before his tweeting activity decays. Only at 20:00 is there another small spike. Overall, the afternoon activity is not as pronounced as the morning activity. (The snippet after this list tabulates the hours directly.)
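The hourly distribution can also be tabulated directly (a sketch, again assuming the data frame from the code section and lubridate's hour()):

library(lubridate)
hours <- hour(regsprecher.tweets.df$created)  # hour of day, 0-23
table(hours)                                  # tweet count per hour
which.max(table(hours))                       # the busiest hour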

What is he tweeting about?

## [1] "bpa" "bundesregierung" "deu" ## [4] "fragreg" "für" "kanzlerin" ## [7] "mehr" "merkel" "neue" ## [10] "über" "uhr" 
## merkel kanzlerin ## 1.00 0.73 
  • The most frequent terms are shown above; each of them occurs more than 30 times.
  • Interestingly, there is no mention of "financial crisis" or other political terms – Mr Seibert seems to focus on announcing "presentations" (präs), press conferences (bpa) and Q&A sessions (fragreg).
  • We can also investigate which words are associated: the word "merkel" is associated with "kanzlerin", which is no surprise…
  • The dendrogram shows other associations: terms relating to press conferences are close together ("fragreg", "uhr" etc.); see the snippet after this list.
  • Interestingly, the term "deu" (= Germany) is associated with "mehr" (= more). So, whatever is said, there is also "more" of it with Germany ;-)
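The groups in the dendrogram can also be read off programmatically by cutting the tree (a sketch; fit is the hclust object from the code section, and k = 4 is an arbitrary choice for illustration):

groups <- cutree(fit, k = 4)  # assign each term to one of 4 clusters
split(names(groups), groups)  # list the terms in each cluster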

——————————————————–
The code below:

## @knitr setup
library(knitr)     # provides opts_chunk, opts_knit and imgur_upload
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01")) # need to subset, because sometimes there are tweets from 2004...
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[,c(1,4,10)]))

## @knitr device
# Code from the vignette of the twitteR package
sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)  # drop the closing anchor tag
sources <- strsplit(sources, ">")     # split off the opening <a ...> tag
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
pie(table(sources))

## @knitr freq
ggplot() +
  geom_bar(aes(x = created), data = regsprecher.tweets.df, binwidth = 86400.0) + # 1 day = 86400 s
  scale_y_continuous(name = 'Frequency, # tweets/day') +
  scale_x_datetime(name = 'Date', breaks = date_breaks(), labels = date_format(format = '%Y-%b'))

## @knitr time
plot1 <- ggplot() + geom_point(aes(x=created, y=hour(created)), data=regsprecher.tweets.df, alpha=0.5) +scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
 geom_bar(aes(x = hour(created)),data=regsprecher.tweets.df,binwidth = 1.0) +
 scale_x_continuous(name = 'Hour of day',breaks = c(c(0,6,10,12,8,14,16,18,20,
 22,2,4,24)),limits = c(0,24)) +
 scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower) 
# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
fit <- hclust(d, method="ward")  # "ward" is called "ward.D" in newer R versions
plot(fit)                        # display the dendrogram


