What this is about
These are my first steps playing around with the interface from R to Twitter, using the twitteR package.
We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, the spokesman of the German government, and run some analyses, like:
- What devices does he use to tweet?
- What is the frequency of his tweets, and on which days was he most active?
- Is there any particular pattern about when he tweets?
- What is he tweeting about?
The latest 6 tweets of @RegSprecher

##   text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht?
Which device does he use?
- Mr. Seibert seems to love his iPad: the vast majority of tweets are issued via the Twitter iPad app.
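The parsing step behind this chart can be illustrated on a made-up source string (the real strings come from `getStatusSource()` in the code below): the field is an HTML anchor, and stripping the closing tag and splitting on `>` leaves just the client name.

```r
# Toy sketch, not the live API call: the status "source" field looks like
# <a href="...">Twitter for iPad</a>, or plain "web" for the website.
sources <- c('<a href="http://twitter.com/#!/download/ipad">Twitter for iPad</a>',
             'web')
sources <- gsub("</a>", "", sources)   # drop the closing tag
sources <- strsplit(sources, ">")      # split off the opening tag
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
table(sources)                         # counts per client, ready for pie()
```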
Analysis of frequency
- The tweets are summarized by day, and the bars show the number of tweets per day
- There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day
- The normal tweet rate seems to be between 3 and 10 per day
- There are even some "idle" days where he did not tweet at all – holidays?
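The per-day binning (a binwidth of 86400 seconds in the ggplot call) amounts to counting tweets per calendar day. A minimal sketch, with made-up timestamps standing in for the `created` column of the tweet data frame:

```r
library(lubridate)  # ymd_hms()

# invented timestamps for illustration; the real ones come from the API
created <- ymd_hms(c("2012-03-15 09:00:00", "2012-03-15 11:30:00",
                     "2012-03-15 17:20:00", "2012-03-16 10:00:00"))
# one bin per calendar day = table of dates
tweets.per.day <- table(as.Date(created))
print(tweets.per.day)
```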
At what time of day is he tweeting?
- The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o'clock.
- The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then comes the lunch break, and around 16:00 Mr Seibert gets up to speed again, before his tweeting activity decays. Only at 20:00 is there another small spike. The afternoon activity is not as pronounced as the morning activity, though.
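The hour-of-day bars boil down to extracting the hour from each timestamp with lubridate and tabulating it. A toy sketch (timestamps invented for illustration):

```r
library(lubridate)  # ymd_hms(), hour()

# invented timestamps standing in for the "created" column
created <- ymd_hms(c("2012-05-02 10:15:00", "2012-05-02 13:40:00",
                     "2012-05-03 10:05:00", "2012-05-03 20:30:00"))
hours <- hour(created)
table(hours)  # counts per hour of day, as in the right-hand bar chart
```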
What is he tweeting about?
##  "bpa"      "bundesregierung" "deu"
##  "fragreg"  "für"             "kanzlerin"
##  "mehr"     "merkel"          "neue"
##  "über"     "uhr"

##    merkel kanzlerin
##      1.00      0.73
- The most frequent terms are shown above; each of them occurs more than 30 times.
- Interestingly, there is no mention of “financial crisis” or other political terms – Mr Seibert seems to focus on announcing “presentations” (präs) or press conferences (bpa) or Q&A sessions (fragreg).
- We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin”, which is no surprise…
- The dendrogram shows other associations: Terms relating to press conferences are close (“fragreg”, “uhr” etc).
- Interestingly, the term "DEU" (= Germany) is associated with "mehr" (= more). So whatever is announced, it apparently always comes with "more" for Germany.
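The term frequencies and associations above are read off a term-document matrix. A self-contained toy version with three invented "tweets" (the real corpus is built from `regsprecher.tweets.df$text` in the code below):

```r
library(tm)  # Corpus, TermDocumentMatrix, findFreqTerms, findAssocs

# three made-up documents for illustration
docs <- c("merkel kanzlerin berlin",
          "merkel kanzlerin fragreg",
          "fragreg uhr bpa")
corp <- Corpus(VectorSource(docs))
tdm  <- TermDocumentMatrix(corp)
findFreqTerms(tdm, lowfreq = 2)   # terms occurring at least twice overall
findAssocs(tdm, "merkel", 0.5)    # terms correlated with "merkel" (e.g. "kanzlerin")
```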
The code below:

```r
## @knitr setup
library(knitr)
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to a data frame
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
# need to subset, because sometimes there are tweets from 2004...
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01"))
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[, c(1, 4, 10)]))

## @knitr device
# Code from the vignette of the twitteR package: the "source" field is an
# HTML anchor such as <a href="...">Twitter for iPad</a>
sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
pie(table(sources))

## @knitr freq
ggplot() +
  geom_bar(aes(x = created), data=regsprecher.tweets.df, binwidth = 86400.0) +
  scale_y_continuous(name = 'Frequency, # tweets/day') +
  scale_x_datetime(name = 'Date', breaks = date_breaks(),
                   labels = date_format(format = '%Y-%b'))

## @knitr time
plot1 <- ggplot() +
  geom_point(aes(x = created, y = hour(created)),
             data=regsprecher.tweets.df, alpha=0.5) +
  scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
  geom_bar(aes(x = hour(created)), data=regsprecher.tweets.df, binwidth = 1.0) +
  scale_x_continuous(name = 'Hour of day',
                     breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24),
                     limits = c(0, 24)) +
  scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from
# http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)
# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation,
                        preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2
# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(inspect(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
```