(This article was first published on fibosworld » R, and kindly contributed to R-bloggers)
Find the HTML slides here, and the .Rmd file that was used to generate them here.
For how to work with .Rmd files, see here.
What this is about
These are my first steps in playing around with the interface from R to Twitter, using the twitteR package.
We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, who is the spokesman of the German government, and run some analyses, such as:
- What devices does he use to tweet?
- What is the frequency of his tweets, and on which days was he most active?
- Is there any particular pattern about when he tweets?
- What is he tweeting about?
Load data
##   text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?
The latest 6 tweets of @RegSprecher
Which device does he use?
- Mr. Seibert seems to love his iPad: the vast majority of his tweets are sent from Twitter's iPad app.
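The device names come from the status source field, which is delivered as an HTML anchor rather than as plain text. The following is a minimal sketch of how the link text can be pulled out and counted. The sample strings are made up for illustration and are not actual data from @RegSprecher.

```r
# Hypothetical source strings, shaped like the HTML anchors in the status
# source field; they are illustrative, not real data from the timeline.
sources <- c(
  '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
  '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
  'web'
)

# Drop the closing tag, then split on ">" so the link text is the second piece.
cleaned <- gsub("</a>", "", sources, fixed = TRUE)
pieces  <- strsplit(cleaned, ">", fixed = TRUE)

# Entries without an anchor (such as "web") have a single piece, kept as-is.
devices <- vapply(pieces,
                  function(x) if (length(x) > 1) x[2] else x[1],
                  character(1))

# Count tweets per device; this table is what a pie or bar chart would show.
device_counts <- table(devices)
print(device_counts)
```

Using `fixed = TRUE` treats the patterns as literal strings, which avoids surprises from regular-expression metacharacters in the HTML.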
Analysis of frequency
- The tweets are summarized by day, and the bars show the number of tweets per day.
- There were some heavy days in mid-March and at the beginning of May, with over 40 tweets per day.
- The normal “tweet rate” seems to be between 3 and 10 per day.
- There are even some “idle” days on which he did not tweet at all. Holidays?
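The daily counts behind such a bar chart can also be computed directly, which makes the idle days explicit. This is a small sketch in base R. The timestamps are hypothetical stand-ins for the `created` column, and UTC is assumed as the time zone.

```r
# Hypothetical timestamps standing in for the `created` column (assumed UTC).
created <- as.POSIXct(c(
  "2012-03-14 09:15:00", "2012-03-14 11:40:00", "2012-03-14 16:05:00",
  "2012-03-16 10:30:00", "2012-03-17 20:10:00"
), tz = "UTC")

# Reduce each timestamp to its calendar day.
days <- as.Date(created, tz = "UTC")

# Tabulate over the full date range so that days without tweets get a count of 0
# instead of silently disappearing from the table.
all_days <- seq(min(days), max(days), by = "day")
per_day  <- table(factor(as.character(days), levels = as.character(all_days)))
print(per_day)

# Days on which nothing was tweeted.
idle_days <- names(per_day)[as.vector(per_day) == 0]
print(idle_days)
```

Tabulating a factor with explicit levels is the design choice that matters here: a plain `table(days)` would only list days that have at least one tweet.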
At what time of day is he tweeting?
- The left chart shows at which hours of the day Mr Seibert is tweeting. Clearly, there is less activity between 0 and 6 o’clock.
- The right chart shows this more clearly: most tweets happen between 10:00 and 14:00. Then comes the lunch break, and around 16:00 Mr Seibert gets up to speed again before his tweeting activity decays. There is another small spike at 20:00, but the afternoon activity is not as pronounced as the morning activity.
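The hourly histogram boils down to extracting the hour from each timestamp and counting. Below is a minimal base R sketch with made-up timestamps, again assuming UTC. Note that the hour depends on the time zone the timestamps are interpreted in, so the peaks shift if a different zone is used.

```r
# Hypothetical timestamps standing in for the `created` column (assumed UTC).
created <- as.POSIXct(c(
  "2012-05-02 10:05:00", "2012-05-02 10:50:00", "2012-05-02 13:20:00",
  "2012-05-03 16:45:00", "2012-05-04 20:10:00"
), tz = "UTC")

# Extract the hour (0-23) in base R; lubridate::hour(created) gives the same values.
hours <- as.integer(format(created, "%H"))

# Count over all 24 hours so that quiet hours show up as zeros.
per_hour <- table(factor(hours, levels = 0:23))
print(per_hour)

# The hour with the most tweets (the first one in case of ties).
busiest <- names(per_hour)[which.max(per_hour)]
print(busiest)
```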
What is he tweeting about?
## [1] "bpa" "bundesregierung" "deu"
## [4] "fragreg" "für" "kanzlerin"
## [7] "mehr" "merkel" "neue"
## [10] "über" "uhr"
## merkel kanzlerin
## 1.00 0.73
- The most frequent terms are shown above; each of them occurs at least 30 times (the lowfreq=30 threshold passed to findFreqTerms).
- Interestingly, there is no mention of “financial crisis” or other political terms. Mr Seibert seems to focus on announcing “presentations” (präs), press conferences (bpa) or Q&A sessions (fragreg).
- We can also investigate which words are associated. The word “merkel” is associated with “kanzlerin” (chancellor), with a correlation of 0.73, which is no surprise.
- The dendrogram shows other associations: terms relating to press conferences are close to each other (“fragreg”, “uhr”, etc.).
- Interestingly, the term “DEU” (= Germany) is associated with “mehr” (= more). So whatever is being said, it seems to involve “more” for Germany.
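To make the idea of "associated" terms more concrete, here is a toy sketch in base R that builds a small term-document matrix by hand, picks the frequent terms, and measures association as the Pearson correlation between the count rows of two terms. As far as I understand it, this is the same idea that findAssocs in tm is based on, but the sketch is not the tm implementation. The five mini "tweets" are invented, and the clustering uses average linkage on the raw counts for simplicity, which differs from the scaled Ward clustering in the article's code.

```r
# Invented mini "tweets", already lowercased and free of punctuation.
docs <- c(
  "kanzlerin merkel trifft gast",
  "kanzlerin merkel reist",
  "fragreg heute uhr",
  "merkel spricht",
  "bpa heute uhr"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document, cells are counts.
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

# Terms that occur at least twice overall (the analogue of lowfreq in findFreqTerms).
term_freq <- rowSums(tdm)
frequent  <- names(term_freq)[term_freq >= 2]
print(frequent)

# Association of "merkel" with every other term: correlation of the count rows.
assoc <- cor(t(tdm))["merkel", ]
assoc <- sort(assoc[names(assoc) != "merkel"], decreasing = TRUE)
print(round(assoc, 2))

# Hierarchical clustering of the terms; plot(fit) would draw the dendrogram.
fit <- hclust(dist(tdm), method = "average")
```

In this toy data, “kanzlerin” comes out as the term most strongly associated with “merkel”, simply because the two words share most of their documents.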
——————————————————–
The code is below:
## @knitr setup
library(knitr)      # provides opts_chunk, opts_knit and imgur_upload
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
# need to subset, because sometimes there are tweets from 2004...
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01"))
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[,c(1,4,10)]))

## @knitr device
# Code from vignette of twitteR-package
sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
# the source is an HTML anchor: strip the closing tag, then keep the link text
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
pie(table(sources))

## @knitr freq
# one bar per day (binwidth is given in seconds: 86400 s = 1 day)
ggplot() +
  geom_bar(aes(x = created), data = regsprecher.tweets.df, binwidth = 86400.0) +
  scale_y_continuous(name = 'Frequency, # tweets/day') +
  scale_x_datetime(name = 'Date', breaks = date_breaks(), labels = date_format(format = '%Y-%b'))

## @knitr time
# left: hour of day of every single tweet over time
plot1 <- ggplot() +
  geom_point(aes(x = created, y = hour(created)), data = regsprecher.tweets.df, alpha = 0.5) +
  scale_y_continuous(name = 'Hour of day')
# right: number of tweets per hour of day
plot2 <- ggplot() +
  geom_bar(aes(x = hour(created)), data = regsprecher.tweets.df, binwidth = 1.0) +
  scale_x_continuous(name = 'Hour of day', breaks = seq(0, 24, by = 2), limits = c(0, 24)) +
  scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)
# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
# terms whose correlation with 'merkel' is at least 0.20
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2
# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
# convert the sparse term-document matrix to a standard data frame
# (as.matrix() avoids relying on the return value of inspect())
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix

## @knitr words3
# note: R >= 3.1.0 calls this method "ward.D"
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram