This post is a lecture for IS624 Predictive Analytics, which is part of the CUNY Master’s program in Data Analytics.

Twitter is renowned for spawning vibrant communities and discussion of current events. Many services track the popularity of hashtags, but less is known about the statistical characteristics of the timelines associated with them. Time series analysis sets the stage for understanding the properties of hashtags as a discrete phenomenon. However, no two hashtags are the same, which warrants different approaches for different hashtags. In this post, we’ll look at tweets related to R and data science using the query string “#rstats,#datascience,#bigdata,#machinelearning,#dataviz,#ml”. In Twitter’s API, this amounts to a disjunction of six search terms, so a tweet is returned if any of the terms appears in it. We’ll first collect the tweet data for the timeline and transform the JSON-like tree structure into a more analysis-friendly data.frame. Then we’ll use some basic forecasting techniques to predict future activity and measure how accurate those predictions are. At the end I’ll pose some questions about the assumptions made in this analysis and how sound the approach is.

## Set up

install.packages(c('devtools','httr','forecast','lubridate'))
library(devtools)
library(httr)
library(lubridate)
install_github('zatonovo/odessa')
library(odessa)
library(forecast)


## Obtaining the data

Using Twitter’s API it’s possible to track data for a given set of search terms. The twitteR package provides functionality in R to connect to Twitter. However, it can be a bit cumbersome to use, particularly for streaming analysis, so we’ll use an alternative source. Our data comes from the Panoptez Model API, which is offered by Zato Novo[1]. A number of social media models are available via Panoptez, but we’ll only use the hashtag tracker service. Once the tracker is set up by Zato Novo, it’s possible to start downloading data. A simple HTTP request will download data for a given date range and query id (provided by Zato Novo).

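# APP_ID and APP_KEY are placeholders for your own credentials
url <- "http://api.panoptez.ml/v1/twitter/query/timeline/3.json?start=2015-06-04&stop=2015-06-04&app_id=APP_ID&app_key=APP_KEY"
response <- GET(url)
stop_for_status(response)  # stop with an error if the request failed
raw <- content(response)   # parsed JSON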


The httr package takes care of parsing the JSON, which results in a list of lists of tweets. This is an artifact of the data storage, where each record contains 100 tweets. Getting a flattened list of tweets simply requires calling tweets <- do.call(c, raw), which concatenates each list of 100 tweets together. Since tweets are serialized as JSON, they are not immediately compatible with a tabular data structure. Personally I prefer working with data.frames rather than deep list structures as they are more convenient.

Converting a deep list into a data.frame is non-trivial, which is why I wrote the odessa package to (among other things) take care of this for us with the denormalize function. The API is simple: provide a named list and specify the keys to keep or drop. Odessa understands tree structures, so nested keys (i.e. terminal nodes) can be specified using dot notation. The requirement is that values in terminal nodes must be scalar, otherwise they are incompatible with the table structure. As an example, a user object is embedded within tweets by default. Each user has a unique identifier named id_str (the id field is 64-bit, so it’s more convenient to use the string representation since R integers are 32-bit). The id of each tweeter is included by specifying user.id_str in the keep argument. Here we’re including a few other tweet attributes as well.

tweets <- do.call(c, raw)  # one flat list of tweets
df <- denormalize(tweets,
  keep=c('id_str','created_at','user.id_str','user.followers_count'))


The data is now in a useful format. The head of the data looks like

> head(df)
              id_str          created_at user.id_str user.followers_count
1 609034785307152384 2015-06-11 12:29:42  3222850370                   96
2 609034776075481089 2015-06-11 12:29:40   598702629                   69
3 609034770958397441 2015-06-11 12:29:39   621822333                  919
4 609034754260783104 2015-06-11 12:29:35  2704548373                 7746
5 609034753036136448 2015-06-11 12:29:35  2603320279                  240
6 609034743099854848 2015-06-11 12:29:32  3255807904                  118
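
To make the flattening and dot-notation steps more concrete, here is a minimal base R sketch on a toy structure. It is not how odessa is implemented; the helpers get_path and flatten_tweets are hypothetical names introduced only for this illustration, and the tweet values are made up.

```r
# Toy stand-in for the parsed response: two records, each a list of tweets.
# All values here are made up for illustration.
raw <- list(
  list(
    list(id_str = "1", user = list(id_str = "10", followers_count = 96)),
    list(id_str = "2", user = list(id_str = "20", followers_count = 69))
  ),
  list(
    list(id_str = "3", user = list(id_str = "30", followers_count = 919))
  )
)

# Concatenate the per-record lists into one flat list of tweets
tweets <- do.call(c, raw)

# Hypothetical helper: follow a dotted path such as "user.id_str" into one tweet
get_path <- function(x, path) {
  for (key in strsplit(path, ".", fixed = TRUE)[[1]]) x <- x[[key]]
  x
}

# Hypothetical helper: one column per dotted path, one row per tweet
flatten_tweets <- function(tweets, keep) {
  cols <- lapply(keep, function(path) sapply(tweets, get_path, path = path))
  names(cols) <- keep
  as.data.frame(cols, stringsAsFactors = FALSE)
}

toy <- flatten_tweets(tweets, c("id_str", "user.id_str", "user.followers_count"))
print(toy)
```

The sketch only works because every terminal node is a scalar, which is the same requirement described above for denormalize.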


Let’s bin the tweets into one-minute intervals, which gives us a time series of counts that we can use to forecast. A simple approach is to create a column that represents the hour and minute of the tweet timestamp to use as a grouping key. We extract these time components using lubridate and then use some formatting to create the value.

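TIME_FORMAT <- "%a %b %d %H:%M:%S %z %Y"  # Twitter's created_at layout

# Parse the timestamps; POSIXct is easier to store in a data.frame than POSIXlt
df$created_at <- as.POSIXct(strptime(as.character(df$created_at), TIME_FORMAT))
df$hour <- hour(df$created_at)
df$minute <- minute(df$created_at)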
df$time <- sprintf('%02d:%02d', df$hour, df$minute)

The above code gives us a character vector of times in HH:MM format. Creating a contingency table is a simple way of counting the tweets in each bin, which can then be plotted.

counts <- table(df$time)
plot(counts, col='gray60')


Figure 1. Minute counts for #rstats, #datascience, etc.
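
One thing to keep in mind with the contingency table is that table() only lists the minutes that contain at least one tweet, so a quiet minute is dropped rather than counted as zero. The sketch below shows one possible way to keep the empty bins, using made-up timestamps in UTC and base R only. The variable names are illustrative and this is not the code used to produce Figure 1.

```r
# Made-up timestamps for illustration; note that no tweet falls in 12:30
stamps <- as.POSIXct(c("2015-06-11 12:29:42", "2015-06-11 12:29:40",
                       "2015-06-11 12:31:05", "2015-06-11 12:31:59",
                       "2015-06-11 12:31:10"), tz = "UTC")

# Truncate each timestamp to the start of its minute
minute_start <- as.POSIXct(trunc(stamps, units = "mins"))

# Every minute between the first and last tweet, including empty ones
grid <- seq(min(minute_start), max(minute_start), by = "1 min")

# Using the full grid as factor levels makes table() report zero counts too
labels <- format(minute_start, "%H:%M", tz = "UTC")
counts <- table(factor(labels, levels = format(grid, "%H:%M", tz = "UTC")))
print(counts)
```

Whether the empty minutes matter depends on how busy the hashtags are, but a forecasting method that treats the counts as evenly spaced observations implicitly assumes they are present.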

Now we have two representations of the data: one as a data.frame and the other as a plain time series. For the remainder of the post we’ll use the time series, but in a future post, we’ll use the data.frame for additional analysis.

## Basic forecasting methods

In Chapter 2 of Hyndman & Athanasopoulos, we are introduced to a number of forecasting methods. The naive forecast (see ?naive in the forecast package) uses the last observed value as the prediction for every future period, for as many periods as requested. The drift method (see ?rwf) assumes a constant trend, estimated as the average change over the historical data, and extends that trend forward in time. Figure 2 plots the naive method along with the drift method for different window lengths. (Note: I’m not including code for this part since it’s similar to some homework questions.)

Figure 2. Forecasts of tweet counts using various techniques

For the naive and basic drift methods, the whole historical series is used, whereas the last two methods use 60-minute and 15-minute historical windows.
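
As a rough illustration of the two formulas and of what a shorter window changes, here is a hand-rolled sketch on a made-up series. It deliberately avoids the forecast package and is not the code behind Figure 2; naive_fc and drift_fc are hypothetical helper names, and the window of four observations is arbitrary.

```r
# Made-up per-minute counts for illustration
y <- c(12, 15, 9, 14, 18, 11, 16, 20)
h <- 1:5  # forecast horizons, in minutes ahead

# Naive: repeat the last observed value for every horizon
naive_fc <- function(y, h) rep(y[length(y)], length(h))

# Drift: extend the line through the first and last observations,
# which is the same as adding the average historical change per step
drift_fc <- function(y, h) {
  n <- length(y)
  slope <- (y[n] - y[1]) / (n - 1)
  y[n] + h * slope
}

print(naive_fc(y, h))
print(drift_fc(y, h))           # slope estimated from the whole series
print(drift_fc(tail(y, 4), h))  # slope estimated from the last 4 values only
```

Shortening the window leaves the starting point unchanged and only alters the slope, so the windowed variants differ from the basic drift method by how steeply they extrapolate. Figure 2 applies the same ideas to the actual tweet counts.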
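Visually the methods do not seem particularly accurate. The forecast package gives us more precise measures of accuracy. In the call below, the first element of fc holds the actual counts for the test period and the remaining elements hold the four forecasts.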

> lapply(fc[-1], function(x) accuracy(x,fc[[1]]))
[[1]]
          ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
Test set 0.2 7.793159 5.533333 -44.99411 64.05295 -0.2120387 0.4153102

[[2]]
                ME     RMSE     MAE       MPE     MAPE      ACF1 Theil's U
Test set 0.9539267 7.828896 5.53822 -39.59397 61.51452 -0.211353 0.4481512

[[3]]
           ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
Test set -0.6 7.857905 5.746667 -50.72426 67.74956 -0.2067767 0.3812549

[[4]]
                ME     RMSE      MAE      MPE     MAPE       ACF1 Theil's U
Test set -4.066667 9.232712 7.053333 -75.5549 85.36238 -0.1234173 0.2511051
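
Most of the columns above can be reproduced by hand, which may help when weighing them against each other. The sketch below uses made-up numbers and the usual definitions, with the error taken as actual minus forecast. The vectors actual and predicted are illustrative and are not taken from the tweet data; ACF1 and Theil's U are left out for brevity.

```r
# Made-up test-period counts and a flat (naive-style) forecast
actual    <- c(10, 20, 5, 8)
predicted <- c(12, 12, 12, 12)

err <- actual - predicted   # forecast errors
pct <- 100 * err / actual   # percentage errors, relative to the actual value

measures <- c(
  ME   = mean(err),         # mean error (bias)
  RMSE = sqrt(mean(err^2)), # root mean squared error
  MAE  = mean(abs(err)),    # mean absolute error
  MPE  = mean(pct),         # mean percentage error
  MAPE = mean(abs(pct))     # mean absolute percentage error
)
print(round(measures, 3))
```

Because MPE and MAPE divide by the actual value, a minute with very few tweets can dominate them. In this toy example the third observation alone contributes a percentage error of -140.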


We’ve only scratched the surface of the analysis process. In the next post, we’ll look at rolling windows, a time series version of cross-validation, followed by an analysis of residuals. This will feed into a similar analysis of a different hashtag.

## Questions

1. Under what circumstances would these models be appropriate?
2. Which accuracy measure best represents the accuracy? Why?
3. How could partitioning of the data improve the analysis?

Footnotes
1. Disclaimer: I’m the founder of Zato Novo