This post is a lecture for IS624 Predictive Analytics, which is part of the CUNY Master’s program in Data Analytics.
In class, we discussed the characteristics of the #rstats hashtag, and its apparent randomness at a minute frequency. We surmised that numerous factors contribute to this, such as multiple discussions, time zone differences, and scheduled tweets. At this frequency not much can be inferred from the hashtag, although at slower frequencies, it would be possible to measure the general popularity of the set of hashtags. If we partitioned the data by time zone, we could perform a “seasonal” analysis over the days of the week to see when people are talking about #rstats the most. We could also partition the data by location and see where people use these hashtags. Many marketing analytics companies do analyses like these on a user’s tweets to maximize reach.
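The binning and partitioning described above can be sketched in a few lines. Everything here is illustrative: the timestamps and UTC offsets are made-up stand-ins for real #rstats tweet records, since the post does not publish its dataset.

```python
# Sketch of binning tweets at a minute frequency and doing a day-of-week
# breakdown in each user's local time. All records are hypothetical.
from collections import Counter
from datetime import datetime, timedelta

# (timestamp in UTC, user's UTC offset in hours) -- invented for illustration
tweets = [
    (datetime(2015, 6, 1, 14, 3), -4),
    (datetime(2015, 6, 1, 14, 3), -7),
    (datetime(2015, 6, 2, 9, 41), -4),
    (datetime(2015, 6, 3, 9, 5), 1),
]

# Bin at a one-minute frequency by truncating seconds.
per_minute = Counter(ts.replace(second=0, microsecond=0) for ts, _ in tweets)

# "Seasonal" view: shift each tweet into its local time zone, then
# count by day of week to see when people talk about #rstats most.
per_weekday = Counter(
    (ts + timedelta(hours=off)).strftime("%A") for ts, off in tweets
)
```

At a slower (daily or weekly) frequency the same `Counter` trick gives the general-popularity view mentioned above; only the truncation granularity changes.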
We might be tempted to say all hashtags behave like this. If we did, we would miss out on a lot of valuable insight. Let’s look at another hashtag, #PDF15, which is from the Personal Democracy Forum, held in New York City on June 4-5. Since this is an event hashtag where all participants are physically together, its behavior is markedly different.
On the surface the tweet activity may appear equally random, although there is dramatically more activity at the beginning of the window. When we overlay the conference schedule, a different story appears. (I also added a 15-minute moving average for good measure.)
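A moving average like the one overlaid on the plot is simple to compute by hand. This is a minimal sketch with invented per-minute counts; the window length and data are illustrative, not the actual #PDF15 series.

```python
# Trailing moving average over per-minute tweet counts.
# The counts below are made up for illustration.

def moving_average(xs, window):
    """Trailing moving average; uses shorter windows at the start of the series."""
    return [sum(xs[max(0, i - window + 1): i + 1]) / min(i + 1, window)
            for i in range(len(xs))]

counts = [3, 5, 4, 12, 30, 28, 6, 4]   # hypothetical tweets per minute
smoothed = moving_average(counts, window=3)
```

Smoothing at the minute scale is what lets session-level structure show through the minute-to-minute noise.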
What we see is that tweet activity is greater during sessions than during lunch and breaks. This implies that 1) people were actively tweeting about presentations, and 2) people were actually mingling as opposed to being glued to their phones during networking breaks. The exception is at the end of a session, where it appears people take a moment to tweet about the session before heading out. The talks before lunch were very popular and had significantly more activity than other sessions during the measurement period. These sessions were about 15 minutes each, and we can see that they ran behind schedule, bleeding into lunch.
Two interesting time series patterns emerge from this hashtag. The first is that there appears to be a fast exponential decay in the tweet activity at the end of a conference session. This seems like a candidate for survival analysis, which we'll revisit near the end of the course. The second is that the growing tweet activity is highly autocorrelated.
I was surprised by this, but a lag plot shows that this is indeed the case.
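What the lag plot shows visually can also be checked numerically with a lag-1 autocorrelation coefficient. The series below are invented: one grows steadily like the #PDF15 build-up, the other bounces around like the noisier #rstats stream.

```python
# Lag-k sample autocorrelation, computed from scratch.

def autocorr(xs, lag=1):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i - lag] - mean) for i in range(lag, n))
    return cov / var

trending = [1, 2, 4, 7, 11, 16, 22, 29]  # steadily growing, like #PDF15
noisy = [5, 1, 6, 2, 5, 1, 6, 2]         # bouncing around, like #rstats
```

The trending series has a strongly positive lag-1 autocorrelation, while the noisy one does not, which mirrors the contrast between the two hashtags.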
In contrast, #rstats doesn’t show nearly as much autocorrelation.
So what causes this autocorrelation in a conference hashtag? One hypothesis is that the autocorrelation is a manifestation of retweeting. By virtue of the requisite lag between a tweet and a retweet combined with positive feedback loops (and the consequent power law effect), a popular tweet can grow like a wave coming to shore. Indeed, by separating tweets and retweets it appears that the growth of retweets occurs after tweets, which explains the autocorrelation.
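The separation argument can be sketched directly: split the stream on the retweet flag, then compare when each series peaks. The counts below are hypothetical, shaped to match the observed pattern rather than taken from the actual data.

```python
# Hypothetical per-minute counts after splitting the stream by retweet flag.
tweets_per_min   = [8, 20, 15, 6, 3, 2]   # original tweets peak early
retweets_per_min = [1, 4, 12, 18, 10, 5]  # retweets crest a few minutes later

peak_tweets = tweets_per_min.index(max(tweets_per_min))
peak_retweets = retweets_per_min.index(max(retweets_per_min))

# The retweet wave crests after the tweet wave; summing the two series
# therefore yields a combined stream with positive autocorrelation.
assert peak_retweets > peak_tweets
```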
This makes sense since in a conference environment people are talking about the same specific topic, so the effects of retweeting will be more pronounced. In contrast, when many disconnected individuals tweet about the same general topic, the effect is more muted, since there can be overlapping discussions in the same hashtag (and ours includes more than just #rstats).
Since structure exists within the conference hashtag stream, we can make inferences that didn't make sense for a noisier hashtag. For example, it's possible to measure which conference sessions were most tweet-worthy. Conference organizers can evaluate whether there's enough networking going on (inversely proportional to tweeting), or if they need to restructure the event to encourage networking. Presenters can see if people are getting restless at the end of a talk based on whether the tweet frequency increases.
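Ranking sessions by tweet volume is a straightforward join of the schedule against the binned counts. The schedule windows and counts below are hypothetical placeholders for the real #PDF15 program and data.

```python
# Sketch: rank sessions by tweet volume. Windows are (start, end) minute
# offsets into the measurement period; all values are invented.
schedule = {
    "Session A": (0, 3),
    "Session B": (3, 6),
    "Lunch":     (6, 9),
}
counts = [4, 6, 5, 9, 12, 10, 2, 1, 1]  # hypothetical tweets per minute

volume = {name: sum(counts[s:e]) for name, (s, e) in schedule.items()}
most_tweet_worthy = max(volume, key=volume.get)
```

The same grouping, applied to break windows instead of sessions, gives the networking metric mentioned above.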
These two hashtags illustrate how an analysis depends on how a hashtag is used. Some hashtags have very little structure at higher frequencies and must be binned and filtered to isolate structure. Others may be temporary in nature and have clear structure at higher frequencies. It is up to the data scientist to use the correct methods based on the data at hand.