Retweets, Modified Tweets, vias: what’s in the SoMeLab dataset

[This article was first published on SoMe Lab » r-project, and kindly contributed to R-bloggers.]


Since October we have been collecting tweets related to the Occupy movement, and so far we’ve picked up 64,298,104 tweets. In future posts we will give you some insight into our process, but today the question is: what is the difference between new-style retweets, old-style retweets, and newly emerging tags like MT?

In the graphic you can see that by far most tweets in our set are original tweets, not retweets, and that of the 12,159,856 tweeters in our set, only about 13% get retweeted. (The decimal is in the wrong place on the graphic for both retweets and retweeted users. Sorry :-) ) But these retweet numbers represent only what we call ‘new-style’ retweets. Retweets originally emerged out of collective user behavior wherein people forwarded tweets. Over time the standard that emerged was to add “RT @screennam” to the tweet. Sometimes when a message was repeatedly retweeted you might see “RT @screennam2: RT @screennam1” and in this way be able to determine the path a tweet might have taken to get to you.
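As a rough illustration of how such a chain can be read back out of the text, here is a minimal sketch in Python (not the R code behind the plots in this post). The regular expression, the function name `retweet_path`, and the example tweet are all assumptions made up for this sketch; real tweets vary in punctuation and spacing, so a production parser would need to be more forgiving.

```python
import re

# One old-style retweet marker, e.g. "RT @screennam2:" (the colon is optional).
# This pattern is an assumption for illustration, not a Twitter standard.
RT_MARKER = re.compile(r"\bRT\s+@(\w+):?", re.IGNORECASE)


def retweet_path(text):
    """Return the screen names in an old-style RT chain.

    The first name is the account the sender forwarded from; the last name
    is the earliest author still visible in the text.
    """
    return RT_MARKER.findall(text)


# Made-up example tweet following the "RT @screennam2: RT @screennam1" pattern.
example = "RT @screennam2: RT @screennam1: General assembly tonight #OccupyOakland"
path = retweet_path(example)
print(path)
```

Because people often trim earlier markers to fit the character limit, a path recovered this way is a lower bound on the hops a tweet took, not a complete record.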

Now Twitter has added functionality that allows you to retweet without specifically adding the text. This allows researchers (and marketers) to identify retweets using the tweet metadata instead of parsing the text. In fact, the metadata allows us to easily identify the origin of the tweet, that is, who the original tweeter was. But Twitter has no mechanism to allow us to accurately trace the paths.
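To make the metadata approach concrete, here is a small Python sketch. It assumes the tweets are stored as JSON in the shape Twitter's API delivered at the time, where a new-style retweet carries a `retweeted_status` object holding the original tweet and its author. The helper name `new_style_origin` and the two sample tweets are invented for illustration.

```python
def new_style_origin(tweet):
    """Return the original author's screen name for a new-style retweet, else None.

    Assumes `tweet` is a dict parsed from Twitter API JSON, in which a
    new-style retweet embeds the original tweet under "retweeted_status".
    """
    original = tweet.get("retweeted_status")
    if original is None:
        return None
    return original["user"]["screen_name"]


# Made-up sample records (hypothetical users and text).
plain = {"id": 1, "user": {"screen_name": "alice"}, "text": "Marching now"}
retweet = {
    "id": 2,
    "user": {"screen_name": "bob"},
    "text": "RT @alice: Marching now",
    "retweeted_status": plain,
}

print(new_style_origin(plain))
print(new_style_origin(retweet))
```

Note that the embedded object always points at the original tweet, so it gives the origin but says nothing about whose timeline the retweeter actually saw it in, which is why the path cannot be traced this way.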

Further, some folks still use the old style, meaning we still need to parse the text of the tweet to get an accurate count of retweets. And this number can be significant.

In the chart to the right I have shown the ratio of tweets, new-style retweets, and old-style retweets for a subset of our collection (only tweets with the hashtag #OccupyOakland from 10/12/2011 to 11/20/2011). In this set 8% are old-style retweets. But in this 8% I am including modified tweets (MT), retweets (RT), and via.
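A breakdown like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for the chart: it assumes each tweet is a dict with a `text` field and, for new-style retweets, a `retweeted_status` field, and it treats any `RT`, `MT`, or `via` followed by an @-mention as an old-style retweet. The pattern, function names, and sample tweets are all hypothetical.

```python
import re
from collections import Counter

# Old-style attribution markers followed by an @-mention (an assumed pattern).
OLD_STYLE = re.compile(r"\b(RT|MT|via)\s+@\w+", re.IGNORECASE)


def classify(tweet):
    """Label a tweet as a new-style retweet, an old-style retweet, or a plain tweet."""
    # Check the metadata first: new-style retweets may also start with "RT @..."
    # in their text, so testing the text first would miscount them.
    if "retweeted_status" in tweet:
        return "new-style retweet"
    if OLD_STYLE.search(tweet["text"]):
        return "old-style retweet"
    return "tweet"


def shares(tweets):
    """Return the fraction of tweets falling into each category."""
    counts = Counter(classify(t) for t in tweets)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Made-up sample data.
sample = [
    {"text": "On my way to the plaza #OccupyOakland"},
    {"text": "Police line forming at 14th #OccupyOakland"},
    {"text": "RT @alice: Marching now", "retweeted_status": {"text": "Marching now"}},
    {"text": "MT @carol: march starts at 5, bring water #OccupyOakland"},
]

print(shares(sample))
```

Keeping the matched marker (the first group of `OLD_STYLE`) instead of collapsing everything into one label would let you count RT, MT, and via separately.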

I think as researchers start to code text from tweets one thing we need to think about is the different meanings that apply when people use these different mechanisms of attribution. Comments?

In the meantime, here is a bit of network eye candy.

This is a retweet network where the nodes are people and links are cases where someone retweeted someone else. This is from the OccupyOakland data set and is made of just the new-style retweets. Here are some interesting things to note:

Nodes are sized by how many times those users’ tweets were retweeted.

The ring around the outside represents people who tweeted with the OccupyOakland hashtag and were retweeted, but not by anyone in the core of the network.

The core is densely connected, which makes sense for a few reasons. First, this is a collection of retweets over 30 days, so it represents many information flows and connections between people. Second, the OccupyOakland data set has a surprisingly high rate of retweeting. In the first graph we can see the rate of retweeting across the whole set of roughly 64 million tweets is low, about 7% (again, the decimal is in the wrong place on the plot). For the Oakland subset it is 64% for new-style retweets and 72% for new and old combined!
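For readers who want to build a similar network, here is a minimal Python sketch of the data behind such a plot: an edge list of who retweeted whom, plus a per-user count of how often they were retweeted, which is the quantity used to size the nodes. It is not the R code used for the figure; the field names assume Twitter API style JSON with a `retweeted_status` object, and the users and tweets are made up.

```python
from collections import Counter


def retweet_edges(tweets):
    """Count (retweeter, original author) pairs among new-style retweets."""
    edges = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original is not None:
            pair = (tweet["user"]["screen_name"], original["user"]["screen_name"])
            edges[pair] += 1
    return edges


def times_retweeted(edges):
    """Sum edge weights per original author; this is the node-size value."""
    sizes = Counter()
    for (_retweeter, author), weight in edges.items():
        sizes[author] += weight
    return sizes


def rt(retweeter, author):
    """Build a made-up new-style retweet record for the example."""
    return {
        "user": {"screen_name": retweeter},
        "retweeted_status": {"user": {"screen_name": author}},
    }


# Hypothetical sample: one plain tweet and four retweets.
sample = [
    {"user": {"screen_name": "alice"}},
    rt("bob", "alice"),
    rt("carol", "alice"),
    rt("carol", "alice"),
    rt("alice", "dave"),
]

edges = retweet_edges(sample)
sizes = times_retweeted(edges)
print(edges)
print(sizes)
```

The weighted edge list can be handed to any graph library for layout; the sketch stops short of drawing because the layout used in the figure is not described here.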


All of these visualizations were created using R. I’m happy to post example code if anyone is interested.


