June 29, 2012

(This article was first published on distributed ecology, and kindly contributed to R-bloggers)

There’s a great Tom Waits song from the album “Mule Variations” called “Big in Japan”. The beauty of saying you’re big in Japan is that no one can ever really verify the statement (or at least that was more true in 1999). You might assert “my work is big on Twitter”, and hey, how would I know? I think we’re all agreed now that if you’re a scientist, being big on Twitter is important. But how much exposure does your work actually get on Twitter? In the research world, people are working on lots of interesting ways of measuring the impact of an article. People like Heather Piwowar, who cofounded total impact, are working to change how we measure the importance of a paper. More and more people want to look at a paper’s impact beyond just how many times other researchers cite it. That’s where projects like altmetrics and article-level metrics from PLoS come in. These are all great tools, and I don’t doubt they are the future of how we measure impact. But what if you want to look under the hood of Twitter and see what’s going on with a given research article? There are lots of web-based tools (like tweetreach), but none of them offer a concise way to extract and store Twitter data about the impact of scientific articles. Enter impacTwit (a slightly tongue-in-cheek name).

impacTwit is a collection of R functions that outputs data about who tweeted and retweeted about any collection of search terms, in a data frame that you can easily plot. It gives you the time stamp, originating tweeter, and follower count of each tweet matching a vector of search terms. It sorts them all by date and gives you cumulative sums for the entire set of search terms, cumulative sums by originating tweeter, and cumulative sums by search term. It lets you easily dissect which people or sources are influential about a paper, and gives you a sense of the total impact on Twitter. Total impact here is defined as the number of potential viewers of a tweet. Before I give a worked example, I’ll offer two caveats about “total impact”. Just because a tweet is retweeted to 10,000 people doesn’t mean they all see it, and even if they see it, how many actually click on the article link to go read it? It is an imperfect metric, to be sure.
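To make the bookkeeping described above concrete, here is a minimal sketch in base R of the three kinds of cumulative sums. It is not impacTwit's actual code: the toy data and every column name (`time`, `tweeter`, `term`, `followers`, and the `cum_*` columns) are assumptions made up for illustration.

```r
# Toy data standing in for the kind of data frame described above.
# The values and all column names are hypothetical, not impacTwit output.
tweets <- data.frame(
  time      = as.POSIXct(c("2012-06-26 11:00:00", "2012-06-26 09:00:00",
                           "2012-06-27 08:15:00", "2012-06-26 09:30:00"),
                         tz = "UTC"),
  tweeter   = c("A", "A", "B", "B"),
  term      = c("title", "title", "title", "link"),
  followers = c(50, 1200, 4000, 300),
  stringsAsFactors = FALSE
)

# Sort by time stamp so every cumulative sum runs in chronological order.
tweets <- tweets[order(tweets$time), ]

# Cumulative potential viewers across the whole set of search terms.
tweets$cum_total <- cumsum(tweets$followers)

# Cumulative potential viewers within each originating tweeter.
tweets$cum_tweeter <- ave(tweets$followers, tweets$tweeter, FUN = cumsum)

# Cumulative potential viewers within each search term.
tweets$cum_term <- ave(tweets$followers, tweets$term, FUN = cumsum)

print(tweets)
```

Here `ave()` applies `cumsum` within each group while keeping the rows in their sorted order, which is what lets the three running totals sit side by side in one data frame.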

test.str <- c("Scientists think math is hard too",
              "http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
              "Heavy use of equations impedes communication among biologists")
tweet.dat <- impacTwit(test.str)
We can then generate a series of plots from the resulting data frame. The first one is just a cumulative sum of the total impact.

Here is just the total number of potential viewers of the article from all people and all sources.  OK, so that’s interesting: it topped out at over 120,000 potential views.  What if we want to know who was influential about this?  Well, we can subset our data frame so it only has the top retweets, from sources with 5 or more retweets, and plot those by originating tweeter.  The first plot has the top sources all plotted on a relative time scale, so the x-axis is time since the original tweet that was retweeted.
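That subsetting step can be sketched roughly as follows. This is an illustration only, not the parsing code from the repository: the data are invented, the column names (`source`, `time`, `followers`) are hypothetical, and the "original tweet" is approximated by the earliest row for each source.

```r
# Hypothetical retweet records: one row per retweet, labelled by the
# originating tweeter ("source"). All values are invented.
start <- as.POSIXct("2012-06-26 09:00:00", tz = "UTC")
rts <- data.frame(
  source    = c(rep("src_a", 5), rep("src_b", 2), rep("src_c", 6)),
  time      = start + 3600 * c(0, 1, 2, 5, 9,  3, 4,  10, 11, 12, 13, 20, 30),
  followers = c(100, 250, 40, 900, 10,  5000, 20,  30, 30, 60, 15, 700, 5),
  stringsAsFactors = FALSE
)

# Keep only sources with 5 or more retweets.
counts <- table(rts$source)
keep   <- names(counts)[counts >= 5]
top    <- rts[rts$source %in% keep, ]
top    <- top[order(top$source, top$time), ]

# Relative time scale: hours since the earliest row for each source
# (an assumption standing in for the time of the original tweet).
top$hours_since_first <- ave(as.numeric(top$time), top$source,
                             FUN = function(t) (t - min(t)) / 3600)

# Cumulative potential viewers within each source.
top$cum_followers <- ave(top$followers, top$source, FUN = cumsum)

print(top)
```

Plotting `cum_followers` against `hours_since_first` puts every source on the same relative clock, while plotting it against `time` gives the absolute view. Note that `src_b` has a single high-follower retweet but is dropped by the count filter, which is one reason retweet count and impact can diverge.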

We can also plot this on an absolute time scale to see when these retweets came into the stream.  As you can see, @PlanktonMath was influential early, whereas others came late to the game.  @BioScienceMum had 5 retweets, but all from low-impact accounts, so retweet count isn’t always a good measure of impact.

Finally let’s examine this by source.  Our search string had 3 terms.  I parsed those out as the AP story, the direct link to the scientific article, and just the title.  The AP is all popular press, the direct link only science, and the title is a bit of both (non-AP sources plus some posts with the direct link).  impacTwit can do plots like both of the above, but here’s just one on an absolute time scale.
It’s clear that people tweeting about the article itself were less impactful, but they tweeted about it for longer.  The AP tweets make a big splash and then they’re gone.  You can try this all out yourself with the code over at my GitHub, which has this example fully documented (including the parsing for the figures).  I’m open to any suggestions about features or improvements.
