A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi

December 21, 2011
By

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

Another set of stepping stones that provide a clunky route to a solution that @mhawksey has been working on a far more elegant expression of (eg Free the tweets! Export TwapperKeeper archives using Google Spreadsheet and Twitter: How to archive event hashtags and create an interactive visualization of the conversation)…

The recipe is as follows:

– download a Twapperkeeper archive to a CSV file using a Python script as described in Python Script for Exporting (Large) Twapperkeeper Archives By User; the CSV file should contain a single column with one row per archive entry; each row includes the sender, the tweet, the tweet ID and a timestamp;

– in an R environment (I use RStudio), reuse code from Rescuing Twapperkeeper Archives Before They Vanish and Cornelius Puschmann’s post Generating graphs of retweets and @-messages on Twitter using R and Gephi:

require(stringr)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

twapperkeeperCSVParse=function(fp){
    df = read.csv(fp, header=F)
    df$from=sapply(df$V1,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    df$id=sapply(df$V1,function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
    df$txt=sapply(df$V1,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}
#usage: 
#twarchive.df=twapperkeeperRescue("PATH_TO_YOUR_FILE")
#For example:
df=twapperkeeperRescue("~/code/twapps/reports/twArchive_online11.txt")

ats.df <- data.frame(df$from,df$to)
rts.df <- data.frame(df$from,df$rt)

#Cribbing http://blog.ynada.com/339
require(igraph)
ats.g <- graph.data.frame(ats.df, directed=T)
rts.g <- graph.data.frame(rts.df, directed=T)

write.graph(ats.g, file="ats.graphml", format="graphml")
write.graph(rts.g, file="rts.graphml", format="graphml")

– Cornelius’ code uses the igraph library to construct a graph and export graphml files that describe graphs of at behaviour (tweets in the archive sent from one user to another) and RT behaviour (tweets from one person retweeting another using the RT @name convention).

– visualise the graphml files in Gephi. Note a couple of things – empty nodes aren’t handled properly in my version of the code, so the graph includes a dummy node that all non-at or non-RT row tweet senders point to; when you visualise the graph, this node will be obvious, so just delete it 😉

– the Gephi visualisation by default uses the Label attribute for labeling nodes – we need to change this:

Gephi - setting node label choice

You should now be able to view graphs that illustrate RT or @ behaviour as captured in a Twapperkeeper archive in Gephi.

ILI2011 RT behaviour

Just by the by, we can also generate stats’n graphs of the contents of the archive. For example, via Getting Started With Twitter Analysis in R, we can generate a bar plot to show who was retweeted most:

require(ggplot2)

ggplot()+geom_bar(aes(x=na.omit(df$rt)))+opts(axis.text.x=theme_text(angle=-90,size=6))+xlab(NULL)

We can also do some counting to find out who was RT’d the most, for example:

#count the occurrences of each name in the rt column
rt.count=data.frame(table(df$rt))
#sort the results in descending order and display the top 5 results
head(rt.count[order(-rt.count$Freq),],5)
#There are probably better ways of doing that! If so, let me know via comments

Next on the to do list is:
– automate the production of archive reports
– work in the time component so we can view behaviour over time in Gephi… (here’s a starting point maybe, again from Cornelius Puschmann’s blog: Dynamic Twitter graphs with R and Gephi (clip and code))

As things stand though, I may not be able to get round to either of those for a while…

To leave a comment for the author, please follow the link and comment on their blog: OUseful.Info, the blog... » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.

Sponsors

Mango solutions



RStudio homepage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



http://www.eoda.de









ODSC

CRC R books series











Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)