Generating graphs of retweets and @-messages on Twitter using R and Gephi

October 17, 2010
By

(This article was first published on Cornelius Puschmann's Blog » R, and kindly contributed to R-bloggers)

After recently discovering the excellent methods section on mappingonlinepublics.net, I decided it was time to document my own approach to Twitter data. I’ve been messing around with R and igraph for a while, but it wasn’t until I discovered Gephi that things really moved forward. R/igraph are great for preprocessing the data (not sure how they compare with Awk), but rather cumbersome to work with when it comes to visualization. Last week, I posted a first Gephi visualization of retweeting at the Free Culture Research Conference and since then I’ve experimented some more (see here and here). #FCRC was a test case for a larger study that examines how academics use Twitter at conferences, which is part of what we’re doing at the junior researchers group Science and the Internet at the University of Düsseldorf (sorry, website is currently in German only).

Here’s a step-by-step description of how those graphs were created.

Step #1: Get tweets from Twapperkeeper
Like Axel, I use Twapperkeeper to retrieve tweets tagged with the hashtag I’m investigating. This has several advantages:

  • it’s possible to retrieve older tweets which you won’t get via the API
  • tweets are stored as CSV rather than XML which makes them easier to work with for our purposes.

The sole disadvatage of Twapperkeeper is that we have to rely on the integrity of their archive — if for some reason not all tweets with our hastag have been retrieved, we won’t know. Also, certain information is not retained in Twapperkeepers’ CSV files that is present in Twitter’s XML (e.g. geolocation) that we might be interested in.

Instructions:

  1. Search for the hashtag you’re interested in (e.g. #FCRC). If no archive exists, create one.
  2. Go to the archive’s Twapperkeeper page, sign into Twitter (button at the top) and then choose export and download at the bottom of the page
  3. Choose the pipe character (“|”) as seperator. I use that one rather than the more conventional comma or semicolon because we are dealing with text data which is bound to contain these characters a lot. Of course the pipe can also be parsed incorrectly, so be sure to have a look at the graph file you make.
  4. Voila. You should now have a CSV file containing tweets on your hard drive. Edit:Actually, you have a .tar file that contains the tweets. Look inside the .tar for a file with a very long name ending with “-1″ (not “info”) — that’s the data we’re looking for.

Step #2: Turn CSV data into a graph file with R and igraph
R is an open source statistics package that is primarily used via the command line. It’s absolutely fantastic at slicing and dicing data, although the syntax is a bit quirky and the documentation is somewhat geared towards experts (=statisticians). igraph is an R package for constructing and visualizing graphs. It’s great for a variety of purposes, but due to the command line approach of R, actually drawing graphs with igraph was somewhat difficult for me. But, as outlined below, Gephi took care of that. Running the code below in R will transform the CSV data into a GraphML file which can then be visualized with Gephi. While R and igraph rock at translating the data into another format, Gephi is the better tool for the actual visualization.

Instructions:

  1. Download and install R.
  2. In the R console, run the following: install.packages(igraph);
  3. Copy the CSV you’ve just downloaded from Twapperkeeper to an empty directory and rename it to tweets.csv.
  4. Finally, save the R file below to the same folder as the CSV and run it.

Code for extracting RTs and @s from a Twapperkeeper CSV file and saving the result in the GraphML format:

?Download tweetgraph.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# Extract @-message and RT graphs from conference tweets
library(igraph);
 
# Read Twapperkeeper CSV file
tweets <- read.csv("tweets.csv", head=T, sep="|", quote="", fileEncoding="UTF-8");
print(paste("Read ", length(tweets$text), " tweets.", sep=""));
 
# Get @-messages, senders, receivers
ats <- grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
at.sender <- tolower(as.character(tweets$from_user[grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
at.receiver <- gsub("^\\.?@([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", ats, perl=T);
print(paste(length(ats), " @-messages from ", length(unique(at.sender)), " senders and ", length(unique(at.receiver)), " receivers.", sep=""));
 
# Get RTs, senders, receivers
rts <- grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
rt.sender <- tolower(as.character(tweets$from_user[grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
rt.receiver <- gsub("^rt @([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", rts, perl=T);
print(paste(length(rts), " RTs from ", length(unique(rt.sender)), " senders and ", length(unique(rt.receiver)), " receivers.", sep=""));
 
# This is necessary to avoid problems with empty entries, usually caused by encoding issues in the source files
at.sender[at.sender==""] <- "<NA>";
at.receiver[at.receiver==""] <- "<NA>";
rt.sender[rt.sender==""] <- "<NA>";
rt.receiver[rt.receiver==""] <- "<NA>";
 
# Create a data frame from the sender-receiver information
ats.df <- data.frame(at.sender, at.receiver);
rts.df <- data.frame(rt.sender, rt.receiver);
 
# Transform data frame into a graph
ats.g <- graph.data.frame(ats.df, directed=T);
rts.g <- graph.data.frame(rts.df, directed=T);
 
# Write sender -> receiver information to a GraphML file
print("Write sender -> receiver table to GraphML file...");
write.graph(ats.g, file="ats.graphml", format="graphml");
write.graph(rts.g, file="rts.graphml", format="graphml");

Step #3: Visualize graph with Gephi
Once you’ve completed steps 1 and 2, simply open your GraphML file(s) with Gephi. You should see a visualization of the graph. I won’t give an in-depth description of how Gephi works, but the users section of gephi.org has great tutorials which explain both Gephi and graph visualization in general really well.

I’ll post more on the topic as I make further progress, for example with stuff like dynamic graphs which show change in the network over time.

To leave a comment for the author, please follow the link and comment on his blog: Cornelius Puschmann's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.