Data preparation for Social Network Analysis using R and Gephi

I want to share my experience in generating the data for social network analysis using R and analyzing it using Gephi…

I quickly realized that edge lists and adjacency matrices become difficult to work with as the graph size increases. So I needed an alternative graph format that was efficient (for storage) and flexible enough to capture details like edge weight. I chose Gephi’s GEXF file format as it can handle large graphs, and it supports dynamic and hierarchical structures. Check out the GEXF comparison with other formats for details.
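To make the format concrete, here is a minimal sketch of writing a weighted, directed graph to a GEXF file with base R only. The function names (xml_escape, write_gexf), the column names and the sample addresses are hypothetical, and the namespace and version are those of GEXF 1.2; a given Gephi release may expect an earlier draft, so treat those two strings as an assumption to check.

```r
# Escape the characters that are not allowed inside an XML attribute value.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)   # must run first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# nodes: data frame with columns id, label
# edges: data frame with columns source, target, weight
write_gexf <- function(nodes, edges, path) {
  node_xml <- sprintf('      <node id="%s" label="%s"/>',
                      nodes$id, xml_escape(nodes$label))
  edge_xml <- sprintf('      <edge id="%d" source="%s" target="%s" weight="%s"/>',
                      seq_len(nrow(edges)) - 1L,
                      edges$source, edges$target, edges$weight)
  lines <- c('<?xml version="1.0" encoding="UTF-8"?>',
             '<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">',
             '  <graph mode="static" defaultedgetype="directed">',
             '    <nodes>', node_xml, '    </nodes>',
             '    <edges>', edge_xml, '    </edges>',
             '  </graph>',
             '</gexf>')
  # Write the bytes as UTF-8 regardless of the session's locale.
  writeLines(enc2utf8(lines), path, useBytes = TRUE)
  invisible(path)
}

# Made-up sample data: three people, three weighted edges.
nodes <- data.frame(id = 1:3,
                    label = c("alice@example.com", "bob@example.com",
                              "carol@example.com"),
                    stringsAsFactors = FALSE)
edges <- data.frame(source = c(1, 2, 1),
                    target = c(2, 3, 3),
                    weight = c(12, 3, 1))

gexf_path <- file.path(tempdir(), "emails.gexf")
write_gexf(nodes, edges, gexf_path)
```

Building the lines with vectorised sprintf calls avoids growing an XML tree in memory, which matters once the edge count runs into the millions.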

As I tried to process millions of rows of email logs to derive the edge list, I realized a couple of things…

1) R cannot handle data larger than my computer’s RAM, so I had to look for a way to use R with large data sets. R packages like RMySQL and sqldf came in handy for this. sqldf uses SQLite, an embedded database that it runs in memory by default. If your data cannot fit into RAM, you can instruct sqldf to have SQLite use a persistent (on-disk) store instead.
Note: There are many other ways to handle large data in R effectively, e.g. the R multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways that you have used…

2) Some operations are better suited to a database/RDBMS: I offloaded RDBMS-suited tasks to SQLite, the default database used by sqldf.
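Points 1 and 2 can be sketched together: let SQLite do the grouping, and point sqldf at a file so the working database lives on disk rather than in RAM. This is only an illustration, assuming the sqldf package is installed; the emails data frame and its sender/recipient columns are made up (the names from and to are avoided because they are SQL keywords).

```r
library(sqldf)

# Made-up stand-in for the email log: one row per message.
emails <- data.frame(
  sender    = c("a@example.com", "a@example.com", "b@example.com"),
  recipient = c("b@example.com", "b@example.com", "c@example.com"),
  stringsAsFactors = FALSE
)

# A file name for the SQLite database. Passing it as dbname makes sqldf
# use an on-disk database instead of the default in-memory one.
db_file <- tempfile(fileext = ".sqlite")

# Collapse messages into a weighted edge list inside SQLite.
edges <- sqldf(
  "SELECT sender AS source, recipient AS target, COUNT(*) AS weight
     FROM emails
    GROUP BY sender, recipient
    ORDER BY sender, recipient",
  dbname = db_file
)

print(edges)
```

The same GROUP BY could be done with aggregate() in R, but pushing it into the database keeps the intermediate results out of the R session.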

3) Learn memory management in R:
– By default R allocates ~1.5GB memory for its use. I allocated more memory for R to handle larger objects using the command “memory.limit(size=3000)” (size is in MB; this function is specific to R on Windows).
– Remove unwanted objects from the R session e.g.
rm(raw_emails, emails, to_nodes,from_nodes,all_nodes, unique_nodes)
gc() # call garbage collection explicitly
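The cleanup pattern above can be sketched end to end as follows. The data here is randomly generated and purely hypothetical; the point is only the sequence: check the size of a large intermediate object, keep the small derived result, then drop the large object and collect garbage.

```r
# Hypothetical stand-in for a large intermediate object.
set.seed(1)
raw_emails <- data.frame(
  sender    = sample(letters, 1e5, replace = TRUE),
  recipient = sample(letters, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

# See how much memory the object occupies before deciding what to drop.
print(object.size(raw_emails), units = "Mb")

# Keep only the small derived result: one row per sender/recipient pair.
pair_counts <- as.data.frame(
  table(sender = raw_emails$sender, recipient = raw_emails$recipient),
  responseName = "weight",
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$weight > 0, ]

# The raw rows are no longer needed, so release them.
rm(raw_emails)
invisible(gc())  # explicit garbage collection, as in the snippet above
```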

Gephi wasn’t able to handle very large graph files (e.g. for files > 500MB size, Gephi was either too slow or stopped responding). So I had to do a couple of things…

1) Increase the amount of memory Gephi allocates for the JVM at startup: By default Gephi allocates 512MB of memory for the JVM. This wasn’t enough to load the large graph file, so I increased the maximum memory Gephi allocates for the JVM to 1.4GB.

Edit the C:\Program Files\Gephi-0.7\etc\gephidesktop.conf file and change the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m" to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"

2) Decrease the file size by reducing the text in the graph file, e.g. use shorter node_ids, edge_ids, etc.
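One way to get shorter ids, sketched here with made-up addresses, is to replace each email address with a small integer and keep the address only once, as the node label. The column names are hypothetical.

```r
# Made-up edge list keyed by full email address.
edges <- data.frame(
  from = c("alice@example.com", "bob@example.com", "alice@example.com"),
  to   = c("bob@example.com", "carol@example.com", "carol@example.com"),
  stringsAsFactors = FALSE
)

# Every distinct address gets one compact integer id.
addresses <- unique(c(edges$from, edges$to))
nodes <- data.frame(id = seq_along(addresses),
                    label = addresses,
                    stringsAsFactors = FALSE)

# Edges now refer to the integer ids instead of repeating the long strings.
edges$source <- match(edges$from, addresses)
edges$target <- match(edges$to, addresses)

print(nodes)
print(edges[, c("source", "target")])
```

Each address then appears once in the file (in the node list) instead of once per edge, which is where most of the saving comes from in a dense email graph.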

Also, Gephi complained about incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in Textpad and saving it in UTF-8 format before feeding it to Gephi.
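The manual re-save can also be done from R. The sketch below assumes the file was originally written in Latin-1, which is a guess about the source encoding and not something the error message tells you; convert_to_utf8 is a hypothetical helper name and the sample file is made up.

```r
# Re-encode a text file to UTF-8. `from` is the assumed source encoding.
convert_to_utf8 <- function(src, dst, from = "latin1") {
  lines <- readLines(src, warn = FALSE)            # bytes are read as they are
  utf8  <- iconv(lines, from = from, to = "UTF-8")
  if (anyNA(utf8)) stop("some lines could not be converted from ", from)
  # useBytes = TRUE writes the UTF-8 bytes untouched, whatever the locale.
  writeLines(utf8, dst, useBytes = TRUE)
  invisible(dst)
}

# Made-up sample: a node label containing a Latin-1 encoded e-acute (0xE9).
src <- tempfile(fileext = ".gexf")
dst <- tempfile(fileext = ".gexf")
writeBin(c(charToRaw('<node id="1" label="Andr'),
           as.raw(0xe9),
           charToRaw('"/>\n')),
         src)

convert_to_utf8(src, dst)

# TRUE when every line of the converted file is valid UTF-8.
all(validUTF8(readLines(dst, warn = FALSE)))
```

The encoding named in the XML declaration at the top of the file should also say UTF-8, otherwise a parser may still reject it.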

1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It’s fun and rewarding.

2) There are other sophisticated tools for visual social network analysis, like Network Workbench. I will explore it for heavy analysis, but Gephi is very easy to use and continues to be my favorite.

3) Use a machine with a lot of RAM, as both Gephi and R are memory hungry.

By the way, here’s the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I’m sure there are better ways to accomplish this. Please shout if you notice any.

