Data preparation for Social Network Analysis using R and Gephi

June 2, 2010

(This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers)

I want to share my experience in generating the data for social network analysis using R and analyzing it using Gephi…

I quickly realized that edge lists and adjacency matrices become difficult to work with as the graph size increases. So I needed an alternative graph format that was efficient (for storage) and flexible enough to capture details like edge weight. I chose Gephi’s GEXF file format because it can handle large graphs and supports dynamic and hierarchical structure. Check out the GEXF comparison with other formats for details.
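To make the format concrete, here is a minimal sketch in base R of writing a weighted edge list as a GEXF file. The helper names (xml_escape, write_gexf), the sample data and the 1.2draft namespace are assumptions for illustration, not the code used for this post; check the GEXF specification for the version your Gephi release expects.

```r
# Escape the characters that are special inside XML attribute values.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# Hypothetical helper: write nodes (id, label) and edges
# (source, target, weight) as a static, directed GEXF graph.
write_gexf <- function(nodes, edges, path) {
  node_xml <- sprintf('      <node id="%s" label="%s"/>',
                      xml_escape(as.character(nodes$id)),
                      xml_escape(as.character(nodes$label)))
  edge_xml <- sprintf('      <edge id="%d" source="%s" target="%s" weight="%s"/>',
                      seq_len(nrow(edges)) - 1L,
                      xml_escape(as.character(edges$source)),
                      xml_escape(as.character(edges$target)),
                      as.character(edges$weight))
  lines <- c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    # Namespace/version are an assumption; adjust to your Gephi release.
    '<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">',
    '  <graph mode="static" defaultedgetype="directed">',
    '    <nodes>', node_xml, '    </nodes>',
    '    <edges>', edge_xml, '    </edges>',
    '  </graph>',
    '</gexf>'
  )
  # Write the UTF-8 bytes directly so the result does not depend on the locale.
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(lines), con, useBytes = TRUE)
  invisible(path)
}

# Made-up sample data: two people, three emails from node 1 to node 2.
nodes <- data.frame(id = c(1, 2), label = c("alice", "bob & co"),
                    stringsAsFactors = FALSE)
edges <- data.frame(source = 1, target = 2, weight = 3)
gexf_path <- write_gexf(nodes, edges, tempfile(fileext = ".gexf"))
```

Each edge carries its weight as an attribute, which is the detail a plain edge list or adjacency matrix makes awkward to store alongside other metadata.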

As I tried to process millions of rows of email logs to derive the edge list, I realized a few things…

1) R cannot handle data larger than my computer’s RAM. So I had to look for a way to use R for large data sets. R packages like RMySQL and sqldf came in handy for this. sqldf uses SQLite, an embedded database that it runs in memory by default. If your data cannot fit into RAM, you can instruct SQLite to use a persistent (on-disk) store for handling large data sets.
Note: There are many other ways to handle large data in R effectively, e.g. the multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways that you have used…

2) Some operations are better suited to a database/RDBMS: I offloaded RDBMS-suited tasks to SQLite, the default database used by sqldf.

3) Learn memory management in R:
– By default R allocates ~1.5GB of memory for its use. I allocated more memory so R could handle larger objects using the Windows-only command memory.limit(size=3000) (the size is given in MB).
– Remove unwanted objects from the R session, e.g.:
rm(raw_emails, emails, to_nodes,from_nodes,all_nodes, unique_nodes)
gc() # call garbage collection explicitly
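
As an illustration of points 1 and 2, here is a minimal sketch of offloading an aggregation to SQLite through sqldf while keeping the database on disk. The table and column names (emails, from_addr, to_addr) and the sample rows are made up for the example; it assumes the sqldf package is installed.

```r
library(sqldf)

# Hypothetical email log: one row per (sender, recipient) message.
emails <- data.frame(
  from_addr = c("a@x.com", "a@x.com", "b@x.com", "a@x.com"),
  to_addr   = c("b@x.com", "b@x.com", "c@x.com", "c@x.com"),
  stringsAsFactors = FALSE
)

# dbname = tempfile() asks sqldf to use an on-disk SQLite database
# instead of its default in-memory one.
edges <- sqldf(
  "SELECT from_addr, to_addr, COUNT(*) AS weight
     FROM emails
    GROUP BY from_addr, to_addr
    ORDER BY from_addr, to_addr",
  dbname = tempfile()
)
```

The GROUP BY collapses repeated sender/recipient pairs into a single weighted edge, which is the kind of task a database tends to handle more comfortably than a large in-memory data frame.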

Gephi wasn’t able to handle very large graph files (e.g. for files > 500MB size, Gephi was either too slow or stopped responding). So I had to do a couple of things…

1) Increase the amount of memory Gephi allocates for the JVM at startup: By default, Gephi allocates 512MB of memory for the JVM. This wasn’t enough to load the large graph file, so I increased the maximum memory Gephi allocates for the JVM to 1.4GB.

Edit the C:\Program Files\Gephi-0.7\etc\gephidesktop.conf file, changing the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m" to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"

2) Decrease the file size by reducing the text in the graph file e.g. use shorter node_ids, edge_ids etc.
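
One way to shorten the IDs, sketched in base R with made-up addresses: replace each long email address with a small integer and keep the address only once, as the node label. The data frame and column names here are assumptions for illustration.

```r
# Hypothetical edge list keyed by full email addresses.
edges <- data.frame(
  from = c("alice@example.com", "alice@example.com", "bob@example.com"),
  to   = c("bob@example.com", "carol@example.com", "carol@example.com"),
  stringsAsFactors = FALSE
)

# Every distinct address gets a short integer id; the long text is
# stored once per node instead of once per edge.
addresses <- unique(c(edges$from, edges$to))
nodes <- data.frame(id = seq_along(addresses), label = addresses,
                    stringsAsFactors = FALSE)

# Edges now reference the short ids only.
short_edges <- data.frame(
  source = match(edges$from, addresses),
  target = match(edges$to, addresses)
)
```

Because a graph usually has far more edges than nodes, moving the long strings out of the edge records is where most of the file-size saving comes from.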

Also, Gephi complained about incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in Textpad and saving it in UTF-8 format before feeding it to Gephi.
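
The same re-encoding step can also be sketched in R instead of a text editor. The function name reencode_to_utf8 is hypothetical, and the source encoding ("latin1") is an assumption; substitute whatever encoding your file was actually written in.

```r
# Hypothetical helper: read a text file written in a known encoding
# and write it back out as UTF-8.
reencode_to_utf8 <- function(in_path, out_path, from = "latin1") {
  txt <- readLines(in_path, warn = FALSE)
  txt <- iconv(txt, from = from, to = "UTF-8")
  # Write the converted bytes as-is, independent of the session locale.
  con <- file(out_path, open = "wb")
  on.exit(close(con))
  writeLines(txt, con, useBytes = TRUE)
  invisible(out_path)
}

# Demo: a one-line Latin-1 file containing "caf" plus an accented e (0xE9).
in_path  <- tempfile(fileext = ".gexf")
out_path <- tempfile(fileext = ".gexf")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), in_path)
reencode_to_utf8(in_path, out_path)
```

Writing the graph file as UTF-8 in the first place avoids the extra pass entirely.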

1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It’s fun and rewarding.

2) There are other sophisticated tools for visual social network analysis, like Network Workbench. I will explore it for heavy analysis, but Gephi is very easy to use and continues to be my favorite.

3) Use a machine with a lot of RAM, as both Gephi and R are memory hungry.

By the way, here’s the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I’m sure there are better ways to accomplish this. Please shout if you notice any.
