Experiments with igraph
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Networks – social and biological – are all the rage, just now. Indeed, a recent entry at Duncan’s QOTD described the “hairball” network representation as the dominant cultural icon in molecular biology.
I’ve not had occasion to explore networks “professionally”, but have always been fascinated by both networks and the tools used to analyse them. My grasp of graph theory, the mathematics behind networks, is more or less summarised by this Wikipedia page. I’ve also been exploring the igraph library and thought I’d share a few of my “experiments with igraph”. As I say, I’m learning myself as I go along, so none of this should be taken as professional advice.
Let’s start with my favourite network – FriendFeed of course – and ask a few questions about everyone’s favourite group, The Life Scientists (TLS).
My initial thought was: to what extent is TLS a community? In other words, do people who subscribe to TLS also tend to subscribe to one another?
1. Fetching the raw data from FriendFeed
To begin, I used the FriendFeed API to fetch subscribers to TLS and then subscriptions within TLS subscribers. It’s becoming confusing already – roll the Ruby:
#!/usr/bin/ruby require "open-uri" require "json/pure" # get TLS feed info subscribers = [] # auth = [USER, REMOTE_KEY] tls = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/the-life-scientists", :http_basic_authentication => auth).read) # only public feeds tls['subscribers'].each do |sub| if sub['private'].nil? subscribers << sub['id'] end end # user subscriptions within TLS puts "There are #{subscribers.length} public feeds." network = [] subscribers.each do |id| puts "Fetching #{id}..." feed = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/#{id}").read) network << [id, "the-life-scientists"] feed['subscriptions'].each do |s| if subscribers.include?(s['id']) network << [id, s['id']] end end sleep 2 end # write out pairs File.open("network.csv", "w") do |f| network.each do |pair| f.puts(pair.join(",")) end end
Lines 6-15 fetch information about the TLS feed. A few words about privacy. If you uncomment line 8 and substitute your FriendFeed user name and remote key, line 9 will make an authenticated request which returns information for users with private feeds, if you are subscribed to them. However, those people may not want their data exposed to non-subscribers. So, in lines 11-15, we take user names and only store them in an array if their feed is public. The upshot of all that is that of 1 323 users that I can access, we use only 1 128 with public feeds.
Lines 17-30 process each TLS subscriber and their subscriptions, storing the results in the network array. Only subscriptions to users who are also subscribed to TLS are stored (lines 24-28), plus the subscription to TLS (line 23). There’s also a small pause between feeds, just in case we’re hammering the API too hard.
At the end, each user-subscription pair is written to a CSV file, one pair per line. Total = 17 468 lines, some of which look like this:
neilfws,the-life-scientists neilfws,aroy neilfws,abhishektiwari neilfws,adamk neilfws,adrianh neilfws,ajc ....
2. Using igraph in R
Igraph can be used in C, Python, Ruby or R – let’s go with R. Reading in the file and converting to an igraph object is straightforward. I’m assuming that the subscriptions graph is directed, in that you subscribing to me has direction you –> me and vice versa, me –> you.
library(igraph) tls <- read.csv("network.csv", header = F) g <- graph.data.frame(tls, directed = T) ecount(g) [1] 17468 vcount(g) [1] 1129
Voilà – a graph with 1 129 vertices (nodes) and 17 468 edges (connections). What can we do with that? Well, there are a few simple summary metrics, such as graph diameter:
# network diameter diameter(g) [1] 7 # show the farthest nodes farthest.nodes(g) [1] 133 928 7
That tells us that the average minimum distance between a pair of vertices in the TLS network is 7 people – no surprise there.
However, the real meat of igraph comes with methods that describe edges, vertices, subgraphs, communities and so on. First example: the largest “clique”. A clique is a maximally-connected subgraph in which every vertex connects to every other vertex.
lc <- largest.cliques(g) lc [[1]] [1] 87 111 117 120 139 260 293 326 381 484 500 549 663 681 705 [16] 725 738 753 770 775 794 832 834 847 885 962 1073 1120 1128
I know that you want the identity of these individuals, so here’s a quick and dirty plot.
# create a new graph of the largest clique V(g)$label <- V(g)$name g.lc <- subgraph(g, lc[[1]]) png(filename = "lc.png") plot(g.lc, layout=layout.fruchterman.reingold, vertex.color="gray60", vertex.size = 0, edge.arrow.size = 0.5, edge.color = "gray80") dev.off()
The result is the left figure, below; click on the image for a larger version.
Adding properties to vertices and edges is easy, using the V or E functions, respectively. You can add whatever you like: for example, V(g)$foo <- bar, provided that bar is a vector of length equal to number of vertices. Here’s an example in which we try to find communities in the TLS network using a model from statistical mechanics called spin-glass [1, 2], label the vertices and change vertex colour and size:
sgc <- spinglass.community(g) V(g)$membership <- sgc$membership # found 4 communities 0, 1, 2, 3 V(g) [ membership == 0 ]$color <- "cyan" V(g) [ membership == 1 ]$color <- "green" V(g) [ membership == 2 ]$color <- "blue" V(g) [ membership == 3 ]$color <- "red" V(g)$size <- 4 V(g) [ name == "the-life-scientists" ]$size <- 20 > png(filename = "tls.png", height = 800, width = 800) > plot(g, layout=layout.fruchterman.reingold, vertex.color=V(g)$color, vertex.size = V(g)$size, vertex.label = NA, edge.arrow.size = 0.5) > dev.off()
Result at far-right.
The large blue circle in the centre is The Life Scientists group. One community, the blue circles around the edge, is composed of people who subscribe to the group but do not subscribe to many (or any) members of the group. As for the red, cyan and green communities – more investigation is required. I can only tell you that I belong to the “red” community, where I also see many names that I would recognise as (bio)informaticians, computational biologists and active researchers in biology.
I still have plenty to learn about both igraph and network analysis; hopefully, this is a useful initial exploration.
References
1. J. Reichardt and S. Bornholdt (2006)
Statistical Mechanics of Community Detection
Phys. Rev. E (74), 016110
2. M. E. J. Newman and M. Girvan (2004)
Finding and evaluating community structure in networks
Phys. Rev. E (69), 026113
Filed under: programming, R, research diary, ruby, statistics Tagged: graph theory, igraph, networks
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.