Experiments with igraph

April 21, 2010
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Networks – social and biological – are all the rage, just now. Indeed, a recent entry at Duncan’s QOTD described the “hairball” network representation as the dominant cultural icon in molecular biology.

I’ve not had occasion to explore networks “professionally”, but have always been fascinated by both networks and the tools used to analyse them. My grasp of graph theory, the mathematics behind networks, is more or less summarised by this Wikipedia page. I’ve also been exploring the igraph library and thought I’d share a few of my “experiments with igraph”. As I say, I’m learning myself as I go along, so none of this should be taken as professional advice.

Let’s start with my favourite network – FriendFeed of course – and ask a few questions about everyone’s favourite group, The Life Scientists (TLS).

My initial thought was: to what extent is TLS a community? In other words, do people who subscribe to TLS also tend to subscribe to one another?

1. Fetching the raw data from FriendFeed
To begin, I used the FriendFeed API to fetch subscribers to TLS and then subscriptions within TLS subscribers. It’s becoming confusing already – roll the Ruby:

#!/usr/bin/ruby

require "open-uri"
require "json/pure"

# get TLS feed info
subscribers = []
# auth = [USER, REMOTE_KEY]
tls  = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/the-life-scientists", :http_basic_authentication => auth).read)
# only public feeds
tls['subscribers'].each do |sub|
  if sub['private'].nil?
    subscribers << sub['id']
  end
end

# user subscriptions within TLS
puts "There are #{subscribers.length} public feeds."
network = []
subscribers.each do |id|
  puts "Fetching #{id}..."
  feed = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/#{id}").read)
  network << [id, "the-life-scientists"]
  feed['subscriptions'].each do |s|
    if subscribers.include?(s['id'])
      network << [id, s['id']]
    end
  end
  sleep 2
end

# write out pairs
File.open("network.csv", "w") do |f|
  network.each do |pair|
    f.puts(pair.join(","))
  end
end

Lines 6-15 fetch information about the TLS feed. A few words about privacy. If you uncomment line 8 and substitute your FriendFeed user name and remote key, line 9 will make an authenticated request which returns information for users with private feeds, if you are subscribed to them. However, those people may not want their data exposed to non-subscribers. So, in lines 11-15, we take user names and only store them in an array if their feed is public. The upshot of all that is that of 1 323 users that I can access, we use only 1 128 with public feeds.

Lines 17-30 process each TLS subscriber and their subscriptions, storing the results in the network array. Only subscriptions to users who are also subscribed to TLS are stored (lines 24-28), plus the subscription to TLS (line 23). There’s also a small pause between feeds, just in case we’re hammering the API too hard.

At the end, each user-subscription pair is written to a CSV file, one pair per line. Total = 17 468 lines, some of which look like this:

neilfws,the-life-scientists
neilfws,aroy
neilfws,abhishektiwari
neilfws,adamk
neilfws,adrianh
neilfws,ajc
....

2. Using igraph in R
Igraph can be used in C, Python, Ruby or R – let’s go with R. Reading in the file and converting to an igraph object is straightforward. I’m assuming that the subscriptions graph is directed, in that you subscribing to me has direction you –> me and vice versa, me –> you.

library(igraph)
tls <- read.csv("network.csv", header = F)
g <- graph.data.frame(tls, directed = T)
ecount(g)
[1] 17468
vcount(g)
[1] 1129

Voilà – a graph with 1 129 vertices (nodes) and 17 468 edges (connections). What can we do with that? Well, there are a few simple summary metrics, such as graph diameter:

# network diameter
diameter(g)
[1] 7
# show the farthest nodes
farthest.nodes(g)
[1] 133 928   7

That tells us that the average minimum distance between a pair of vertices in the TLS network is 7 people – no surprise there.

However, the real meat of igraph comes with methods that describe edges, vertices, subgraphs, communities and so on. First example: the largest “clique”. A clique is a maximally-connected subgraph in which every vertex connects to every other vertex.

lc <- largest.cliques(g)
lc
[[1]]
 [1]   87  111  117  120  139  260  293  326  381  484  500  549  663  681  705
[16]  725  738  753  770  775  794  832  834  847  885  962 1073 1120 1128

I know that you want the identity of these individuals, so here’s a quick and dirty plot.

# create a new graph of the largest clique
V(g)$label <- V(g)$name
g.lc <- subgraph(g, lc[[1]])
png(filename = "lc.png")
plot(g.lc, layout=layout.fruchterman.reingold, vertex.color="gray60", vertex.size = 0, edge.arrow.size = 0.5, edge.color = "gray80")
dev.off()

The result is the left figure, below; click on the image for a larger version.
Adding properties to vertices and edges is easy, using the V or E functions, respectively. You can add whatever you like: for example, V(g)$foo <- bar, provided that bar is a vector of length equal to number of vertices. Here’s an example in which we try to find communities in the TLS network using a model from statistical mechanics called spin-glass [1, 2], label the vertices and change vertex colour and size:

sgc <- spinglass.community(g)
V(g)$membership <- sgc$membership
# found 4 communities 0, 1, 2, 3
V(g) [ membership == 0 ]$color <- "cyan"
V(g) [ membership == 1 ]$color <- "green"
V(g) [ membership == 2 ]$color <- "blue"
V(g) [ membership == 3 ]$color <- "red"
V(g)$size <- 4
V(g) [ name == "the-life-scientists" ]$size <- 20
> png(filename = "tls.png", height = 800, width = 800)
> plot(g, layout=layout.fruchterman.reingold, vertex.color=V(g)$color, vertex.size = V(g)$size, vertex.label = NA, edge.arrow.size = 0.5)
> dev.off()

Result at far-right.

tls

Community detection in the TLS network


lcg

TLS largest clique subgraph


The large blue circle in the centre is The Life Scientists group. One community, the blue circles around the edge, is composed of people who subscribe to the group but do not subscribe to many (or any) members of the group. As for the red, cyan and green communities – more investigation is required. I can only tell you that I belong to the “red” community, where I also see many names that I would recognise as (bio)informaticians, computational biologists and active researchers in biology.

I still have plenty to learn about both igraph and network analysis; hopefully, this is a useful initial exploration.

References
1. J. Reichardt and S. Bornholdt (2006)
Statistical Mechanics of Community Detection
Phys. Rev. E (74), 016110

2. M. E. J. Newman and M. Girvan (2004)
Finding and evaluating community structure in networks
Phys. Rev. E (69), 026113


Filed under: programming, R, research diary, ruby, statistics Tagged: graph theory, igraph, networks

To leave a comment for the author, please follow the link and comment on his blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , , , ,

Comments are closed.