Facebook-class social network analysis with R and Hadoop

May 25, 2012

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In computing, social networks are traditionally represented as graphs: a collection of nodes (people), pairs of which may be connected by edges (friend relationships). Visually, a social network can then be represented like this:

[Figure: network graph]

Social network analysis often amounts to calculating statistics on a graph like this: the number of edges (friends) connected to a particular node (person), known as its degree, and the distribution of those degrees across the entire graph. When the graph consists of up to 10 billion elements (nodes and edges), such computations can be done on a single server with dedicated graph software like Neo4j. But bigger networks, like Facebook's social network, which is a graph with more than 60 billion elements, require a distributed solution.
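As a small illustration of these two statistics, here is a base R sketch that computes the degree of each node and the degree distribution from a toy edge list. The data and variable names are made up for illustration, and each edge is counted for both of its endpoints (an undirected graph), which is an assumption rather than something the post specifies.

```r
# Toy undirected edge list: each row is one friendship (hypothetical data).
edges <- data.frame(
  from = c("ann", "ann", "bob", "cat", "cat"),
  to   = c("bob", "cat", "cat", "dan", "eve"),
  stringsAsFactors = FALSE
)

# Degree: number of edges touching each node (both endpoints count).
degree <- table(c(edges$from, edges$to))

# Degree distribution: how many nodes have each degree.
degree_dist <- table(as.integer(degree))

print(degree)
print(degree_dist)
```

On a graph this small a single call to table is enough; the distributed approach described in this post matters only once the edge list no longer fits on one machine.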

[Figure: Facebook friendships]

Marko A. Rodriguez, a graph consultant with Aurelius, shows in a blog post how to use R and Hadoop (integrated with Revolution Analytics' RHadoop packages) to analyze Facebook-scale social networks. He first simulates a social network (shown at the top of this post) using R's igraph package, and then distributes the network across the Hadoop cluster with the to.dfs function (from the rmr package). He then uses the mapreduce function (also from the rmr package) to write a simple map-reduce algorithm in R that counts the number of edges associated with each node:

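library(rmr)  # provides mapreduce, keyval, to.dfs and from.dfs
# edge.list is the edge list previously written to HDFS with to.dfs.
# Map: emit (second vertex of each edge, 1); reduce: count the values per vertex.
degree.V <- mapreduce(edge.list,
    map=function(k,v) keyval(v[2],1),
    reduce=function(k,v) keyval(k,length(v)))
# Pull the result back from HDFS and inspect its first element.
from.dfs(degree.V)[[1]]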

From there, it's another simple map-reduce job to calculate the connectivity statistics for the entire network. For more details on how Marko used RHadoop to perform this analysis, see the entire blog post linked below.
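To make the two-step idea concrete without a Hadoop cluster, here is a rough in-memory sketch of both jobs in base R. The local_mapreduce helper is hypothetical and is not part of rmr; it only mimics the map, group-by-key, reduce pattern. The edge data is made up, and the second job shown here (counting how many nodes have each degree) is one plausible reading of "connectivity statistics", not necessarily the exact job in Marko's post.

```r
# Minimal in-memory stand-in for a map-reduce run (hypothetical helper,
# not the rmr API): apply `map` to every record, group the emitted
# key/value pairs by key, then apply `reduce` to each group.
local_mapreduce <- function(records, map, reduce) {
  pairs  <- lapply(records, map)
  keys   <- vapply(pairs, function(p) as.character(p$key), character(1))
  values <- lapply(pairs, function(p) p$val)
  groups <- split(values, keys)
  mapply(function(k, v) reduce(k, v), names(groups), groups,
         SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

# Toy edge list: each record is a pair of vertices (made-up data).
edge_list <- list(c("a", "b"), c("a", "c"), c("b", "c"),
                  c("d", "c"), c("a", "d"))

# Job 1: count edges per node, keyed on the second vertex of each edge,
# mirroring the v[2] key in the snippet above. Nodes that never appear
# as a second vertex (here "a") produce no output record.
degree_v <- local_mapreduce(
  edge_list,
  map    = function(e) list(key = e[2], val = 1),
  reduce = function(k, v) list(key = k, val = length(v))
)

# Job 2: count how many nodes have each degree, using job 1's output
# as the input records.
degree_dist <- local_mapreduce(
  degree_v,
  map    = function(kv) list(key = kv$val, val = 1),
  reduce = function(k, v) list(key = as.integer(k), val = length(v))
)

str(degree_dist)
```

The point of the sketch is the chaining: the output of the first job has the same key/value shape as its input, so it can feed straight into the second job.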

Aurelius blog: 
