Community Detection in Networks with R

[This article was first published on Algoritmica: een data blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I mainly post this visualization because I think it’s pretty. It reminds a little of the work by the famous Dutch painter Mondrian. The complete matrix can be found here.

Adjacency Matrix Directed Graph

The plot is a heatmap of an adjacency matrix generated by a weighted directed graph, where the weight is the influence of one product on another. The matrix was reordered using the infoMAP community detection algorithm which just got implemented in the most recent update of the igraph package for R. The variable importance score for each variable on every other variable was calculated by using the randomforest package and also the party package. The permutation test used in the regression tree grown by the party package is more robust than the one used in the randomforest package when dealing with highly correlated variables. The computation was done en parallel on a cluster at Amazon Web Services. 

R is a very popular open source data analysis tool. It can connect to any data source and even offers integration with Hadoop. It supports Parallel processing and a has the biggest set of libraries for machine learning. According to CIO.com, R is the #2 big data open-source software to watch. It’s also supported and compatible with IBM and SAS systems. R is even approved by the FDA in clinical trials and is the favorite weapon of choice by many of the most elite data scientists. 

The community detection algorithm clusters entities together that form natural islands of entities that influence eachother. In this particular incarnation of the analysis, the matrix was made to study product substitution effects and look for predictors. Sadly, I had to omit the labels because they contain non-disclosable information. Colors range from blue to purple, where purple stands for a big influence, and non-symmetry is a measure of importance. The diagonal is white but doesn’t count.

This analysis could be used to optimize the interaction of machine parts, study klout in social networks or look for substitution goods. A similar application, but using a graph representation of the network based on Wikipedia data, can be found here.

To leave a comment for the author, please follow the link and comment on their blog: Algoritmica: een data blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)