The plot is a heatmap of the adjacency matrix of a weighted directed graph, where each weight is the influence of one product on another. The matrix was reordered using the Infomap community detection algorithm, which was recently added to the igraph package for R. The importance score of each variable for every other variable was calculated with the randomForest package and also with the party package. The permutation test used by the regression trees in the party package is more robust than the approach in the randomForest package when dealing with highly correlated variables. The computation was done in parallel on a cluster at Amazon Web Services.
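To make the construction of the matrix more concrete, here is a minimal sketch in Python rather than R. It is not the code behind the plot: an ordinary least squares fit stands in for the random forest and party models, the function names (`fit_linear`, `permutation_importance`, `influence_matrix`) are made up for illustration, the data is synthetic, and reading entry `[i][j]` as the influence of product i on product j is an assumption.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def fit_linear(rows, target):
    """Toy stand-in model: least squares with intercept via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented system [X^T X | X^T y]
    aug = [
        [sum(d[a] * d[b] for d in design) for b in range(k)]
        + [sum(d[a] * y for d, y in zip(design, target))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    coef = [aug[i][k] / aug[i][i] for i in range(k)]
    return lambda r: coef[0] + sum(c * v for c, v in zip(coef[1:], r))


def permutation_importance(predict, rows, target, column, rng):
    """Increase in prediction error after shuffling one predictor column."""
    baseline = mse(target, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(target, [predict(r) for r in permuted]) - baseline


def influence_matrix(data, rng):
    """Entry [i][j] is the importance of column i when predicting column j (assumed convention)."""
    n = len(data[0])
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        rows = [[row[i] for i in others] for row in data]
        target = [row[j] for row in data]
        predict = fit_linear(rows, target)
        for pos, i in enumerate(others):
            score = permutation_importance(predict, rows, target, pos, rng)
            matrix[i][j] = max(0.0, score)  # slightly negative scores are noise; clip them
    return matrix


# Synthetic data: product 1 depends on product 0, product 2 is independent.
rng = random.Random(0)
base = [rng.gauss(0, 1) for _ in range(200)]
data = [[x, 2 * x + rng.gauss(0, 1), rng.gauss(0, 1)] for x in base]
influence = influence_matrix(data, rng)
```

In this toy data set the entry for product 0 acting on product 1 comes out far larger than the entry for the unrelated product 2 acting on product 1. The two directions between a pair of products need not match, which is what makes the resulting graph directed.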
R is a very popular open source data analysis tool. It can connect to most data sources and even offers integration with Hadoop. It supports parallel processing and has the biggest set of libraries for machine learning. According to CIO.com, R is the #2 big data open-source software to watch. It is also supported by and compatible with IBM and SAS systems. R is even accepted by the FDA for use in clinical trials and is the weapon of choice for many of the most elite data scientists.
The community detection algorithm groups entities into natural islands whose members influence each other. In this particular incarnation of the analysis, the matrix was made to study product substitution effects and look for predictors. Sadly, I had to omit the labels because they contain confidential information. Colors range from blue to purple, where purple stands for a big influence, and asymmetry across the diagonal is a measure of importance. The diagonal is white and should be ignored.
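The reordering step can be sketched as follows, again in Python and under stated assumptions. The weights and the community labels below are invented for illustration. In the actual analysis the labels come from Infomap, which this sketch does not implement. The `asymmetry` helper is one possible reading of "asymmetry as a measure of importance", not necessarily the one used for the plot.

```python
def reorder_by_community(matrix, membership):
    """Permute rows and columns so members of the same community sit next to each other."""
    order = sorted(range(len(matrix)), key=lambda i: (membership[i], i))
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


def asymmetry(matrix):
    """Entry [i][j] is matrix[i][j] - matrix[j][i]: positive when i influences j more than j influences i."""
    n = len(matrix)
    return [[matrix[i][j] - matrix[j][i] for j in range(n)] for i in range(n)]


# Invented weights for four unnamed products; [i][j] is the influence of i on j.
weights = [
    [0.0, 0.1, 0.8, 0.0],
    [0.2, 0.0, 0.0, 0.7],
    [0.3, 0.0, 0.0, 0.1],
    [0.0, 0.6, 0.0, 0.0],
]
membership = [0, 1, 0, 1]  # hypothetical community label per product

order, blocks = reorder_by_community(weights, membership)
skew = asymmetry(weights)
```

After reordering, the strong weights collect in square blocks along the diagonal, which is what makes the islands visible in the heatmap.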
This analysis could be used to optimize the interaction of machine parts, study clout in social networks or look for substitute goods. A similar application, but using a graph representation of the network based on Wikipedia data, can be found here.