I played with Gephi for several hours to learn it (it’s kewl) and to impress my daughter (her dad is no fireman who saves people, but he can do nifty things with a computer 😉). I was able to discover interesting facts from the data, including:
– Average degree: the average number of edges (connections) attached to a node
– Network diameter: the longest of all the shortest paths between any two nodes in the graph
– Average path length: how many steps, on average, it takes to reach any node from any other node in the graph
– Degree power law: how closely the distribution of connections follows a power law. In a power-law network the connections are distributed very unequally, which means that a few nodes are very well connected and most have only a few connections
– Average clustering coefficient: shows how well the nodes are embedded in their neighborhood, i.e. how often a node’s neighbors are also connected to each other. A high value combined with a short average path length points to a “small world” effect within the network
– Modularity: the higher this parameter, the more clearly defined the communities within the network are. A result of 0.4 or more is usually considered meaningful
– Betweenness centrality: calculated for each node, it shows how often the node appears on the shortest path between any two nodes in the network. The higher this parameter, the more influential the node is. The nodes with high betweenness centrality are not necessarily the ones with the most connections and don’t have to be the most “popular” ones.
Here’s a video of Gephi features (older version)
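To make a few of these definitions concrete, here is a minimal Python sketch that computes them on a made-up seven-node network. It is not what Gephi runs internally: the graph, the node names and the helper functions are all assumptions for illustration, and the brute-force approach only suits tiny, connected, undirected graphs.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network: two tight triangles joined through a bridge node "g".
EDGES = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "g"), ("g", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]


def build_adjacency(edges):
    """Turn an undirected edge list into {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_counts(adj, source):
    """Return (distance, number of shortest paths) from source to every node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:              # first time we reach this neighbor
                dist[nb] = dist[node] + 1
                sigma[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:  # node lies on a shortest path to nb
                sigma[nb] += sigma[node]
    return dist, sigma


def average_degree(adj):
    return sum(len(nbs) for nbs in adj.values()) / len(adj)


def path_stats(adj):
    """Return (diameter, average path length); assumes a connected graph."""
    lengths = [d for s in adj
               for t, d in bfs_counts(adj, s)[0].items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)


def average_clustering(adj):
    """Mean share of each node's neighbor pairs that are themselves linked."""
    total = 0.0
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def betweenness(adj):
    """Unnormalized betweenness: each unordered pair of nodes counted once."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = dict.fromkeys(adj, 0.0)
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        for v in adj:
            # v sits on a shortest s-t path when the two legs add up exactly
            if v not in (s, t) and dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    diameter, avg_path = path_stats(adj)
    print("average degree:", round(average_degree(adj), 3))
    print("diameter:", diameter)
    print("average path length:", round(avg_path, 3))
    print("average clustering:", round(average_clustering(adj), 3))
    print("betweenness:", betweenness(adj))
```

In this toy network the bridge node g has only two connections, yet it gets the highest betweenness score, because every shortest path between the two triangles runs through it. That is the “connector” effect described in the last bullet.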
I was also able to discover interesting patterns in the data, like the communities that emerge, the popular people and the connectors. The graph below shows nodes with different colors (communities) and sizes (popularity), and how most connections between the two communities flow through a few nodes (connectors).
Lessons learned so far:
– I quickly realized that you need a good machine to use Gephi effectively (a good video card, enough memory and a fast CPU)
– There is a great, active community behind Gephi, so expect frequent releases that resolve critical issues and add new features. I’m waiting for this month’s release, which fixes some issues I’ve faced 🙂
If you know of other tools to visually explore graphs, please leave a comment.
Which network data did I use? There’s a wealth of network data available today, including social networking sites, phone logs, work history, chat logs, email logs etc. I decided to create test data for email traffic to test the hypothesis that “who we send emails to or receive emails from” is a good indicator of our social network. My test data has 50,000+ nodes and 150,000+ edges.
I used R to create the data for the graph. R is good at handling millions of rows of data and is powerful for data manipulation (cleaning, creating edge lists, adjacency matrices etc.). I used it to convert the raw email traffic test data into graph formats such as an edge list and an adjacency matrix. It took me a couple of hours to write the code that creates the data set to feed into Gephi. Using R to solve a real need has been a good learning experience so far.
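The edge-list step can be sketched roughly as follows. This is a Python illustration, not the R code described above: the sample addresses, the function names and the record layout are all made up, and the Source/Target/Weight column headers are an assumption about what Gephi’s spreadsheet import expects.

```python
import csv
import io
from collections import Counter

# Hypothetical raw email records: (sender, list of recipients).
RAW_EMAILS = [
    ("alice@example.com", ["bob@example.com", "carol@example.com"]),
    ("bob@example.com", ["alice@example.com"]),
    ("alice@example.com", ["bob@example.com"]),
    ("Carol@Example.com ", ["dave@example.com"]),  # messy input on purpose
]


def build_edge_list(emails):
    """Collapse raw emails into directed (source, target, weight) edges."""
    weights = Counter()
    for sender, recipients in emails:
        src = sender.strip().lower()      # basic cleaning of the address
        for recipient in recipients:
            dst = recipient.strip().lower()
            if src != dst:                # ignore mail sent to oneself
                weights[(src, dst)] += 1  # weight = number of emails sent
    return sorted((s, t, w) for (s, t), w in weights.items())


def to_csv(edges):
    """Render the edges as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(build_edge_list(RAW_EMAILS)))
```

Counting emails per sender/recipient pair keeps the file small, since each pair becomes one weighted edge instead of one row per message.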
Leave a comment if you’re interested in seeing the code.