Inferring the community structure of networks

May 21, 2012
By

(This article was first published on sieste » R, and kindly contributed to R-bloggers)

I continue my little excursion into network science. In the last post, I gave a little introduction to simulating and visualizing undirected networks with community structure in R. In this post I want to explore a method to infer the community structure of a network from its adjacency matrix. That is, given that I know which nodes are connected to each other, I want to infer which nodes belong to the same community. I will focus on the case of two communities, but the method can be extended to three or more.

This post requires basic knowledge of matrix calculus, in particular eigen-decomposition. Understanding how the k-means clustering algorithm works is also useful. The R-code to reproduce all the figures is given at the bottom of the post.

The Laplacian of the adjacency matrix

Given N nodes of the network and their links, the adjacency matrix A is the N*N matrix whose element A_ij = 1 if node i is connected to node j, and A_ij = 0 otherwise. In an undirected network the adjacency matrix is symmetric, and since no “self-links” are allowed A_ii = 0 for all i.

In a network with community structure the adjacency matrix is block-diagonal, if the first n_1 nodes belong to community 1, the next n_2 links belong to community 2, and so on until n_K. Here is an example with K=2:

(If the labels of the nodes are not nicely ordered as above, the block-diagonal structure might not be immediately visible; I will come to this point later.)

The Laplacian L of the network is the N*N matrix that has L_ij = -1 if node i and j are connected and L_ij = 0 otherwise. The i-th diagonal element L_ii is equal to the total number of links of node i.

The Laplacian has a number of useful properties:

  • if the adjacency matrix is block-diagonal, the Laplacian is block-diagonal as well
  • all row sums and column sums of the Laplacian are zero
  • since the rows and columns sum to zero, one of the eigenvectors of the Laplacian has constant elements v_i = (c,c,c,…,c) and the corresponding eigenvalue e_i = 0
  • the block-diagonal structure implies that the Laplacian has K eigenvectors with corresponding eigenvalue zero, where K is the number of blocks
  • the elements of these eigenvectors are constant inside each block.
  • Look at the following block-diagonal Laplacian matrix with N=10, K=2 (5 nodes in each community) and its eigen-decomposition:

    The eigenvalue spectrum shows that the last two eigenvalues are zero. The corresponding eigenvectors are shown as well (the blue one is shifted upwards by one unit). Their elements are constant inside the blocks.

    In the following, we use the eigenvalue spectrum to infer the number of communities, and we use the eigenvectors to infer which nodes belong to which community.

    Inferring the number of communities

    As shown above, the number of blocks of the Laplacian can be inferred from the number of vanishing eigenvalues if the block-diagonal structure is perfect. If there are links between nodes in different communities, the Laplacian is only approximately block-diagonal since some off-diagonal elements are nonzero. Therefore, last few eigenvalues will not vanish exactly, but they will be close to zero. The following Laplacian corresponds to a the same network as above, but where the link between nodes 1 and 2 has been deleted, and a link between nodes 3 and 9 has been added.

    The number of small eigenvalues, and thus the number of communties can be chosen by “eyeballing” the eigenvalue spectrum for where the cutoff occurs. Or one can automate the selection of K.

    I found that the k-means clustering algorithm (R implementation stats::kmeans) can be used to determine the cutoff. The idea is that the eigenvalue spectrum is quite smooth until a sharp cutoff appears before the almost-vanishing eigenvalues. This cutoff divides the eigenvalues into two clusters and we can make the k-means algorithm look for exactly two clusters. For the above eigenvalue spectrum, k-means correctly finds the size of the smaller cluster to be 2:

    >> ei <- eigen(laplacian)
    >> kmeans(ei$values,centers=range(ei$values))
    ...
     Clustering vector:
     [1]  2 2 2 2 2 2 2 2 1 1 
    ...
    

    Inferring the community structure

    I shuffled the labels of the nodes in the figure above but kept their links unchanged. The new Laplacian and its eigen-decomposition now looks as follows:

    Do you still see the community structure in the Laplacian? I don’t. But one of the eigenvectors can see it.

    There are still two very small eigenvalues, indicating two communities, which is the correct number.

    The second to last eigenvector (blue) indicates the community structure. Running the k-means clustering algorithm on the elements of this vector yields:

    >> kmeans(ei$vectors[,9],centers=range(ei$vectors[,9]))
    ...
     Clustering vector:
     [1]   1 1 2 1 2 1 2 2 2 1
    ...
    

    For comparison, the original node labels were

    >> shuf.ind
     [1]  6  7  1 10  4  8  3  2  5  9
    

    That is nodes 1,2,4,6 and 10 (which correspond to nodes 6-10 under the original labeling) are put into one cluster and the others respectively. By applying the k-means algorithm to the eigenvector elements, we reconstructed the community structure perfectly. I found this method to work also for larger values of N. If K increases, you will have to apply the above procedure iteratively. If more links exist between nodes in different communities, misclassification or overestimation of the number of communities occurs.

    The goal of this post was to illustrate a naive method for inferring unknown network properties, namely the community structure. All I know about the network Laplacian and its eigen-decomposition is based on reference [1]. No thourough testing of the above method was performed, and I have no detailed results about misclassification behavior or under/over-estimation of the number of clusters. Please check [1] and references therein if you are curious about more advanced methods for community detection.

    I find the above method quite visual and understandable. In case the k-means algorithm returns completely unreasonable results, the number of nodes and the separation of the eigenvector elements could as well be performed by hand if the number of communities is small. I consider this a useful property which more advance methods might not have.

    If you find errors or something that remained unclear, feel free to leave a comment. This is it, thanks for watching.

    References

    [1] Newman (2004) Detecting community structure in networks, Eur. Phys. J. B 38, DOI:10.1140/epjb/e2004-00124-y

    The R code to produce the above figures:

    
    graphics.off()
    JPG <- TRUE
    
    #################################
    # block-diagonal adjacency matrix
    # Figure 1
    #################################
    
    if (JPG) {
        jpeg("adjacency-mat.jpg",width=400,height=300,quality=100)
    } else {
        x11()
    }
    par(cex.lab=2,cex.main=2,cex.axis=2)
    #construct block diagonal adjacency matrix
    v1 <- c(rep(1,5),rep(0,5))
    v2 <- c(rep(0,5),rep(1,5))
    blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
    diag(blomat) <- 0
    layout(t(c(1,1,1,1,2)))
    #plot adj. matrix
    image(blomat[,10:1],asp=1,axes=F,col=rainbow(2), main="adjacency matrix")
    #colorbar
    par(mar=c(7,2,7,5))
    image(matrix(seq(2),ncol=2),axes=F,col=rainbow(2))
    axis(side=4,at=seq(0,1,length.out=2),labels=paste(c(0,1)), cex.axis=2)
    if (JPG) dev.off()
    
    
    ############################
    # perfect two block laplacian + eigenspectrum
    # Figure 2
    ############################
    #construct laplacian
    laymat <- rbind(c(1,1,1,1,1,2),c(3,3,3,4,4,4))
    v1 <- c(rep(-1,5),rep(0,5))
    v2 <- c(rep(0,5),rep(-1,5))
    blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
    diag(blomat) <- 4
    if (JPG) {
        jpeg("laplace-perfect.jpg",width=500,height=500, quality=100)
    } else {
        x11()
    }
    par(cex.lab=2,cex.main=2,cex.axis=2)
    layout(laymat)
    #plot laplacian
    image(blomat[,10:1],asp=1,axes=F,col=rainbow(6), main="Laplacian")
    #colorbar
    par(mar=c(7,2,7,5))
    image(matrix(seq(3),ncol=3),axes=F,col=rainbow(6)[c(1,2,6)])
    axis(side=4,at=seq(0,1,length.out=3),labels=paste(c(-1,0,4)), cex.axis=2)
    par(mar=c(4,6,4,4))
    #eigendecomp.
    ei <- eigen(blomat)
    #plot eigenvalues
    plot(ei$values,main="eigenvalues")
    #plot eigenvectors
    plot(NULL,xlim=c(1,10),ylim=c(-1,2),main="last two eigenvectors", ylab=NA)
    lines(ei$vectors[,9]+1,type="o",col="blue")
    lines(ei$vectors[,10],type="o",col="orange")
    if (JPG) dev.off()
    
    ############################
    # block-diagonal laplacian with impurities
    # Figure 3
    ############################
    #construct the laplacian
    v1 <- c(rep(-1,5),rep(0,5))
    v2 <- c(rep(0,5),rep(-1,5))
    blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
    blomat[1,2] <- blomat[2,1]<- 0
    blomat[3,9] <- blomat[9,3]<- -1
    diag(blomat) <- 0
    diag(blomat) <- -colSums(blomat)
    if (JPG) {
        jpeg("laplace-imperfect.jpg",width=500,height=500, quality=100)
    } else {
        x11()
    }
    par(cex.lab=2,cex.main=2,cex.axis=2)
    layout(laymat)
    #plot the laplacian
    image(blomat[,10:1],asp=1,axes=F,col=rainbow(7), main="Laplacian")
    par(mar=c(7,2,7,5))
    #colorbar
    image(matrix(seq(7),ncol=7),axes=F,col=rainbow(7))
    axis(side=4,at=seq(0,1,length.out=7),labels=paste(seq(-1,5)), cex.axis=2)
    par(mar=c(4,6,4,4))
    #eigendecomposition and k-means clustering
    ei <- eigen(blomat)
    print(kmeans(ei$values,range(ei$values)))
    #plot eigenvalues
    plot(ei$values,main="eigenvalues")
    #plot eigenvectors
    plot(NULL,xlim=c(1,10),ylim=c(-1,2), main="last two eigenvectors", ylab=NA)
    lines(ei$vectors[,9]+1,type="o",col="blue")
    lines(ei$vectors[,10],type="o",col="orange")
    if (JPG) dev.off()
    
    
    ###############################
    #disordered block-diagonal laplacian with impurities
    # Figure 4
    ###############################
    #construct Laplacian
    v1 <- c(rep(-1,5),rep(0,5))
    v2 <- c(rep(0,5),rep(-1,5))
    blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
    blomat[1,2] <- blomat[2,1]<- 0
    blomat[3,9] <- blomat[9,3]<- -1
    shuf.ind <- c(6, 7, 1, 10, 4, 8, 3, 2, 5, 9)
    blomat <- blomat[shuf.ind, shuf.ind]
    diag(blomat) <- 0
    diag(blomat) <- -rowSums(blomat)
    
    if (JPG) {
        jpeg("laplace-disordered.jpg",width=500,height=500, quality=100)
    } else {
        x11()
    }
    par(cex.lab=2,cex.main=2,cex.axis=2)
    layout(laymat)
    #plot Laplacian
    image(blomat[,10:1],asp=1,axes=F,col=rainbow(7), main="Laplacian")
    par(mar=c(7,2,7,5))
    #plot colorbar
    image(matrix(seq(7),ncol=7),axes=F,col=rainbow(7))
    axis(side=4,at=seq(0,1,length.out=7),labels=paste(seq(-1,5)), cex.axis=2)
    par(mar=c(4,6,4,4))
    #eigendecomposition
    ei <- eigen(blomat)
    #plot eigenvalues
    plot(ei$values,main="eigenvalues")
    #plot eigenvectors
    plot(NULL,xlim=c(1,10),ylim=c(-1,2), main="last two eigenvectors", ylab=NA)
    lines(ei$vectors[,9]+1,type="o",col="blue")
    lines(ei$vectors[,10],type="o",col="orange")
    if (JPG) dev.off()
    
    

    To leave a comment for the author, please follow the link and comment on his blog: sieste » R.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.