**Christophe Ladroue » R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I had to partition a set of SNPs into a training set and a test set. Making a random split would not do: both sets would likely contain very similar SNPs due to linkage disequilibrium (LD), making them non-independent. So what I really needed was a partition that minimises the LD between the two sets.

The problem is equivalent to bisecting a graph (with nodes being the SNPs, linked with LD) and it’s NP hard so we shouldn’t expect to always find the optimal solution. There are very good heuristic algorithms for it though, and I decided to implement the Kernighan/Lin algorithm, which gives a good solution in . See the code at the end of this post.

A first, direct implementation using loops convinced me that it was working well, but as I needed to process a 2000×2000 matrix, efficiency rapidly became an issue: bisecting a 1000×1000 matrix was taking 6 min and the time taken was going up at an exponential rate, giving an estimated time of completion of 1 hour for a 2000×2000 matrix. It turned out that the code was very amenable to vectorization and after spending some time putting that in place, I gained almost 2 orders of magnitudes: it takes 43 seconds instead of 55 minutes on test data. In other words, it is actually usable.

The results are really good as well. Here is an example on a non-separable graph. The partition is such that the terms off the (block) diagonal are small. This is much better than a random partition.

N<-100 topMatrix<-matrix(round(runif(N*N)*50),nrow=N,ncol=N) bottomMatrix<-matrix(round(runif(N*N)*50),nrow=N,ncol=N) smallValues<-matrix(runif(N*N)*1,nrow=N,ncol=N) weightMatrix<-rbind( cbind(topMatrix+t(topMatrix),smallValues), cbind(t(smallValues),bottomMatrix+t(bottomMatrix))) partition<-approximateBisection(weightMatrix)

(the diagonal goes from the bottom left to the top right)

And here is the code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
## Approximate bisection # returns a bisection of the graph that minimizes the cost using Kernighan/Lin Algorithm # http://www.eecs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html#link_4.2 # partition<-approximateBisection(weightMatrix) # weightMatrix is symmetric matrix of size 2Nx2N made of non-negative values. # partition is a list of two vectors of N indices. # C.Ladroue approximateBisection<-function(weightMatrix,mode="matrix",minimumGain=1e-5){ # minimumGain<-1e-5 # minimum value for gain, setting it to 0 might lead to infinite loop due to numerical inaccuracy N<-dim(weightMatrix)[1] # number of elements m<-N/2 # start off with a random partition A<-sample(1:N,N/2,replace=FALSE) B<-(1:N)[-A] maxGain<-Inf while(maxGain>minimumGain){ DA<-rowSums(weightMatrix[A,B])-rowSums(weightMatrix[A,A])+diag(weightMatrix[A,A]) DB<-rowSums(weightMatrix[B,A])-rowSums(weightMatrix[B,B])+diag(weightMatrix[B,B]) unmarkedA<-1:m unmarkedB<-1:m markedA<-rep(0,m) markedB<-rep(0,m) gains<-rep(0,m) for(k in 1:m){ # find best pair from remainder # with 2 loops, slow but easy on memory if(mode=='2loops'){ bestGain<--Inf besti<-0 bestj<-0 for(i in unmarkedA) for(j in unmarkedB){ gain<-DA[i]+DB[j]-2*weightMatrix[A[i],B[j]] if(gain>bestGain) {bestGain<-gain; besti<-i;bestj<-j} } # mark the best pair unmarkedA<-unmarkedA[-which(unmarkedA==besti)] unmarkedB<-unmarkedB[-which(unmarkedB==bestj)] markedA[k]<-besti markedB[k]<-bestj } # with one matrix, much faster but builds a matrix as large as weightMatrix if(mode=='matrix'){ dimension<-m+1-k fasterGain<-matrix(DA[unmarkedA],nrow=dimension,ncol=dimension,byrow=FALSE)+ matrix(DB[unmarkedB],nrow=dimension,ncol=dimension,byrow=TRUE)- 2*weightMatrix[A[unmarkedA],B[unmarkedB]] # mark the best pair best<-arrayInd(which.max(fasterGain),.dim=c(dimension,dimension)) besti<-unmarkedA[best[1]] bestj<-unmarkedB[best[2]] bestGain<-fasterGain[best] markedA[k]<-unmarkedA[best[1]] markedB[k]<-unmarkedB[best[2]] unmarkedA<-unmarkedA[-best[1]] unmarkedB<-unmarkedB[-best[2]] } # record gain gains[k]<-bestGain # update D for unmarked indices DA[unmarkedA]<-DA[unmarkedA]+2*weightMatrix[A[unmarkedA],A[besti]]-2*weightMatrix[A[unmarkedA],B[bestj]] DB[unmarkedB]<-DB[unmarkedB]+2*weightMatrix[B[unmarkedB],B[bestj]]-2*weightMatrix[B[unmarkedB],A[besti]] } gains<-cumsum(gains) bestPartition<-which.max(gains) maxGain<-gains[bestPartition] if(maxGain>minimumGain){ # swap best pairs A1<-c(A[-markedA[1:bestPartition]],B[markedB[1:bestPartition]]) B1<-c(B[-markedB[1:bestPartition]],A[markedA[1:bestPartition]]) A<-A1 B<-B1 } } list(A,B) } |

**leave a comment**for the author, please follow the link and comment on their blog:

**Christophe Ladroue » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.