Nick Stokes Distance code, now with Big Memory

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I was struggling with getting a big memory version of the distance matrix to work fast. Nick and other readers had some suggestions and after puttering around with Nicks code I’ve adapted it to big memory and not impacted the run time very much. For comparison writing a 10K by 10K  matrix was taking  minutes and I projected about 24 hours for berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of pre computing sin() and cos() for latitudes and longitudes and he developed an approach that allows us to compute these values once and then solve a distance equation using super fast matrix maths. On top of that he came up with a “block” approach. At first this looks like it will be slower because we calculate the entire matrix as oppsoed to just the upper half.  In his approach Nick wrote out files for every block. In my approach I write blocks to a filebacked big matrix. Thats a bit slower but I get one file that I can then random access. Here’s the code

createDistanceMatrix <- function(Inventory, blockSize = 20, Directory= getwd(), filename
                                                                                    = “distance.bin”){
# based on code by Nick Stokes

            require(bigmemory)
            options(bigmemory.allow.dimnames=TRUE)
           if(!grepl(“\\.bin”,filename))stop(“please use a bin extension”)
           columnsPresent <- intersect(colnames(Inventory),c(“Id”,”Lon”,”Lat”))
           if(length(columnsPresent) != 3) stop(“missing the correct column names”)
           descName <- sub(“bin”,”desc”,filename)
           if(class(Inventory) == “matrix”) Inventory <- as.data.frame(Inventory)

          L <- cbind(Inventory$Lat,Inventory$Lon)*(pi/180)
          s <- cbind(cos(L),sin(L))
        # z is the array of 3D coords of stations. All unit vecs
         z <- cbind(s[,1]*s[,2],s[,1]*s[,4],s[,3])

             z2 <- z/sqrt(2) # Better to mult now than after outer prod
            D <- 6371*2 #Diam of Earth

          n <- nrow(L)
          if(n <= blockSize)stop(“BlockSize must be less than rows in Inventory”)
         blocks <- n %/% blockSize
         if((n %% blockSize) > 0)blocks <- blocks + 1
          dex <- 1:blockSize

              BM <- filebacked.big.matrix(nrow = n , ncol = n,
                                                                                dimnames = list(as.list(Inventory$Id),NULL),
                                                                                init = NA,
                                                                               backingpath = Directory,
                                                                               backingfile = filename,
                                                                              descriptorfile = descName,
                                                                              type = “double”)
                              startTime <- Sys.time()

                              for(i in 1:blocks){

                                        p <- dex + (i-1)*10
                                        p <- p[p<= n]
                                         for(j in 1:blocks){
                                             q <- dex +(j-1)*10
                                             q <- q[q<=n]
                                             x <- 0.5000000001-z2[p,]%*%t(z2[q,])

                                            x <- D * asin(sqrt(x))
                                           if(identical(p,q))diag(x)<-0
                                          BM[p,q]<- x

                           }
             }

print(Sys.time()-startTime)
return(BM)
}

That takes 46 seconds.  orders of magintude improvement


To leave a comment for the author, please follow the link and comment on their blog: Steven Mosher's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)