Nick Stokes Distance code, now with Big Memory

April 12, 2012
By

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

In my last post I was struggling with getting a big memory version of the distance matrix to work fast. Nick and other readers had some suggestions and after puttering around with Nicks code I’ve adapted it to big memory and not impacted the run time very much. For comparison writing a 10K by 10K  matrix was taking  minutes and I projected about 24 hours for berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of pre computing sin() and cos() for latitudes and longitudes and he developed an approach that allows us to compute these values once and then solve a distance equation using super fast matrix maths. On top of that he came up with a “block” approach. At first this looks like it will be slower because we calculate the entire matrix as oppsoed to just the upper half.  In his approach Nick wrote out files for every block. In my approach I write blocks to a filebacked big matrix. Thats a bit slower but I get one file that I can then random access. Here’s the code

createDistanceMatrix <- function(Inventory, blockSize = 20, Directory= getwd(), filename
                                                                                    = “distance.bin”){
# based on code by Nick Stokes

            require(bigmemory)
            options(bigmemory.allow.dimnames=TRUE)
           if(!grepl(“\\.bin”,filename))stop(“please use a bin extension”)
           columnsPresent <- intersect(colnames(Inventory),c(“Id”,”Lon”,”Lat”))
           if(length(columnsPresent) != 3) stop(“missing the correct column names”)
           descName <- sub(“bin”,”desc”,filename)
           if(class(Inventory) == “matrix”) Inventory <- as.data.frame(Inventory)

          L <- cbind(Inventory$Lat,Inventory$Lon)*(pi/180)
          s <- cbind(cos(L),sin(L))
        # z is the array of 3D coords of stations. All unit vecs
         z <- cbind(s[,1]*s[,2],s[,1]*s[,4],s[,3])

             z2 <- z/sqrt(2) # Better to mult now than after outer prod
            D <- 6371*2 #Diam of Earth

          n <- nrow(L)
          if(n <= blockSize)stop(“BlockSize must be less than rows in Inventory”)
         blocks <- n %/% blockSize
         if((n %% blockSize) > 0)blocks <- blocks + 1
          dex <- 1:blockSize

              BM <- filebacked.big.matrix(nrow = n , ncol = n,
                                                                                dimnames = list(as.list(Inventory$Id),NULL),
                                                                                init = NA,
                                                                               backingpath = Directory,
                                                                               backingfile = filename,
                                                                              descriptorfile = descName,
                                                                              type = “double”)
                              startTime <- Sys.time()

                              for(i in 1:blocks){

                                        p <- dex + (i-1)*10
                                        p <- p[p<= n]
                                         for(j in 1:blocks){
                                             q <- dex +(j-1)*10
                                             q <- q[q<=n]
                                             x <- 0.5000000001-z2[p,]%*%t(z2[q,])

                                            x <- D * asin(sqrt(x))
                                           if(identical(p,q))diag(x)<-0
                                          BM[p,q]<- x

                           }
             }

print(Sys.time()-startTime)
return(BM)
}

That takes 46 seconds.  orders of magintude improvement


To leave a comment for the author, please follow the link and comment on his blog: Steven Mosher's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags:

Comments are closed.