Nick Stokes Distance code, now with Big Memory

Posted on April 12, 2012 by Steven Mosher in R bloggers | 0 Comments

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I was struggling with getting a big memory version of the distance matrix to work fast. Nick and other readers had some suggestions and after puttering around with Nicks code I’ve adapted it to big memory and not impacted the run time very much. For comparison writing a 10K by 10K matrix was taking minutes and I projected about 24 hours for berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of pre computing sin() and cos() for latitudes and longitudes and he developed an approach that allows us to compute these values once and then solve a distance equation using super fast matrix maths. On top of that he came up with a “block” approach. At first this looks like it will be slower because we calculate the entire matrix as oppsoed to just the upper half. In his approach Nick wrote out files for every block. In my approach I write blocks to a filebacked big matrix. Thats a bit slower but I get one file that I can then random access. Here’s the code

createDistanceMatrix <- function(Inventory, blockSize = 20, Directory= getwd(), filename
= “distance.bin”){
# based on code by Nick Stokes

require(bigmemory)
options(bigmemory.allow.dimnames=TRUE)
if(!grepl(“\\.bin”,filename))stop(“please use a bin extension”)
columnsPresent <- intersect(colnames(Inventory),c(“Id”,”Lon”,”Lat”))
if(length(columnsPresent) != 3) stop(“missing the correct column names”)
descName <- sub(“bin”,”desc”,filename)
if(class(Inventory) == “matrix”) Inventory <- as.data.frame(Inventory)

L <- cbind(Inventory$Lat,Inventory$Lon)*(pi/180)
s <- cbind(cos(L),sin(L))
# z is the array of 3D coords of stations. All unit vecs
z <- cbind(s[,1]*s[,2],s[,1]*s[,4],s[,3])

z2 <- z/sqrt(2) # Better to mult now than after outer prod
D <- 6371*2 #Diam of Earth

n <- nrow(L)
if(n <= blockSize)stop(“BlockSize must be less than rows in Inventory”)
blocks <- n %/% blockSize
if((n %% blockSize) > 0)blocks <- blocks + 1
dex <- 1:blockSize

BM <- filebacked.big.matrix(nrow = n , ncol = n,
dimnames = list(as.list(Inventory$Id),NULL),
init = NA,
backingpath = Directory,
backingfile = filename,
descriptorfile = descName,
type = “double”)
startTime <- Sys.time()