(This article was first published on **Steven Mosher's Blog**, and kindly contributed to R-bloggers)

In my last post I was struggling to get a big-memory version of the distance matrix to work fast. Nick and other readers had some suggestions, and after puttering around with Nick's code I've adapted it to bigmemory without impacting the run time very much. For comparison, writing a 10K by 10K matrix was taking minutes, and I projected about 24 hours for the Berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of precomputing sin() and cos() for the latitudes and longitudes, and developed an approach that computes these values once and then solves the distance equation with super-fast matrix math. On top of that he came up with a "block" approach. At first this looks like it will be slower, because we calculate the entire matrix as opposed to just the upper half. In his approach Nick wrote out a file for every block; in my approach I write the blocks to a file-backed big matrix. That's a bit slower, but I get one file that I can then access randomly. Here's the code:
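To see the idea in miniature before the full function: convert each lat/lon to a 3D unit vector once, and the great-circle distance for every pair falls out of a single matrix product. The two station coordinates below are made up purely for illustration:

```r
# Two made-up stations: roughly Denver and Los Angeles
deg2rad <- pi/180
lat <- c(39.7, 34.1) * deg2rad
lon <- c(-105.0, -118.2) * deg2rad
z  <- cbind(cos(lat)*cos(lon), cos(lat)*sin(lon), sin(lat))  # unit vectors on the sphere
z2 <- z / sqrt(2)             # pre-scale so the outer product needs no later division
x  <- 0.5 - z2 %*% t(z2)      # equals sin^2(theta/2) for every station pair
x  <- pmin(pmax(x, 0), 1)     # clamp rounding noise before taking sqrt
d  <- 6371 * 2 * asin(sqrt(x))  # great-circle distance in km
```

The key point is that all the trig happens once per station, not once per pair; the pairwise work is a single `%*%`.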

```r
createDistanceMatrix <- function(Inventory, blockSize = 20,
                                 Directory = getwd(),
                                 filename = "distance.bin"){
  # based on code by Nick Stokes
  require(bigmemory)
  options(bigmemory.allow.dimnames = TRUE)
  if(!grepl("\\.bin", filename)) stop("please use a bin extension")
  columnsPresent <- intersect(colnames(Inventory), c("Id", "Lon", "Lat"))
  if(length(columnsPresent) != 3) stop("missing the correct column names")
  descName <- sub("bin", "desc", filename)
  if(is.matrix(Inventory)) Inventory <- as.data.frame(Inventory)

  L <- cbind(Inventory$Lat, Inventory$Lon) * (pi/180)
  s <- cbind(cos(L), sin(L))  # columns: cos(lat), cos(lon), sin(lat), sin(lon)
  # z is the array of 3D coords of stations. All unit vectors
  z <- cbind(s[,1]*s[,2], s[,1]*s[,4], s[,3])
  z2 <- z/sqrt(2)  # better to multiply now than after the outer product
  D <- 6371*2      # diameter of the Earth (km)
  n <- nrow(L)
  if(n <= blockSize) stop("blockSize must be less than the rows in Inventory")
  blocks <- n %/% blockSize
  if((n %% blockSize) > 0) blocks <- blocks + 1
  dex <- 1:blockSize

  BM <- filebacked.big.matrix(nrow = n, ncol = n,
                              dimnames = list(as.list(Inventory$Id), NULL),
                              init = NA,
                              backingpath = Directory,
                              backingfile = filename,
                              descriptorfile = descName,
                              type = "double")
  startTime <- Sys.time()
  for(i in 1:blocks){
    p <- dex + (i-1)*blockSize   # row indices of this block
    p <- p[p <= n]
    for(j in 1:blocks){
      q <- dex + (j-1)*blockSize # column indices of this block
      q <- q[q <= n]
      # tiny offset guards against sqrt of a negative from rounding error
      x <- 0.5000000001 - z2[p, ] %*% t(z2[q, ])
      x <- D * asin(sqrt(x))
      if(identical(p, q)) diag(x) <- 0
      BM[p, q] <- x
    }
  }
  print(Sys.time() - startTime)
  return(BM)
}
```

That takes 46 seconds: an orders-of-magnitude improvement.
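The payoff of the file-backed approach is that the matrix survives the R session: any later session can re-attach it via its descriptor file and read rows at random without recomputing anything. Here's a minimal sketch of that round trip using a small throwaway matrix in a temp directory (the file names and values here are illustrative, not the ones the function above writes):

```r
library(bigmemory)
dir <- tempdir()
BM <- filebacked.big.matrix(nrow = 4, ncol = 4, init = NA, type = "double",
                            backingpath = dir,
                            backingfile = "demo.bin",
                            descriptorfile = "demo.desc")
BM[1, ] <- c(0, 120, 350, 80)   # pretend row of distances (km)
rm(BM); gc()                    # drop the R object; the backing file remains on disk
BM2 <- attach.big.matrix(file.path(dir, "demo.desc"))
nearest <- order(BM2[1, ])[2]   # nearest station to station 1 (index 2 skips itself)
```

With the real output, the same `attach.big.matrix()` call against `distance.desc` gives random access to any row of the full distance matrix.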
