Parallel R (and air travel)

My heart sinks a little when I check on my laptop in the morning and find that the computation I started the night before still hasn't finished. Even when the data I'm playing with isn't particularly... large... (I'm not going to say it), I have a knack for choosing expensive algorithms late at night.

Because of my reluctance to get remote servers and tmux involved at ungodly hours, I've been exploring ways to speed up my code without too much trouble (I take the first of the Three Virtues of Programming, laziness, very seriously) and without having to translate from my native tongue: R. The main options I've come across:

  • Writing idiomatic R
    Take advantage of the way R stores numerics in contiguous blocks of RAM and uses compiled C code for vectorized functions (see the sketch after this list).

  • Using Rcpp or inline
    Identify bottlenecks and replace time-critical inner loops with C or C++ code.

  • Distributing processing over several machines
    It's relatively easy to set up a computing cluster using Amazon Web Services and the snow package, or a similar package.

  • Taking advantage of multiple cores
    Besides writing idiomatic R (which is the subject of a future post), I've found that the easiest way to get a speedup is to use parallel processing on a single machine with the multicore package. This usually gives a very substantial speedup, even if the code is less than elegant, as we'll see.
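To give a taste of the first item, here's a minimal, self-contained sketch (the function names are mine) contrasting an explicit loop with the equivalent vectorized call; the vectorized version runs through compiled C code over contiguous memory and is usually faster by a couple of orders of magnitude:

x <- rnorm(1e6)

# sum of squared deviations, one element at a time
slow.ss <- function(x){
  m <- mean(x)
  total <- 0
  for (i in seq_along(x)) total <- total + (x[i] - m)^2
  return(total)
}

# the same quantity, in one vectorized pass
fast.ss <- function(x){ sum((x - mean(x))^2) }

system.time(slow.ss(x))   # typically much slower...
system.time(fast.ss(x))   # ...than the vectorized version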

Testing it:
Just for the sake of choosing a problem that exhibits combinatorial explosion, let's use the multicore package to compute the mean distance between all pairs of airports in the U.S.

If we think of the air-travel network as a complete graph where the vertices are airports and the bee-line between any two airports is an edge, we want to find the average length of the edges. The number of edges in a complete graph is $\frac{n(n-1)}{2}$, where $n$ is the number of vertices. If this looks familiar, it's because it's the binomial coefficient $\binom{n}{2}$. Intuitively, this makes sense, since every edge connecting two vertices is a unique pairing of two airports. The fact that this number exhibits quadratic growth, and that the distance between two airports does not depend on the distance between any other two, makes this a prime candidate for parallelization.

The dataset of US airport codes and their longitude and latitude is available here. There are 13,429 airport codes, which makes the total number of combinations 90,162,306. Clearly, extreme measures have to be taken to make this a tractable problem on commodity hardware.
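Both figures are quick to verify in R:

# sanity check of the pair count quoted above
n <- 13429
choose(n, 2)      # 90162306
n * (n - 1) / 2   # same figure, straight from the formula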

Since the Earth isn't flat (it's not even, strictly speaking, a perfect sphere), the ground distance covered by a degree of longitude or latitude isn't constant. Luckily, there's a great package, Imap, whose gdist function converts a pair of long/lat points into a distance in miles or kilometers.
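For instance (the JFK/LAX coordinates below are approximate and purely for illustration):

library(Imap)

# great-circle distance between two long/lat points, in kilometers
gdist(-73.78,  40.64,    # longitude and latitude of JFK (approx.)
      -118.41, 33.94,    # longitude and latitude of LAX (approx.)
      units="km")        # roughly 3,980 km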

The code:

library(multicore)
library(Imap)

# distance in km between the airports in rows ind1 and ind2 of air.locs
# (column 3 holds longitude, column 2 latitude)
calc.distance.two.rows <- function(ind1, ind2){
  return(gdist(air.locs[ind1, 3],
               air.locs[ind1, 2],
               air.locs[ind2, 3],
               air.locs[ind2, 2], units="km"))
}

# sum the distances over every index pair in the columns of a.mat
sum.of.distances <- function(a.mat){
  return(sum(apply(a.mat, 2, function(x){ 
                               calc.distance.two.rows(x[1], x[2]) 
                             })))
}

# read airport long/lat data set
air.locs <- read.csv("airportcodes.csv", stringsAsFactors=FALSE)

# read command-line args
args <- commandArgs(TRUE)

# read the number of airports to use (sample size) from the command-line
smp.size <- as.numeric(args[1])

# choose a random sample of airports
sampling <- sample((1:nrow(air.locs)), smp.size)

# shrink dataframe
air.locs <- air.locs[sampling, ]

# create a matrix of unique subsets of indices from the
# data frame that stores the airport information
combos <- combn(1:nrow(air.locs), 2)
num.of.comps <- ncol(combos)

# use single core
single <- function(){
  the.sum <- sum.of.distances(combos)
  result <- the.sum / num.of.comps
  return(result)
}

# use two cores: parallel() forks a child process for each half of the
# pairs; collect() blocks until both children return their sums
mult <- function(){
  half <- floor(num.of.comps/2)
  f1 <- parallel(sum.of.distances(combos[, 1:half]))
  f2 <- parallel(sum.of.distances(combos[, (half+1):num.of.comps]))
  the.sum <- sum(as.vector(unlist(collect(list(f1, f2)))))
  result <- the.sum / num.of.comps
  return(result)
}

# compare the execution times (in time elapsed)
perform <- function(){
  first <- system.time(res1 <- single())
  second <- system.time(res2 <- mult())
  cat(smp.size, ",first,",  first[3],  ",", res1, "\n", sep="")
  cat(smp.size, ",second,", second[3], ",", res2, "\n", sep="")
}

perform()
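(A note for anyone reading this on a newer version of R: the multicore package was absorbed into the base parallel package as of R 2.14, where parallel() and collect() live on as mcparallel() and mccollect(). A minimal sketch of the same two-core split under that API, assuming the functions and data defined above, would be:)

library(parallel)

# same two-way split as mult() above, using the base 'parallel' package;
# like multicore, this relies on forking and so won't work on Windows
mult.modern <- function(){
  half <- floor(num.of.comps/2)
  f1 <- mcparallel(sum.of.distances(combos[, 1:half]))
  f2 <- mcparallel(sum.of.distances(combos[, (half+1):num.of.comps]))
  the.sum <- sum(unlist(mccollect(list(f1, f2))))
  return(the.sum / num.of.comps)
}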

Then, I wrote a wrapper program in Python that compared the speeds using sample sizes from 10 to 800 in increments of ten.
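The wrapper itself was nothing special; an equivalent loop in R (swapping the Python for R, and with a made-up script name, "air-dist.R") would look something like:

# hypothetical R stand-in for the Python wrapper: run the script above
# for each sample size and append its CSV output to a log file
con <- file("timings.csv", open="a")
for (smp.size in seq(10, 800, by=10)) {
  out <- system2("Rscript", args=c("air-dist.R", smp.size), stdout=TRUE)
  writeLines(out, con)
}
close(con)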

The results:

Time comparison between single-core and multicore execution time, in seconds. The curves were smoothed using LOESS.

The parallelized solution is much faster, but not twice as fast, as one might expect. This is because there is overhead involved in forking the processes and collecting their results, and because part of the program (namely, the reading of the dataset) is not parallelized. (For more information, check out Amdahl's law.)

Some cursory curve-fitting suggests that the single-core solution's execution time is fairly well modeled by the function $0.00014\,n(n-1)$, and the dual-core solution by $0.00009\,n(n-1)$. For the full set of 13,429 airports, that works out to a total time to completion of about 7 hours and 4.5 hours, respectively.
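A fit like that can be had with a one-parameter, zero-intercept linear model; here's a sketch assuming the timings were collected into a data frame called timings with columns n and secs (my names, not anything the script above produces):

# fit execution time as c * n * (n - 1): a linear model in n(n-1)
# with no intercept; 'timings' with columns n and secs is assumed
fit <- lm(secs ~ I(n * (n - 1)) - 1, data=timings)
coef(fit)   # the constant c (about 1.4e-4 for the single-core runs)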

After about an hour, I psychosomatically smelled melting plastic from my laptop, so I called off the full distance calculation. But repeated trials with sample sizes of 300 suggested that the sampling distribution of the sample mean (whew) had a mean of about 1,874 km and a standard deviation of 64 km (this is the standard error of the mean). This suggests that the mean distance between any two US airports is about 1,874 ± 125 km (1,164 ± 78 miles), where the ±125 is 1.96 standard errors, with a standard deviation across individual airport pairs of around 1,109 km (689 miles).
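That interval is just the usual 95% normal-approximation confidence interval, which is quick to check:

# back-of-the-envelope check of the interval quoted above
m  <- 1874   # mean of the sampling distribution of the mean, in km
se <- 64     # standard error of the mean, in km
m + c(-1, 1) * 1.96 * se   # about 1749 to 1999, i.e. 1874 +/- 125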

As this example shows, even a kludge-y solution that uses more than one core can give you pretty substantial speed increases. In future posts, I plan to cover applying Rcpp to this problem and setting up a cluster to perform this calculation. In the meantime, I'll still be starting things at night and regretting it in the morning.

Footnote: If you're interested in this topic, I suggest you check out this site from CRAN. Also, everything I know about parallel R, I learned from this book.

