March 3, 2013
By

(This article was first published on metvurst, and kindly contributed to R-bloggers)

Recently, I have been doing some analysis for a project I am involved in. In particular, I was interested what role pacific sea surface temperatures play with regard to rainfall in East Africa. I spare you the details as I am currently writing all this up into a paper which you can have a look at once published.

For this analysis, however, I am processing quite an amount of raster files. This led me to investigate the possibilities of the parallel package to speed up the process.

Here's a quick example on how to read in raster data (in this case 460 global sea surface temperature files of 1° x 1° degree resolution) using parallel

First, lets do it the conventional way and see how long that takes

library(raster)
library(rgdal)

### Input preparation ########################################################
inputpath <- "/media/tims_ex/sst_kili_analysis"

### list files in direcotry ##################################################
fnames_sst_r2 <- list.files(inputpath,
pattern = glob2rx(ptrn),
recursive = T)

### read into raster format ##################################################
system.time({
sst.global <- lapply(seq(fnames_sst_r2), function(i) {
raster(paste(inputpath, fnames_sst_r2[i], sep = "/"))
}
)
})

##    user  system elapsed
##  61.584   0.412  68.104


Now using library(parallel)

library(parallel)

system.time({
### set up cluster call ######################################################
cl <- makePSOCKcluster(4)

clusterExport(cl, varlist = c("inputpath", "fnames_sst_r2"),
envir=environment())
junk <- clusterEvalQ(cl, c(library(raster),
library(rgdal)))

### read into raster format using parallel version of lapply #################
sst.global.p <- parLapply(cl, seq(fnames_sst_r2), function(i) {
raster(paste(inputpath, fnames_sst_r2[i], sep = "/"))
}
)

### stop the cluster #########################################################
stopCluster(cl)
})

##    user  system elapsed
##   0.152   0.080  25.670


Not a crazy speed enhancement, but we need to keep in mind that the raster command does not read into memory. Hence, the speed improvements should be a lot higher once we start the calculations or plotting.

Finally, let's test whether the two methods produce identical results.

identical(sst.global.p, sst.global)

## [1] TRUE


to be continued…

sessionInfo()

## R version 2.15.3 (2013-03-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=C                 LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] rgdal_0.8-5   raster_2.0-41 sp_1.0-5      knitr_1.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3    evaluate_0.4.3  formatR_0.7     grid_2.15.3
## [5] lattice_0.20-13 stringr_0.6.2   tools_2.15.3