Test Drive of Parallel Computing with R

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]

Today, I did a test run of parallel computing with the snow and multicore packages in R and compared their performance with the single-threaded lapply() function.

In the test code below, a data.frame with 20M rows is simulated in an Ubuntu VM with an 8-core CPU and 10 GB of memory. As the baseline, the lapply() function is used to calculate the aggregation by group. For comparison, the parLapply() function in the snow package and the mclapply() function in the multicore package are used to generate the identical aggregated data.

n <- 20000000
set.seed(2013)
df <- data.frame(id = sample(20, n, replace = TRUE), x = rnorm(n), y = runif(n), z = rpois(n, 1))

library(rbenchmark)
# Time each approach 5 times; results are sorted by user CPU time
benchmark(replications = 5, order = "user.self",
  # Baseline: single-threaded aggregation by id
  LAPPLY = {
    cat('LAPPLY...\n')
    df1 <- data.frame(lapply(split(df[-1], df[1]), colMeans))
  },
  # snow: socket cluster with 8 workers
  SNOW = {
    library(snow)
    cat('SNOW...\n')
    cl <- makeCluster(8, type = "SOCK")
    df2 <- data.frame(parLapply(cl, split(df[-1], df[1]), colMeans))
    stopCluster(cl)
  },
  # multicore: 8 forked worker processes
  MULTICORE = {
    cat('MULTICORE...\n')
    library(multicore)
    df3 <- data.frame(mclapply(split(df[-1], df[1]), colMeans, mc.cores = 8))
  }
)

library(compare)
all.equal(df1, df2)
all.equal(df1, df3)
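
The multicore package has since been folded into the parallel package that ships with R, which provides both parLapply() and mclapply(). As a scaled-down sketch of the same comparison, assuming a 1,000-row data frame and 2 workers (both chosen here only for illustration, not taken from the test above), the three approaches can be checked for identical results roughly as follows. Note that mclapply() relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)  # ships with R; offers parLapply() and mclapply()

# Small illustrative data set (size is an assumption, not the post's 20M rows)
n <- 1000
set.seed(2013)
small <- data.frame(id = sample(20, n, replace = TRUE),
                    x = rnorm(n), y = runif(n), z = rpois(n, 1))

# Split the value columns by id once, then aggregate each group
groups <- split(small[-1], small[1])

# Baseline: single-threaded
res_lapply <- data.frame(lapply(groups, colMeans))

# Socket cluster, as in snow
cl <- makeCluster(2)
res_snow <- data.frame(parLapply(cl, groups, colMeans))
stopCluster(cl)

# Forked workers, as in multicore (forking is unavailable on Windows)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- data.frame(mclapply(groups, colMeans, mc.cores = cores))

all.equal(res_lapply, res_snow)
all.equal(res_lapply, res_fork)
```

Splitting once and reusing the groups object keeps the split cost out of the comparison; the original test repeats the split inside each timed block, so its timings include it.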

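Below is the benchmark output. As shown, the parallel solutions (SNOW and MULTICORE) use less than a third of the user time (user.self) of the baseline LAPPLY solution. Note that user.self only counts CPU time of the master R process; work done in worker processes is not included there (for MULTICORE it appears under user.child). In terms of elapsed time, the parallel runs are about 1.4 to 1.8 times faster than the baseline.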

       test replications elapsed relative user.self sys.self user.child sys.child
3 MULTICORE            5 101.075    1.000    48.587    6.620    310.771     7.764
2      SNOW            5 127.715    1.264    53.192   13.685      0.012     0.740
1    LAPPLY            5 184.738    1.828   179.855    4.880      0.000     0.000

Attaching package: ‘compare’

The following object is masked from ‘package:base’:

    isTRUE

[1] TRUE
[1] TRUE
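
To make the comparison concrete, the speedups can be worked out directly from the reported timings. The sketch below copies the elapsed and user.self figures from the benchmark output above into a small data frame; the column names elapsed_speedup and user_speedup are introduced here for illustration only.

```r
# Timings copied from the benchmark output above (seconds, 5 replications)
timings <- data.frame(
  test      = c("MULTICORE", "SNOW", "LAPPLY"),
  elapsed   = c(101.075, 127.715, 184.738),
  user.self = c(48.587, 53.192, 179.855)
)

base <- timings[timings$test == "LAPPLY", ]

# Speedup relative to the single-threaded baseline
timings$elapsed_speedup <- round(base$elapsed / timings$elapsed, 2)
timings$user_speedup    <- round(base$user.self / timings$user.self, 2)

print(timings)
```

The elapsed-time ratio is the more conservative of the two, since user.self leaves out CPU time spent in worker processes.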

To illustrate the CPU usage, screenshots were also taken to show the difference between the parallel and single-threaded runs.

The first screenshot shows that only 1 of the 8 CPUs is used at 100% with the lapply() function, while the remaining 7 are idle.
Screenshot from 2013-05-25 22:14:18

The second screenshot shows that all 8 CPUs are used at 100% with the parLapply() function in the snow package.
Screenshot from 2013-05-25 22:16:47

The third screenshot shows that all 8 CPUs are likewise used at 100% with the mclapply() function in the multicore package.
Screenshot from 2013-05-25 22:18:40
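
Before fixing a worker count such as 8, one way to check how many cores R can see is detectCores() from the parallel package. This is a minimal sketch; the cap of 8 simply mirrors the VM used in this post, and the variable names are hypothetical.

```r
library(parallel)

# Number of logical cores R detects (may be NA on some platforms)
n_cores <- detectCores()

# Use at most 8 workers (the size of the VM in this post), at least 1
workers <- if (is.na(n_cores)) 1L else max(1L, min(8L, n_cores))
cat("Detected cores:", n_cores, "- workers to use:", workers, "\n")
```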

