Alright, let’s test some parallelization functionalities in R.

The machine:
MacBook Air (mid-2013) with 8 GB of RAM and an Intel Core i7-4650U (Haswell) CPU. This CPU is hyper-threaded, meaning (at least as I understand it) that it has two physical cores but can run up to four threads at once.
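R can report both core counts itself via detectCores() from the parallel package. A minimal sketch; the variable names are just for illustration, and on some platforms R cannot tell physical from logical cores, in which case the first call may return NA or the logical count:

```r
library(parallel)

# Physical cores (may be NA, or equal to the logical count, on platforms
# where R cannot distinguish the two)
n.physical <- detectCores(logical = FALSE)

# Logical cores, i.e. hardware threads including hyper-threading
n.logical <- detectCores(logical = TRUE)

cat("physical:", n.physical, " logical:", n.logical, "\n")
```

On a two-core hyper-threaded CPU like the one above, I would expect the two numbers to differ by a factor of two.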

The task:
Draw a number of cases from a normal distribution with a mean of 10 and a standard deviation of 30. Do this a hundred times and combine the results into one vector. The number of cases is varied from half a million to two million, in steps of half a million. The number of cores used by R is also varied (from 1 to 4). All of this is done 5 times, so we get multiple estimates of each run’s properties. Altogether, 80 runs are made: 5 repetitions x 4 n-cores x 4 n-cases = 80 runs.
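The design can be written down as a grid to double-check the run count. This is only a sketch of the bookkeeping, not the benchmark itself; the name design and its column names are made up for illustration:

```r
# One row per run: every combination of repetition, number of cases,
# and number of cores
design <- expand.grid(
  cores = 1:4,
  cases = c(500000, 1000000, 1500000, 2000000),
  rep   = 1:5
)

# Total number of runs: 4 x 4 x 5
nrow(design)
```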

The results:

This is quite interesting: We clearly see that the 3- and 4-core runs bring virtually no performance gain over the 2-core runs. I guess this is because the hyper-threaded CPU does not really have 4 physical cores available, so it makes little difference whether we assign 2, 3, or 4 cores to a task. The performance gain from 1 to 2 cores, however, is quite clear.
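To put a number on the gain, one option is to take the median time per combination of cores and cases and divide the single-core median by it. A minimal sketch, assuming the timings sit in a data frame with columns named cores, cases and secs; these names and the helper summarise.runs are hypothetical, so rename the columns of your own data frame first if they differ:

```r
# Hypothetical helper: median run time per (cores, cases) cell and the
# speed-up relative to the single-core median for the same number of cases.
# Assumes columns named "cores", "cases" and "secs".
summarise.runs <- function(df) {
  med <- aggregate(secs ~ cores + cases, data = df, FUN = median)
  # Single-core medians serve as the baseline for each number of cases
  base <- med[med$cores == 1, c("cases", "secs")]
  names(base)[2] <- "secs.1core"
  out <- merge(med, base, by = "cases")
  out$speedup <- out$secs.1core / out$secs
  out[order(out$cases, out$cores), ]
}

# Usage, given such a data frame called timings:
# summarise.runs(timings)
```

A speed-up close to 2 for the 2-core rows, with little change for 3 and 4 cores, would match the pattern described above.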

Code (plotting code not supplied):
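library(doParallel)
library(parallel)
result.df <- data.frame()

for (run in 1:5) {
  cat(run, "\n")
  for (cases in c(500000, 1000000, 1500000, 2000000)) {
    cat(cases, "\n")
    for (cores in c(1, 2, 3, 4)) {
      n.cores <- cores
      n.cases <- cases
      # Start a cluster with the requested number of workers
      cluster <- makeCluster(n.cores)
      registerDoParallel(cluster)
      t1 <- Sys.time()
      # Draw n.cases normal values 100 times and concatenate the results
      result.vec <- foreach(i = 1:100, .combine = c) %dopar% {
        rnorm(n.cases, mean = 10, sd = 30)
      }
      difft <- difftime(Sys.time(), t1, units = "secs")
      # Shut the workers down so they do not pile up across runs
      stopCluster(cluster)
      # Store one named row per run
      result.df <- rbind(result.df,
                         data.frame(cores = n.cores, cases = n.cases,
                                    secs = as.numeric(difft)))
    }
  }
}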