Going parallel: understanding load balancing in R

[This article was first published on R – Neural Discharge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With most computers now having many cores, parallelising your R code is a great way to reduce processing time, and there are lots of packages to help. But if you benchmark your new code, you may find it is not as fast as you expected. For example, you might assume that splitting the work across 4 cores might make the code run 4 times faster, but in fact it could be less than twice as fast. Still an improvement, but not great. This happens when each of the tasks has a different run time, and the tasks have been poorly allocated to the different cores. Thus most of your cores sit idle while one core does most of the work.

Load balancing is the idea of evenly distributing tasks to the cores to minimise overall run time. As we will see, it can make a massive difference to your code performance.

Consider this function, which does nothing for x2 seconds. Obviously not very useful, but the runtime of this function varies dramatically for different values of x.

slow_func <- function(x){
    Sys.sleep(x ** 2)
}

And suppose we have 20 tasks to do.

vals <- rep(1:10, 2)

On a single core this would take 770 seconds (12.8 minutes). Can we speed this up by splitting the task across 4 cores?

The default way that R allocates these tasks out to 4 cores is to break the tasks into 4 batches in the order they are provided. This is done by the internal function parallel:::splitList().

Figure 1: The R default tasks distributed in the order they are provided, 171.4% of the optimal time

The result is that two of the cores get most of the long running tasks while the other two get most of the quick tasks. This is why most of the R packages offer an option to dynamically load balance, although it is usually turned off by default. When dynamic load balancing is used instead of allocating all the task at the beginning R will give each core one task.  Subsequent tasks are allocated to cores as they finish tasks. Thus a core with a short task is quickly reallocated new work, while a core with a slow task may only do one or two tasks for the whole run.

Figure 2: Tasks in the order they are provided and then dynamically load balanced, 150.7% of the optimal time

This significantly reduces the imbalance between cores, but is also far from optimal. Using load balancing also provides an overall performance hit as the workers must spend more time “talking” to the main R process and transferring data back and forth.

Can we do any better? Well, in our example, we know exactly how long each task will take, so we should be able to pre-allocate the task in a more efficient manner. But even in the real world, you likely have some idea of which tasks will be fast and slow. For example, if you are processing many data frames, the number of rows in each data frame may be proportional to the processing time.

If we do have a measure of task duration, we always want to do the slowest tasks first. In the previous example, some of the longest tasks were the last to be allocated, which makes it difficult of the load balancer to allocate the tasks efficiently. But, a word of warning, sorting the tasks without load balancing can be disastrously slow.

Figure 3: Tasks in decreasing order of run time, without dynamic load balancing, 221.3% of the optimal time, a terrible idea

In this case we have given Core 1 all the slow tasks, and Core 4 all the fast tasks. This is about as bad as you can possibly do at load balancing. But if we endable dynamic load balancing, then slowest first works really well. Note that sorting the data changes the order of results, if this matters to your code, you will need to record the original order so that you can “unsort” your data once the parallel process has been completed.

Figure 4: Task in decreasing order of run time, then dynamically load balanced, 100.8% of the optimal time

It is also possible to get a similar distribution of tasks without dynamic load balancing using what I call a zig-zag sort.

zigzag_sort <- function(x, n = 4){
    x <- x[order(x, decreasing = TRUE)]
    sortvec <- rep(c(seq(1,n),seq(n, 1)), length = length(x))
    sortvec <- order(sortvec)
    x <- x[sortvec]
    return(x)
}

This function will sort a vector from highest to lowest but zig zagging across the cores in a 1,2,3,4,4,3,2,1 pattern. The n argument can be changed to reflect the number of cores you have.

Figure 5: Zig-Zag sorted without dynamic load balancing, 102.9% of optimal time.

A zig-zag sort is almost a good as dynamic load balancing, but in the real world would not have the overheads caused by dynamic load balancing. Depending on the nature of your task the best option may vary.

That’s enough theory; how does this work in practice? I wrote some short code to test parallelising my task using different packages. Each function also gives the option to use zig-zag sorting.

method_lapply <- function(vals, sort){
   if(sort){
     vals <- zigzag_sort(vals, 4)
   }
   r <- lapply(vals, slow_func)
   r <- unlist(r)
   return(r)
 }

 method_parlapply <- function(vals, sort){
   if(sort){
     vals <- zigzag_sort(vals, 4)
   }
   cl <- parallel::makeCluster(4)
   r <- parallel::parLapply(cl = cl, vals, slow_func)
   parallel::stopCluster(cl)
   r <- unlist(r)
   return(r)
 }

 method_parlapplyLB <- function(vals, sort){
   if(sort){
     vals <- zigzag_sort(vals, 4)
   }
   cl <- parallel::makeCluster(4)
   r <- parallel::parLapplyLB(cl = cl, vals, slow_func)
   parallel::stopCluster(cl)
   r <- unlist(r)
   return(r)
 }

 method_foreach <- function(vals, sort, preschedule){
   if(sort){
     vals <- zigzag_sort(vals, 4)
   }
   opts <- list(preschedule = preschedule)
   cl <- parallel::makeCluster(4)
   doSNOW::registerDoSNOW(cl)
   boot <- foreach::foreach(i = vals,
                            .export = c("slow_func", "zigzag_sort"),
                            .options.snow = opts)
   r <- foreach::%dopar%(boot, slow_func(i))
   parallel::stopCluster(cl)
   r <- unlist(r)
   return(r)
 }

 method_future_lapply <- function(vals, sort, scheduling){
   if(sort){
     vals <- zigzag_sort(vals, 4)
   }
   future::plan("future::multisession", workers = 4)
   r <- future.apply::future_lapply(vals,
                                    slow_func,
                                    future.scheduling = scheduling,
                                    future.chunk.size = NULL)
   future::plan("sequential")
   r <- unlist(r)
   return(r)
 }

Results

As you can see from the table below, all the parallel options are faster than using a single core, but the choice of method does matter. Using future_lapply is 54% faster than a simple parlappy, and future_lapply with zig-zag sorting and dynamic load balancing is pretty close to the optimal possible runtime of 3.2 minutes.

It is also clear that the base R functions parlappy and parlappyLB do not do as well as some of the newer packages. Zig-Zag sorting always improved performance, although the difference is minor when dynamic load balancing is also used.

FunctionZig-Zag SortedDynamic Load BalanceRun Time (min)Relative Runtime 
lapplyNoNo12.993.884Reference Singe Core Run
parlappyNoNo7.212.156 
parlappyYesNo5.581.667 
parlappyLBNoYes5.551.661 
future_lapplyNoNo5.531.663 
parlappyLBYesYes5.321.589 
future_lapplyYesNo4.061.221 
foreachNoNo3.511.048 
foreachYesNo3.501.047 
future_lapplyNoYes3.491.005 
foreachNoYes3.381.012 
foreachYesYes3.351.002 
future_lapplyYesYes3.321 

For your code, you will need to benchmark different load balacing options to see which works best for your use case. But hopefully this blog has at least shown how important it is to get load balancing right.

To leave a comment for the author, please follow the link and comment on their blog: R – Neural Discharge.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)