Load Balanced Parallelization with snowfall

March 5, 2013

(This article was first published on UEB Blog. Musings on R, and kindly contributed to R-bloggers)

For some reason, a few months ago I overlooked the best way to run a parallelized version of lapply with the snowfall package.

While developing our pipeline prototype for Exome Variant Analysis (https://launchpad.net/eva), we parallelized our lapply calls with the function sfLapply.

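However, I've just read the nice tutorial by Knaus & Porzelius (2009), and it made me realize that sfClusterApplyLB can be the better choice when you want a load-balanced version of your own code. sfLapply splits the input into one chunk per worker up front, so a worker that happens to get the heavy items keeps running while the others sit idle; sfClusterApplyLB instead hands the next item to whichever worker becomes free. The tutorial has a nice diagram that makes the difference clear: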

[Figure: load-balancing diagram from Knaus & Porzelius (2009)]


So we changed the critical line, which was a trivial edit, from:

# ...
  # seq_along() is safe even when file_list is empty (1:length() would yield c(1, 0))
  start3 <- Sys.time(); result2 <- sfLapply(seq_along(params$file_list), wrapper2.parallelizable.per.sample); duration <- Sys.time() - start3

# ...


to:

# ...
  # Same call, but items are handed out one at a time to whichever worker is free
  start3 <- Sys.time(); result2 <- sfClusterApplyLB(seq_along(params$file_list), wrapper2.parallelizable.per.sample); duration <- Sys.time() - start3

# ...


(As you can see, we are parallelizing per sample here, not per process within each sample. One thing at a time: we only have a few spare CPUs on our servers, and we are not running the pipeline on a real cluster yet.)

With our test datasets (a couple of small files for debugging purposes) we can't see any real difference, but we'll be glad to check the potential improvement (let's hope so) shortly with real-case scenarios, in which some samples are much bigger than others.
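To get a feeling for when the swap should matter, here is a minimal sketch with deliberately uneven tasks. Everything in it is made up for illustration: `durations` stands in for sample sizes, and `simulate_sample` is a hypothetical replacement for our real wrapper that only sleeps. It assumes snowfall is installed and that two local cores are available.

```r
library(snowfall)

# Made-up "sample sizes" expressed as seconds of work: four big samples
# listed first, then four small ones (illustrative values, not real data).
durations <- c(rep(0.3, 4), rep(0.03, 4))

# Hypothetical stand-in for wrapper2.parallelizable.per.sample:
# it just sleeps for the given time and returns it.
simulate_sample <- function(d) {
  Sys.sleep(d)
  d
}

sfInit(parallel = TRUE, cpus = 2)

# Static split: the input is cut into one chunk per worker before starting.
t_static <- system.time(
  res_static <- sfLapply(durations, simulate_sample)
)[["elapsed"]]

# Load balanced: each worker asks for the next item as soon as it is free.
t_lb <- system.time(
  res_lb <- sfClusterApplyLB(durations, simulate_sample)
)[["elapsed"]]

sfStop()

cat("sfLapply:         ", round(t_static, 2), "s\n")
cat("sfClusterApplyLB: ", round(t_lb, 2), "s\n")
```

Both calls return the same list of results; only the scheduling differs. With this ordering the static split puts all four slow items into the same chunk, so I would expect the load-balanced call to finish sooner, although the actual timings depend on the machine.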

My to-do list has a new entry for another interesting function, sfClusterApplySR, which is also explained in the standard snowfall vignettes:

Quote:
Another helpful function for long running clusters is sfClusterApplySR, which saves intermediate results after processing n-indices (where n is the amount of CPUs). If it is likely you have to interrupt your program (probably because of server maintenance) you can start using sfClusterApplySR and restart your program without the results produced up to the shutdown time.


We also hope to find some time in the coming months to test a similar parallelization with the "parallel" package (even though I have no clue yet whether it offers an equivalent approach for load-balanced parallelization).
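For what it's worth, the parallel package does ship load-balanced variants: clusterApplyLB and parLapplyLB (and mclapply accepts mc.preschedule = FALSE on Unix-alikes). Below is a small sketch of the first one, not something we have tried in the pipeline. The durations and the `run_task` function are hypothetical; each task reports the process id of the worker that ran it, so you can see how the tasks were distributed.

```r
library(parallel)

# Made-up task durations in seconds: one slow "sample", then seven quick ones.
durations <- c(0.5, rep(0.05, 7))

# Hypothetical task: pretend to work, then report which worker process ran it.
run_task <- function(d) {
  Sys.sleep(d)
  Sys.getpid()
}

cl <- makeCluster(2)
worker_pids <- unlist(clusterApplyLB(cl, durations, run_task))
stopCluster(cl)

# Number of tasks handled by each worker process
print(table(worker_pids))
```

If load balancing works as described, the worker that picks up the slow task should end up handling fewer tasks than the other one.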

Some day...
