Too Much Parallelism is as Bad

[This article was first published on R – Quintuitive, and kindly contributed to R-bloggers.]

The other day I ran a machine learning backtest on a new data set. Once I got through the initial LDA and QDA run, I decided to try xgboost. The first thing I observed was really bad performance. The results of the debugging session that followed were quite surprising to me.

I have been using the same framework for a few years now. I think there are some examples outlining the approach even on this blog, but I am too lazy to dig them out now. Without going into further details, let me outline my “stack”:

[Figure: stack]

As I mentioned, I have been using this stack for a few years now, and during this time I have seen some really slow models. Two factors made me suspicious in this case:

  • I was using a new method – yeah, my first attempt with xgboost.
  • The data set was rather small and simple.

What I found out was that there was too much parallelization happening. Somehow, all these threads and processes were getting in each other's way, and, although there was progress, it was glacially slow.

Looking at the stack, the parallelization is not that obvious. Certainly, I was using multiple processes via the parallel package, but I was seeing a lot more threads running than that alone would explain. The culprit in this case was the default parallelization in xgboost. Nowadays apparently every layer tries to exploit multiple cores, so that wasn’t surprising, just something new to me.
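To get a feel for how quickly this adds up, here is a back-of-the-envelope sketch. The numbers are assumptions, not measurements from my backtest: one worker process per core, and a model library that also defaults to one thread per core inside each worker.

```r
library(parallel)

# Logical cores on this machine; detectCores() can return NA, so fall back to 1.
cores <- detectCores()
if (is.na(cores)) cores <- 1L

# Assumption: one worker process per core, a common setup with the parallel package.
workers <- cores

# Assumption: inside each worker, the model library spawns one thread per core.
threads_per_worker <- cores

# Total threads competing for the same set of cores.
total_threads <- workers * threads_per_worker
oversubscription <- total_threads / cores

cat(sprintf("%d cores, %d workers x %d threads = %d threads (%.0fx oversubscribed)\n",
            cores, workers, threads_per_worker, total_threads, oversubscription))
```

Under these assumptions the thread count grows with the square of the core count, so the more cores the machine has, the worse the contention gets.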

The fix ended up being quite simple: call caret’s train with nthread=1, which is in turn passed to xgb.train and solves the problem.
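A minimal sketch of such a call is below. The toy data, the single-row tuning grid and the 3-fold cross-validation are made up for illustration and have nothing to do with my actual backtest; the only part that matters is the extra nthread = 1 argument, which train forwards to the underlying xgboost fit. It assumes the caret and xgboost packages are installed.

```r
library(caret)

# Hypothetical toy data, purely for illustration.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5,
            dimnames = list(NULL, paste0("f", 1:5)))
y <- factor(ifelse(x[, 1] + rnorm(200) > 0, "up", "down"))

# A single-row grid keeps the example fast; the values are arbitrary.
grid <- expand.grid(nrounds = 50, max_depth = 2, eta = 0.3, gamma = 0,
                    colsample_bytree = 1, min_child_weight = 1, subsample = 1)

fit <- train(
  x = x, y = y,
  method = "xgbTree",
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = grid,
  nthread = 1  # not a caret argument: passed through `...` to xgboost
)
```

When the outer loop already runs one process per core, keeping each model single-threaded leaves the parallelism in one layer only.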

Looking at the stack above, I realized that there might be other, similar issues. For instance, Microsoft’s R Open provides some multi-threaded improvements via Intel’s MKL library. In my case, that was not causing any observable problems, but if it does, the threading can be disabled via:

setMKLthreads(1)
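Since setMKLthreads comes from the RevoUtilsMath package that ships with Microsoft R Open, a script that also has to run on a stock R installation may want to guard the call. A small sketch, where the has_mkl variable is just a name chosen for this example:

```r
# RevoUtilsMath ships with Microsoft R Open; a stock CRAN R install does not have it.
has_mkl <- requireNamespace("RevoUtilsMath", quietly = TRUE)

if (has_mkl) {
  # Restrict MKL to a single thread and report the new setting.
  RevoUtilsMath::setMKLthreads(1)
  message("MKL threads now: ", RevoUtilsMath::getMKLthreads())
} else {
  message("RevoUtilsMath not available; nothing to change.")
}
```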

Now everything is up and running, and I am looking forward to the output.
