A few days ago, I wrote here about how some researchers, such as Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique (random effects models) for a new, new application (recommender systems). In addition to describing their approach to that problem, I also used this setting as an example of my partools package, in particular the parallel computation method I call Software Alchemy (SA). I illustrated SA on some movie ratings data, and got a nice speedup of the lmer() function in the package lme4. The run time with SA was a little over 3 minutes, compared to nearly 20 minutes without it.
However, yesterday Doug Bates, one of the developers of lme4, wrote a comment on my blog, reporting that the run time for the same problem on his desktop machine was about 3 minutes, compared to my 20. This was quite intriguing, and I replied that I would investigate. I also mentioned that I would make a new post out of the matter, as I suspected it would be of general interest. Hence the post you are now reading.
I re-ran my code on the same machine I had used earlier, a PC running Linux in our student lab. I’ll skip the specs here, but suffice it to say it is an ordinary machine, about a year old. At any rate, I got basically the same run times as before.
It then occurred to me that Doug would probably have installed a fancy BLAS (the Basic Linear Algebra Subprograms, the low-level routines that carry out R’s matrix operations) on his machine, whereas our student machine would just be running stock R, and therein lay the source of the big time discrepancy. You may recall my posting here on OpenBLAS, one of the fast alternatives to the BLAS version that comes with R.
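For readers who want to check which BLAS their own R session is linked against, here is a minimal sketch. It assumes R 3.4.0 or later, where extSoftVersion() reports a "BLAS" entry and La_library() is available; on some platforms the BLAS entry may be an empty string. The matrix size and the choice of crossprod() as the benchmark are arbitrary illustrations, not the lme4 workload from this post.

```r
# Which BLAS/LAPACK is this R session linked against?
# Assumes R >= 3.4.0; older versions lack the "BLAS" entry and La_library().
blas_path <- extSoftVersion()[["BLAS"]]
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")

# Rough timing of a BLAS-heavy operation.
# The 1000 x 1000 size is an arbitrary choice for illustration.
set.seed(1)
n <- 1000
m <- matrix(rnorm(n * n), nrow = n)
elapsed <- system.time(mm <- crossprod(m))[["elapsed"]]
cat("crossprod of a", n, "x", n, "matrix took", elapsed, "seconds\n")
```

Running the same script on a stock R installation and on one built against OpenBLAS gives a quick, if crude, sense of how much the BLAS matters on a given machine.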
So, I switched to a machine on which I had installed OpenBLAS. This machine is actually slower than the other one in terms of clock speed, but it does have 16 cores, and a hyperthreading degree of 2, so that 32 threads might profitably run simultaneously. (The first machine was quad-core with hyperthreading degree 2, but as noted, I didn’t have OpenBLAS installed there.) So, this second machine would make a fine testbed for assessing the performance of lme4 under OpenBLAS.
Sure enough, on this nominally slower machine, the problem ran in only about 8 minutes, not 20. And when I ran the top command from another shell window, I saw via the “% CPU” column that indeed many cores were at work, with the number fluctuating in the range 10-32. Remember, running on k cores doesn’t necessarily mean a speedup of k, and often we get much less than that, but you can see that running a good BLAS can work wonders for the speed.
Note that, as mentioned briefly in my last post, SA can achieve superlinear speedup in some instances, meaning a speedup of MORE than k for k cores. As explained in my book, this occurs when the time complexity for n data points is more than O(n). In my lme4 example, there was about a 6X speedup for just 4 cores.
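A back-of-the-envelope version of that argument, under assumptions that are only illustrative: suppose the run time on n data points is proportional to n^a for some a > 1 (the symbols c and a are notation introduced here, not measured properties of lmer()), the data are split into k equal chunks handled in parallel, and the cost of combining the chunk results is negligible.

$$
T_{\text{full}}(n) = c\,n^{a},
\qquad
T_{\text{SA}}(n,k) \approx c\left(\frac{n}{k}\right)^{a},
\qquad
\text{speedup} = \frac{T_{\text{full}}(n)}{T_{\text{SA}}(n,k)} \approx k^{a} > k .
$$

Under this simple model, a 6X speedup on k = 4 cores would correspond to a = log 6 / log 4 ≈ 1.29. That figure is only what the model implies from the two numbers above, not an established complexity for lme4.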