Fast parallel computing with Intel Phi coprocessors

May 19, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrew Ekstrom
Recovering physicist, applied mathematician and graduate student in applied Stats and systems engineering

We know that R is a great system for performing statistical analysis. The price is quite nice too 😉 . As a graduate student, I need a cheap replacement for Matlab and/or Maple. Well, R can do that too. I’m running a large program that benefits from parallel processing. RRO 8.0.2 with the MKL works exceedingly well.

For a project I am working on, I need to generate a really large matrix (10,000×10,000) and raise it to really high powers (like 10^17). This is part of my effort to model chemical kinetics reactions, specifically polymers. I’m using a Markov Matrix of 5,000×5,000 and now 10,000×10,000 to simulate polymer chain growth at femptosecond timescales.

At the beginning of this winter semester, I used Maple 18 originally. I was running my program on a Windows 7 Pro computer using an intel I7 – 3700K (3.5GHz) quad core processor with 32GB of DDR3 ram. My full program took, well, WWWWWAAAAAAAYYYYYYYY TTTTTTTTOOOOOOOO LLLLLOOOONNNNGGGGGG!!!!!!!!

After a week, my computer would still be running. I also noticed that my computer would use 12% -13% of the processor power. With that in mind, I went to the local computer parts superstore and consulted with the sales staff. I ended up getting a “Gamer” rig when I purchased a new AMD FX9590 processor (4.7GHz on 8 cores) and dropped it into a new mobo. This new computer ran the same Maple program with slightly better results. It took 4-5 days to complete… assuming no one else used the computer and turned it off.

After searching for a better method (meaning better software) for running my program, I decided to try R. After looking around for a few hours, I was able to rewrite my program using R. YEAH! Using the basic R (version 3.1.2), my new program only took a few days (2-3).  A nice feature of R is an improved BLAS and LAPACK and their implementation in R over Maple 18. Even though R 3.1.2 is faster than Maple 18, R only used 12%-13% of my processor. 

Why do I keep bringing up the 12%-13% CPU usage?  Well, it means that on my 8 core processor, only 1 core is doing all the work. (1/8 = 0.125) Imagine you go out and buy a new car. This car has a big V8 engine but, only 1 cylinder runs at a time. Even though you have 7 other cylinders in the car, they are NOT used. If that was your car, you would be furious. For a computer program, this is standard protocol. A cure for this type of silliness is to use parallel programming.   

Unfortunately, I AM NOT A PROGRAMMER! I make things happen with a minimal amount of typing. I’m very likely to use “default settings” because I’m likely to mistype something and spend an hour trying to figure out, “Is that a colon or a semi colon?” So when I looked around at other websites discussing how to compile and/or install different blas and lapack for R, I started thinking, “I wish I was taking QED right now. (QED = Quantum Electro-Dynamics)” I also use Windows, most of the websites I saw discussed doing this in Linux.

That led me to Revolution Analytics RRO. I installed RRO version 8.0.2 and the MKL available from here:

RRO uses Intel’s Math Kernel Library, which is updated and upgraded to run certain types of calculations in parallel. Yes, parallel processing in Windows, which is step one of HPC (High Performance Computing) and something many of my comp sci friends and faculty said was difficult to do.

A big part of my project is raising a matrix to a power. This is a highly parallelizable process.  By that I mean, calculating element A(n,n) in the new matrix does not depend upon the value of A(x,x) in the new matrix. They only care about what is in the old matrix. Using the old style (series) computing, you calculate A(1,1), then A(1,2), A(1,3) … A(n,n). With parallel programming, on my 8 core AMD processor, I can calculate A(1,1), A(1,2), A(1,3) … A(1,8) at the same time. If these calculations were “perfectly parallel” I would get my results 8 times faster. For those of us that have read other blog posts on, you know that the speed boost for parallel programming is great, but not perfect. (Almost like it follows the laws of thermodynamics.) By using RRO, I was able to run my program in R and get results for all of my calculations in 6-8 hours. That got me thinking.

If parallel processing on 8 cores instead of series processing on 1 core is a major step up, can I boost the parallel processing possibility? Yes. GPU processors like the Tesla and FirePro are nice and all but:

1)      Using them with R requires programming and using Linux. Two things I don’t have time to do.

2)      Entry level Tesla and Good Firepro GPUs cost a lot of money. Something I don’t have a lot of right now.

The other option is using an Intel Phi coprocessor, or two. Fortunately, when I started looking, I could pick up a Phi coprocessor for cheap. Like $155 cheap for a brand new coprocessor from an authorized retailer. The video card in my computer cost more than my 2 Phi’s. The big issue, is getting a motherboard that has the ability to handle the Phi’s. Phi coprocessors have 6+GB of ram. Most mobo’s can’t handle more than 4GB of ram through a PCI-E 3.0 slot. So, I bought a second mobo as a “hobby” project computer. This new mobo is intended for “workstations” and has 4 PCI-E 3.0 slots. That gives me enough room for a good video card and 2 Phi’s. This new Workstation PC has an Intel Xeon E5-2620V3 (2.4GHz 6-core, 12-Thread) processor, 2 Intel Xeon Phi coprocessors 31S1P (57 cores with 4 threads per core at 1.1GHz per thread for a total of 456threads) and 48Gb DDR4 Ram. 

The Intel Phi coprocessors work well with the Intel MKL. The same MKL RRO uses. Which means, if I use RRO with my Phi’s, after they are properly set up, I should be good to go….. Intel doesn’t make this easy. (I cobbled together the information from 6-7 different sources. Each source had a small piece of the puzzle.) The Phi’s are definitely not “Plug and Play”. I used MPSS version 3.4 for Windows 7. I downloaded the drivers from here:

I had to go into the command prompt and follow some of the directions available here. (Helpful hint, use micinfo to check your Phi coprocessors after step 9 in section 2.2.3 “Updating the Flash”.)

After many emails to Revolution Analytics staff, I was able to get the Phi’s up and running! Now, my Phi’s work harmoniously with MKL.   Most of the information I needed is available here.

In the paper and website above, I needed to create some environmental variables. The generic ones are:


Since I have 2 Phi coprocessors, my is 0, 1.(At least this is the list that worked.) I set MKL_MIC_MAX_MEMORY to 8GB. ( I have the ram to do it, so why not.) MIC_OMP_NUM_THREADS = 456.

Below, is a sample program I used to benchmark Maple 2015, R and RRO on my Gamer computer and my Workstation. Between the time I started this project and now, Maple up graded their program to Maple 2015. The big breakthrough is that Maple now does parallel processing. So, I ran the program below using Maple 2015 to see how it compares to R and RRO. (I uninstalled Maple 18 in anger.)  I also ran the same program on my Workstation PC to see how well the Phi coprocessors worked. Once I had everything enabled, I didn’t want to disable anything. So, I just have the one, VERY IMPRESSIVE, time for my workstation.

for ( i in 1:a){c[i,i] = 1-1.75*b}
for ( i in 1:a){c[i-1,i] = b}
for ( i in 2:a){c[i,i-1] = 0.75*b}

Power of parallel revised

By using RRO instead of R, I got my results 3.12 hours faster. Considering the fact that I have several dozen more calcs like this one, saving 3hrs per calc is wonderful 😉 By using RRO instead of Maple 2015, I saved about 41 mins. By using RRO with the Phi’s on my Workstation PC, I was done in 187.3s. I saved an additional 39 mins over my Gamer Computer! When I ran my full program, it took under an hour. Compared to the days/weeks for my smaller calculations, an hour is awesome!

An interesting note on the InteL MKL. It only uses cores, not threads, on the main processor. I’m not sure how it handles the threads on the Phi coprocessors. So, my Intel Xeon processor only had 50% usage of the main processor.

Now, your big question is, “Why should I care?” I ran a 10,000×10,000 matrix and raised it to unbelievably high values. I used a brute force method to do it. Suppose that you are doing “Big Data” analysis and you have 30 columns by 2,000,000 rows. If you run a linear regression on that data, your software will use a Pseudoinverse to calculate the coefficients of your regression. A part of the pseudoinverse involves multiplying your 30×2,000,000 matrix by a 2,000,000×30 matrix and it’s all parallelizable! Squaring my matrix uses about 1.00×1012 operations (assuming I have my Big O calculation correct.) The pseudo inverse of your matrix uses a mere 1.80×109 operations.  

Some of my friends who do these sort of “Big Data” calculations using the series method built into basic R or SAS tell me that they take hours(1-2) to complete. With my workstation, I have the computational power of 17 servers that use my same Xeon processor. That calculation would take me way less than a minute.

Behold, the power of parallel processing!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)