Parallel Programming with GPUs and R

by Norman Matloff

You've heard that graphics processing units (GPUs) can bring big increases in computational speed. GPUs cannot speed up every application, but in many cases they can indeed provide very rapid computation. In this tutorial, we'll see how this is done, both in passive ways (you write only R) and in more direct ways, where you write C/C++ code and interface it to R.

What Are GPUs?

A GPU is basically just a graphics card. Any PC or laptop has one (or at least a GPU chipset), but the more expensive ones, favored by gamers, are quite fast. Because the graphical operations performed by a GPU involve vectors and the like, at some point physicists started using GPUs for their own non-graphics computations. To do so, they had to “trick” the GPUs by writing what appeared to be graphics code. Such programming was, to say the least, unwieldy, so NVIDIA took the risky, “build it and they will come” step of designing its GPUs to be generally programmable. The goal was to have a structure that is fast for game programming but which is programmable in “normal” C/C++ code. NVIDIA's bet paid off, and today the most commonly used GPUs for general-purpose work are made by that firm. They are programmed in a language called CUDA, an extension of C/C++. Check this list to see if you have a CUDA-compatible graphics card. We will focus on NVIDIA here, and the term GPU will mean those devices.

GPUs are shared-memory, threaded devices. (If you haven't seen these terms before, please read my recent OpenMP tutorial blog post first, at least the beginning portions.) Note, by the way, that the GPU card has its own memory, separate from that accessed by your CPU. A GPU might be called a “multi-multicore” machine. It consists of a number of streaming multiprocessors (SMs), each of which is a multicore engine. Thus the cores share memory, and many threads run simultaneously. A typical GPU contains many more cores than the 2, 4 or even 8 one commonly sees on PCs and laptops.

There are some differences from standard multicore, though, notably that threads run in groups called blocks. Within a block, the threads run in lockstep, all executing the same machine language instruction (or none at all) at the same time. This uniformity makes for very fast hardware if one's application can exploit it.

Not all applications will fare well with GPUs. Consider what happens with an if type of statement in C/C++. The threads that satisfy the “if” condition will all execute the same machine instructions, but the remaining threads will be idle, a situation termed thread divergence. We thus lose parallelism, and speed is reduced. So applications that have a regular pattern, such as matrix operations, can really take advantage of GPUs, while apps that need a lot of if/else decision making may not be speedy.

Another major way in which GPUs differ from ordinary multicore systems is that synchronization between threads is less flexible. There is nothing in GPUs quite like what the OpenMP barrier construct gives us. Threads within a block are synchronized by definition, but not across blocks. If we need a “wait here until all threads get here” type of operation, we must resort to heavy-handed measures, such as returning control to the CPU of the host machine, which is very slow. So the applications most suitable for GPU use are those that don't need much in the way of barriers and the like.

In any shared-memory system, a major performance issue is memory access time. With GPUs, the problem is writ large, since there is memory on two levels, host and device, each of which has long latency times. On the other hand, this is ameliorated for device memory, since an efficient embedded-in-hardware operating system engages in latency hiding: when a thread block encounters an instruction needing access to device memory, the OS suspends that block and tries to find another to run that doesn't access memory, giving the first block time to do the memory access while useful work is done by the second block. Accordingly, it is good practice to set up a large number of threads, each with a small amount of work to do (very different from the multicore case). The memory access itself is most efficient if the threads in a block request consecutive memory locations; good CUDA coders strive to make this happen.

Using GPUs While Sticking to R

A number of R packages have been written to allow you to access some GPU operations while staying in the convenience and comfort of your own R living room, such as gputools, gmatrix and Rth. (The first two are available on CRAN, while the third is currently available from this GitHub site.) They perform only specialized operations, mainly matrix manipulation and sorting. Rth, written by Drew Schmidt and me, is an R interface to Thrust, a collection of templates to generate CUDA code, or optionally OpenMP code. In other words, the same code is usable on two different kinds of parallel platforms, GPU and multicore.

As a quick example, I computed distances between rows of a matrix, first using R's built-in dist() function, and then using gpuDist() from the gputools package. For a 1000×1000 matrix, the GPU version took 0.258 seconds, compared to 3.671 for plain R. For 2500×2500, the results were even more dramatic: 3.220 seconds versus 103.219! Note, though, that I could not directly try 5000×5000, as I exceeded the memory size of my GPU; to handle the larger matrix, I'd need to do the work in stages, thus losing some of the speed advantage. GPUs are not panaceas.
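
To give a concrete idea of what this looks like in practice, here is a minimal sketch of such a comparison. It is not the exact script behind the timings above; it assumes you have gputools installed and a CUDA-capable card, and the numbers you get will of course depend on your hardware:

library(gputools)   # assumed installed, with a CUDA-capable card

n <- 1000
m <- matrix(runif(n^2), nrow=n)
# plain R: all pairwise distances between rows of m
system.time(dcpu <- dist(m))
# the same computation on the GPU, via gputools
system.time(dgpu <- gpuDist(m))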

Writing CUDA Code

Here we'll see an example of how to write CUDA code yourself. While a fully detailed account would take more space than is appropriate for a blog, this example should give you a good overview of what is involved. For further details, see the partial draft of my forthcoming book, Parallel Computation for Data Science. We will again use the graph application from my OpenMP tutorial blog post, in this case measuring commonalities among inbound links rather than outlinks. Consider a large collection of Web sites, represented by an adjacency matrix: a 1 in row i, column j means that site i links to site j. In comparing, for instance, sites 3 and 8, we compare columns 3 and 8, counting the number of rows with 1s in both columns. We do this for all possible pairs of sites, and compute the average inlink commonality. The code, downloadable for your convenience at this site, is as follows:

 
// MutIn.cu:  finds mean number of mutual 
// inlinks, among all pairs of Web sites 
// in our set; in checking (i,j) pairs, 
// thread k will handle all i such that 
// i mod totth = k, where totth is the
//  number of threads

#include<cuda.h>
#include<stdlib.h>
#include<Rinternals.h>

// treat it as C code
extern "C" {
    SEXP gpumutin(SEXP a, SEXP nb);
}

// GPU block size is hard coded as 192
#define BLOCKSIZE 192

// kernel:  processes all pairs assigned to 
// a given thread
__global__ void procpairs(int *m, 
   int *tot, int n)
{
   // total number of threads = 
   // number of blocks * block size
   int totth = gridDim.x * BLOCKSIZE,
       // my thread number
       me = blockIdx.x * blockDim.x + 
          threadIdx.x;
   int i,j,k,sum = 0;
   for (i = me; i < n; i += totth) {  
      for (j = i+1; j < n; j++) {  
         for (k = 0; k < n; k++)
            sum += m[n*k+i] * m[n*k+j];
      }
   }
   // atomically add my sum to the 
   // overall total tot
   atomicAdd(tot,sum);
}

// find the mean number of mutual 
// inlinks, over all pairs of columns 
// of the graph adjacency matrix adj

// SEXP is an internal R entity 
// for storing an object; the 
// INTEGER() function allows 
// C/C++ code to access the data 
// portions of the object

SEXP gpumutin(SEXP adj, SEXP nb) {
    // how many GPU blocks?
    int nblk = INTEGER(nb)[0];
    // find n, number of rows 
    // and cols in adj
    SEXP adjdim = 
       getAttrib(adj,R_DimSymbol);
    int n = INTEGER(adjdim)[0];
    // the CPU is the "host," 
    // the GPU the "device"
    // get C-level view of adj on host
    int *hm = INTEGER(adj);
    // set up a pointer (a device 
    // address) to the space we 
    // will make for the 
    // device matrix
    int *dm; // device matrix
    // need space for totals in host, device
    int htot, // host grand total
        *dtot; // device grand total
    int msize = n * n * sizeof(int);
    // allocate space for device matrix
    cudaMalloc((void **)&dm,msize);
    // copy host matrix to device matrix
    cudaMemcpy(dm,hm,msize,cudaMemcpyHostToDevice);
    htot = 0;
    // set up device total and initialize it
    cudaMalloc((void **)&dtot,sizeof(int));
    cudaMemcpy(dtot,&htot,sizeof(int),
       cudaMemcpyHostToDevice);
    // OK, ready to launch kernel, 
    // so configure grid
    dim3 dimGrid(nblk,1);
    dim3 dimBlock(BLOCKSIZE,1,1);
    // launch the kernel
    procpairs<<<dimGrid,
       dimBlock>>>(dm,dtot,n);
    // wait for kernel to finish
    cudaThreadSynchronize();
    // copy total from device to host
    cudaMemcpy(&htot,dtot,sizeof(int),
       cudaMemcpyDeviceToHost);
    // clean up at device
    cudaFree(dm);
    cudaFree(dtot);
    // done; return the mean as an R object
    return 
       ScalarReal(((double) htot) / (n*(n-1)/2));
}

This looks complicated, but there really are only a few key points:

    • Much of the code, as seen in the comments, consists of allocating memory for our data on the GPU, and moving data back and forth between the host and device memories, such as
    
    // allocate space for device matrix
    cudaMalloc((void **)&dm,msize);
    // copy host matrix to device matrix
    cudaMemcpy(dm,hm,msize,
       cudaMemcpyHostToDevice);
    • The threads are grouped into blocks, of a size specified by the programmer, and each block runs on a single SM. All the threads within a block execute in lockstep. The programmer also specifies the grid size, which is the number of blocks.
    • The programmer writes a kernel, designated by __global__. Each thread runs the kernel, determining its ID number from its block number and its thread number within the block:
   // total number of threads =
   // number of blocks * block size
   int totth = gridDim.x * BLOCKSIZE,
       // my thread number
       me = blockIdx.x * blockDim.x +
          threadIdx.x;
    • The kernel call, or launch, is in our case the call to procpairs(). Here we not only supply arguments to the function but also state the block and grid sizes. Note, by the way, that we have left the number of blocks up to the user, in the variable nblk:
    // OK, ready to launch kernel,
    // so configure grid
    dim3 dimGrid(nblk,1);
    dim3 dimBlock(BLOCKSIZE,1,1);
    // launch the kernel
    procpairs<<<dimGrid,
       dimBlock>>>(dm,dtot,n);
    • As usual with shared-memory programming, one must avoid race conditions, in which several threads step on each other while writing to a shared variable. Suppose, for instance, that x is currently 12, and threads 3 and 8 both want to add 1 to it. The result should be that x becomes 14, but it may end up as 13 if both do this at about the same time. (They both read x as 12, both add 1, and both write 13.) GPUs don't handle this as well as multicore systems do, but here we use CUDA's atomicAdd() to safely update our overall total. Warning: it, too, is slow.

Compiling and Running

Not surprisingly, given that we are running on two different hardware platforms at once, in a nonstandard extension of C/C++, compiling and running CUDA code requires some attention to detail, e.g. in specifying the locations of libraries. For example, on many machines, including mine, CUDA resides in the directory /usr/local/cuda. I made sure to set the environment variables for my execution and library-load search paths:

export CUDA_HOME=/usr/local/cuda/
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64

CUDA has its own compiler, nvcc, which I invoked on my source file MutIn.cu as follows:

nvcc -g -Xcompiler -fPIC -c MutIn.cu \
   -arch=sm_11 -I/usr/include/R

Here I instructed nvcc to pass on to gcc a request for position-independent code, specified that I wanted CUDA machine code for compute capability 1.1 or greater, and pointed the compiler at R's header files. I also specified the -c option, meaning that I wanted only to generate object code, MutIn.o, to be linked in later, not a full standalone executable. To then link in the R libraries and the CUDA runtime library, I ran

R CMD SHLIB MutIn.o -lcudart \
   -L/usr/local/cuda/lib64

This produced a shared-object file MutIn.so, loadable from R! Here is a sample run:

> dyn.load("MutIn.so")
> n <- 2500
> m <- matrix(as.integer(sample(0:1,n^2,
   replace=T)),ncol=n)
> .Call("gpumutin",m,as.integer(4))

I got run times of about 2.1 seconds for the CUDA code, versus about 31.6 for straight R code. For the latter, I reused AdjSnow.R from my OpenMP tutorial with one worker. This was a bit unfair for various reasons, but on the other hand, the CUDA code here is not optimized either. But no matter how you slice it, the GPU is definitely a lot faster for this application.
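
For reference, a simple vectorized pure-R version of the mean inlink commonality can be written as follows. This is just a quick sketch, not the AdjSnow.R code used for the timing above; it relies on the fact that, for a 0/1 matrix, crossprod(m)[i,j] counts the rows having 1s in both columns i and j:

# mean number of mutual inlinks over all pairs of columns of m;
# a quick pure-R sketch, not tuned for speed
mutinR <- function(m) {
   n <- ncol(m)
   cp <- crossprod(m)  # t(m) %*% m; entry (i,j) counts shared inlinks
   sum(cp[upper.tri(cp)]) / (n*(n-1)/2)
}
mutinR(m)  # m as generated in the sample run above

Being built on crossprod(), this version will likely run much faster than an explicit-loop approach in R, so timings from it won't match the 31.6-second figure quoted above.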

CUDA Libraries

In many cases, such as our example above, one can attain very nice speedup with CUDA on one's first try, with no attempt at optimization. But if you want real speed, CUDA is one of the most tweakable platforms you'll ever encounter, because you know an awful lot about what is going on inside. You know, for instance, the direct implications of if/else code and of ordered array access. One can do this to a large extent with multicore too, but in that setting it is more a matter of guesswork. We can guess cache behavior in multicore (far more complex than in the uniprocessor case), but GPUs have what amounts to a programmer-managed cache: YOU are in control. And the issue of block size and number of blocks, which can be key to good speed in CUDA, has no counterpart in multicore.

An excellent example is the painstaking optimization of a very simple calculation, a pixel histogram. By paying careful attention to memory access patterns, the engineer was able to make very impressive speed gains. But there are two problems with this. First, there is the tradeoff between machine efficiency and human efficiency (your coding time). Second, the NVIDIA GPU series is constantly evolving; code that works on today's chips will still work on future ones, but it won't be optimal. (In the code example and hardware descriptions above, we are assuming rather early chips in the model series.)

One partial solution is to use CUDA libraries whenever possible. These exist for matrix operations (CUBLAS), the Fast Fourier Transform (CUFFT), sorting/prefix scan (CUDPP) and so on. This code is already highly tweaked, and it tends to keep up with the evolution of the chipset.
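
From R, the easiest way to benefit from these libraries is again through the packages mentioned earlier. For instance, gputools offers gpuMatMult(), which, as I understand it, calls CUBLAS under the hood. Here is a rough sketch of a comparison with R's built-in %*% (again assuming gputools and a CUDA-capable card, with timings that will vary by hardware):

library(gputools)

n <- 2000
a <- matrix(rnorm(n^2), nrow=n)
b <- matrix(rnorm(n^2), nrow=n)
system.time(c1 <- a %*% b)           # CPU: R's built-in matrix multiply
system.time(c2 <- gpuMatMult(a, b))  # GPU: gputools' wrapper
max(abs(c1 - c2))                    # any difference should be roundoff-level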

Conclusions

For certain applications, GPUs can deliver excellent speedup. In some cases, this can be obtained with rather little effort on the programmer's part, by making use of libraries, including some offering direct R interfaces. Analysts with major computational needs will find that GPUs make an excellent addition to their toolkits.
