Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Norman Matloff

You've heard that graphics processing units — GPUs — can bring big increases in computational speed. While GPUs cannot speed up work in every application, the fact is that in many cases it can indeed provide very rapid computation. In this tutorial, we'll see how this is done, both in passive ways (you write only R), and in more direct ways, where you write C/C++ code and interface it to R.

## Using GPUs While Sticking to R

A number of R packages have been written to allow you to access some GPU operations while staying in the convenience and comfort of your own R living room, such as gputools, gmatrix and Rth. (The first two are available on CRAN, while the third is currently available from this GitHub site.) They perform only specialized operations, mainly matrix manipulation and sorting. Rth, written by Drew Schmidt and me, is an R interface to Thrust, a collection of templates to generate CUDA code, or optionally OpenMP code. In other words, the same code is usable on two different kinds of parallel platforms, GPU and multicore. As a quick example, I computed distances between rows of a matrix, first using R's built-in dist() function, and then using gpuDist() from the gputools package. For a 1000×1000 matrix, the GPU version took 0.258 seconds, compared to 3.671 for plain R. For 2500×2500, the results were even more dramatic, 3.220 seconds versus 103.219! Note, though, that I could not directly try 5000×5000, as I exceeded the memory size of my GPU; to handle the larger matrix, I'd need to do the work in stages, thus losing some of the speed advantage. GPUs are not panaceas.

## Writing CUDA Code

Here we'll have an example of how to write CUDA yourself. While giving a fully-detailed account would take more space than is appropriate for a blog, this example should give you a good overview of what is involved. For further details, see the partial draft of my forthcoming book, Parallel Computation for Data Science. We will again use the graph application in my OpenMP tutorial blog post, in this case measuring commonalities among inbound links rather than outlinks. Consider a large collection of Web sites. A 1 in row i, column j of the matrix means that site i links to site j. In comparing, for instance, sites 3 and 8, we compare columns 3 and 8, counting the number of rows with 1s in both columns. We do this for all possible pairs of sites, and compute the average inlink commonality. The code, downloadable for your convenience at this site, is as follows:


// MutIn.cu:  finds mean number of mutual
// inlinks, among all pairs of Web sites
// in our set; in checking (i,j) pairs,
// thread k will handle all i such that
// i mod totth = k, where totth is the

#include<cuda.h>
#include<stdlib.h>
#include<Rinternals.h>

// treat it as C code
extern "C" {
SEXP gpumutin(SEXP a, SEXP nb);
}

// GPU block size is hard coded as 192
#define BLOCKSIZE 192

// kernel:  processes all pairs assigned to
__global__ void procpairs(int *m,
int *tot, int n)
{
// total number of threads =
// number of blocks * block size
int totth = gridDim.x * BLOCKSIZE,
me = blockIdx.x * blockDim.x +
int i,j,k,sum = 0;
for (i = me; i < n; i += totth) {
for (j = i+1; j < n; j++) {
for (k = 0; k < n; k++)
sum += m[n*k+i] * m[n*k+j];
}
}
// atomically add my sum to the
// overall total tot
}

// find the mean number of mutual
// inlinks, over all pairs of columns

// SEXP is an internal R entity
// for storing an object; the
// INTEGER() function allows
// C/C++ code to access the data
// portions of the object

SEXP gpumutin(SEXP adj, SEXP nb) {
// how many GPU blocks?
int nblk = INTEGER(nb)[0];
// find n, number of rows
// the CPU is the "host,"
// the GPU the "device"
// get C-level view of adj on host
// set up a connection (ppinter
// to a device address to the
// space we will make for the
// device matrix
int *dm; // device matrix
// need space for totals in host, device
int htot, // host grand total
*dtot; // device grand total
int msize = n * n * sizeof(int);
// allocate space for device matrix
cudaMalloc((void **)&dm,msize);
// copy host matrix to device matrix
cudaMemcpy(dm,hm,msize,cudaMemcpyHostToDevice);
htot = 0;
// set up device total and initialize it
cudaMalloc((void **)&dtot,sizeof(int));
cudaMemcpy(dtot,&htot,sizeof(int),
cudaMemcpyHostToDevice);
// OK, ready to launch kernel,
// so configure grid
dim3 dimGrid(nblk,1);
dim3 dimBlock(BLOCKSIZE,1,1);
// launch the kernel
procpairs<<<dimGrid,
dimBlock>>>(dm,dtot,n);
// wait for kernel to finish
// copy total from device to host
cudaMemcpy(&htot,dtot,sizeof(int),
cudaMemcpyDeviceToHost);
// clean up at device
cudaFree(dm);
cudaFree(dtot);
// done; return the mean as an R object
return
ScalarReal(((double) htot) / (n*(n-1)/2));
}


This looks complicated, but there really are only a few key points:

• Much of the code, as seen in the comments, consists of allocating memory for our data on the GPU, and moving data back and forth between the host and device memories, such as

// allocate space for device matrix
cudaMalloc((void **)&dm,msize);
// copy host matrix to device matrix
cudaMemcpy(dm,hm,msize,
cudaMemcpyHostToDevice);

• Each SM is broken into blocks, of size specified by the programmer. All the threads within a block execute in lockstep. The programmer also specifies the grid size, which is the number of blocks.
• The programmer writes a kernel, designated by __global__. Each thread runs the kernel. The thread will determine its ID number from its block number, and thread number within block:
   // total number of threads =
// number of blocks * block size
int totth = gridDim.x * BLOCKSIZE,
me = blockIdx.x * blockDim.x +

• The kernel call, or launch, is in our case here the call to procpairs(). We not only supply arguments to the function, but also this is where we state the block and grid sizes. Note by the way that we have left the number of blocks up to the user, in the variable nblk:
    // OK, ready to launch kernel,
// so configure grid
dim3 dimGrid(nblk,1);
dim3 dimBlock(BLOCKSIZE,1,1);
// launch the kernel
procpairs<<<dimGrid,
dimBlock>>>(dm,dtot,n);

• As usual with shared-memory programming, one must avoid race conditions, where several threads may step on each other while writing to a shared variable. Suppose for instance that current x is 12, and threads 3 and 8 both want to add 1 to it. The result should be that x is 14, but it may be 13 if both do this at about the same time. (They both read x as 12, both add 1, and both write 13.) GPUs don't handle this as well as multicore, but here we use CUDA's atomicAdd() to safely update our overall total. Warning: it too is slow.

### Compiling and Running

Not surprisingly in view of the facts that we are running on two different hardware platforms at once, in a nonstandard extension of standard C/C++, compiling and running CUDA code requires some attention to detail, e.g. in specifying locations of libraries. For example, on many machines, including mine, CUDA resides in the directory /usr/local/cuda. I made sure to set the environment variables for my execution and library-load search paths:

export CUDA_HOME=/usr/local/cuda/
export path=( $CUDA_HOME/bin$path )
export LD_LIBRARY_PATH=\$CUDA_HOME/lib64


CUDA has its own compiler, nvcc, which I invoked on my source file MutIn.cu as follows:

nvcc -g -Xcompiler -fPIC -c MutIn.cu \
-arch=sm_11 -I/usr/include/R


I instructed nvcc to pass on to gcc a request for position-independent code, specified that I wanted CUDA machine code for hardware version 1.1 or greater, and included the system files from R. Also, I specified the -c option, meaning that I wanted only to generate machine code, MutIn.o, to be linked in later, not a full standalone executable. To then link in the R libraries and the CUDA runtime library, I ran

R CMD SHLIB MutIn.o -lcudart \
-L/usr/local/cuda/lib64


This produced a shared-object file MutIn.so, loadable from R! Here is a sample run:

> dyn.load("MutIn.so")
> n <- 2500
> m <- matrix(as.integer(sample(0:1,n^2,
replace=T)),ncol=n)
> .Call("gpumutin",m,as.integer(4))


I got run times of about 2.1 seconds for the CUDA code, versus about 31.6 for straight R code. For the latter, I reused AdjSnow.R from my OpenMP tutorial with one worker. This was a bit unfair for various reasons, but on the other hand, the CUDA code here is not optimized either. But no matter how you slice it, the GPU is definitely a lot faster for this application.

## CUDA Libraries

In many cases, such as our example above, one can attain very nice speedup with CUDA on one's first try, with no attempt at optimization. But if you want real speed, CUDA is one of the most tweakable platforms you'll ever encounter, due to the fact that you know an awful lot about what is going on inside. You know, for instance, the direct implication of having if/else code and of ordered array access. One can do this to a large extent with multicore too, but in that settings it is more on the level of guesswork. We can guess cache behavior in multicore (far more complex than in the uniprocessor case), but the GPUs have what amounts to a programmer-managed cache — YOU are in control. And the issue of block size and number of blocks, which can be key to good speed in CUDA, has no counterpart in multicore. An excellent example is the painstaking optimization of a very simple calculation, a pixel histogram. Paying careful attention to memory access patterns, the engineer was able to make very impressive speed gains. But there are two problems with this. First, there is the machine efficiency vs. human efficiency (your coding time) tradeoff. Second, the NVIDIA GPU series is constantly evolving; code that works on today's chips will still work on the future ones, but it won't be optimal. (In the code example and hardware descriptions above, we are assuming rather earlier chips in the model series.) One partial solution is to use CUDA libraries whenever possible. These exist for matrix ops (CUBLAS), Fast Fourier Transform (CUFFT), sorting/prefix scan (CUDPP) and so on. This code is already highly tweaked, and tends to keep up with the evolution of the chipset.

## Conclusions

For certain applications, GPUs can deliver excellent speedup. In some cases, this can be obtained with rather little effort on the programmer's part, by making use of libraries, including some offering direct R interfaces. Analysts with major computational needs will find that GPU will make an excellent addition to their toolkits.