R has featured packages to support GPU programming for over five years.
Beginning with the Gputools package, developers continue to introduce new and
more sophisticated tools to take advantage of these powerful coprocessors. The
gmatrix package is a recent continuation of this trend which
offers some significant new features.
GPUs, or graphical processing units, are the numerical engines powering the
visual display of modern computers. They are a type of coprocessor, or
computational unit distinct from the central processing unit, or CPU. When
first introduced GPUs were discrete components, or cards. While powerful and
memory-rich discrete cards are still popular, many processors now ship with
GPUs contained in the same physical hardware (“die”) as the CPU. Whether
discrete or resident on-die, though, the GPU offers high numerical performance,
owing to the parallel computation at which it excels.
GPUs perform the computation necessary, for example, to render high quality
three-dimensional moving images. While their success has largely been driven by
the gaming industry, these coprocessors have come into their own as
indispensible hardware for parallel computation. As numerical modellers
discovered over a decade ago, the GPU’s vector processing capabilities can
offer significant speedups for a wide variety of computational problems.
Programming the GPU can require specialized skills, but tools continue to be
introduced to make their use more appealing. Although there are numer- ous GPU
vendors and licensors – Nvidia, AMD, Qualcomm and Imagination Technologies, to
name a few – there are two main families of programming tools: CUDA and OpenCL.
CUDA is Nvidia’s proprietary software, tailored for its own products, while
OpenCL is an open standard, intended to be vendor- agnostic. It is not
necessary to be familiar with either family, or with parallel programming, to
exploit the GPU’s power.
Coprocessors and R
Successive releases of R, and packages, have offerred increasingly improved
support for parallel execution, both at the multicore and multiprocessor
levels. Coprocessors, however, continue to represent a specialized niche. This
may be due in part not only to their relatively arcane programming models, but
also to the difficulty in knowing just when to use them. Moreover, while the
advantages of high-memory, discrete GPUs are relatively well understood, on-die
GPUs continue to evolve rapidly, and their ultimate utility as general-purpose
coprocessors remains to be seen.
Some GPU-based R packages offer tuned versions of well-known operations, for
example lm and matrix multiplication. The coprocessor implementations may
merely be wrappers for OpenCL or CUDA library calls, they may be whole- cloth
implementations, or they may be a combination of the two. In any case, the user
merely invokes a command, and some or all of the computation is performed on
the GPU, with the result sent back to the CPU.
While the GPU can accelerate such operations ten- or even hundred-fold, the
actual advantage varies with characteristics of both the operands involved and
the underlying hardware. There is no way to know in advance where the tradeoff
lies, i.e., whether execution on the CPU or the GPU will be the better choice.
Even when the tradeoff can be roughly estimated, there is no general dispatch
mechanism to assign the operation to one piece of hardware versus another. A
user seeking the best performance must typically make the choice explicitly,
based on empirical evidence. In the case of numerical linear algebra, at least,
the HiPLARM packages offers a solution, employing autotuned
software capable of characterizing these tradeoffs on the resident hardware and
dispatching the appropriate version of the command.
There are also packages which feature specialized solutions or workflows
optimized for execution on the GPU. These include, for example,
CudaBayesReg and WideLM. These packages were
written to accelerate specific types of problems. Dispatch is less of an issue
for such packages, as the user is essentially buying in to the higher-level
functionality of the software.
Two packages, RCuda and OpenCL, offer the
user API-oriented interfaces to low-level GPU programming. As their respective
names imply, these packages expose features of CUDA and OpenCL directly to the
user. Thus developers of low-level GPU code can now work within the R
The Rth package, while vendor-specific, has a different
goal. Here the user is provided with wrappers to useful functions implemented
with Thrust, a computational package built on top of CUDA. As Thrust remains
under active development, the Rth package is able to leverage continuing
improvements in high-performance GPU implementations.
The cost of communication
For discrete coprocessors at least, communication speed is the key overhead
to performance. This is because the path travelled by the data when moving to
or from the coprocessor is many times longer than when moving to or from the
processor. As just suggested, a well-trained dispatch mechanism can compensate
for communication costs by invoking the appropriate version (i.e., CPU or GPU),
at least for certain well-understood computations. A major limitation of
current dispatch-based approaches, though, is the assumption that the ultimate
destination of the computation is the CPU. This may not always be the case.
Iteratively-reweighted least squares is an accessible example of why this
assumption need not hold. Each iteration of the algorithm involves operations
which can be performed on either of the CPU or the GPU. While these operations
might individually be faster on one or the other type of hardware, it
is more likely that their aggregate will execute faster on the GPU, as
communication costs need only be paid for the first operation, following which
all intermediate results reside on the GPU. Clearly, if the user is given the
ability to specify that a computation take place on the GPU, and that the
result need not be sent back, the potential is enhanced for accelerating more
The gmatrix package
The gmatrix package, by Nathan Morris, offers a
straightforward solution. By leveraging R’s type system, the package offers new
classes, the gvector and eponymous gmatrix, for which GPU
storage and execution is the default. As an S4 class, for example, the
gmatrix is similar in spirit to Matrix, a derived class with
specialized features. To perform a supported operation, such as transpose, on
the GPU there is no need to invoke a function with a contrived name, such as
gpuTranspose(). One instead just uses t() and, assuming that the object has
type gmatrix, both the tranpose operation and its result reside on the
This is more than just an exercise in namespace binding. The type system is
doing the heavy lifting here, dispatching commands based on class. It remains
the responsibility of the user to determine where the work is best performed.
Creating “g-variant” objects is an elegant way to go about it. A large
collection of matrix and vector operations are supplied, similar to what is
available from other packages, along with additional features such as GPU-based
generation of common distributions. Some special syntactic sugar is needed when
constructing new objects directly on the GPU, but transfer between host and GPU
are as straightforward as invoking commands h() and g(). Once
objects are appropriately typed, again, S4 does most of the work.
gmatrix is built atop CUDA and much of the GPU
functionality derives from a proprietary BLAS implementation, cuBLAS, as well
as from Thrust. Although the low-level back end uses a proprietary
implementation, however, the overall programming model used by
gmatrix is essentially vendor-agnostic. That is, although the
user must contend with what operations to call and whether the data resides on
the host or the GPU, the under-the-hood implementation is completely hidden. In
particular, the back end could be re-engineered for some other API, leaving the
look-and-feel of the package unchanged. This is one reason that
gmatrix could have some staying power.
Other authors have noted the possibility of using the type system to
dispatch hardware-variant commands to achieve high performance. Christopher
Brown reported early work using OpenCL at Nvidia’s annual GTC in 2010, for
example. The pdbR series of packages appear to use similar
ideas for large clusters. gmatrix, though, is the first
fully-implemented package to exploit this idea successfully for the GPU.
In several presentations of the parmatpows matrix
exponentiation package, Norm Matloff discusses speedups offered by using
gmatrix to provide core linear algebra functionality on the
GPU. This includes his 2014 talk at UseR!
There are a variety of packages available to the R programmer to facilitate
high- performance development on the GPU. Their functionality ranges from
low-level interface to highly tuned command-level and workflow-level
implementations. The gmatrix package allows the user to guide
optimization of an algorithm by specifying where the data is to reside at any
step within a computation. Because of its reliance on object typing, as opposed
to syntax, it does so at minimal imposition to the user. The package is a
welcome addition to the toolset, and should be seriously considered by anyone
wishing to accelerate their R work using a GPU.