The Numerical Template Toolbox (NT2)
collection of header-only C++ libraries that make it
possible to explicitly request the use of SIMD instructions
when possible, while falling back to regular scalar
operations when not. NT2 itself is powered
by Boost, alongside two proposed
Boost libraries –
Boost.Dispatch, which provides a
mechanism for efficient tag-based dispatch for functions,
Boost.SIMD, which provides a framework for the
implementation of algorithms that take advantage of SIMD
and exposes these libraries for use with
If you haven’t already, read the
RcppNT2 introduction article
to get acquainted with the RcppNT2 package.
the (infamous) buzzword that describes a class of
problems that can be solved by splitting an an algorithm
into a map (transform) step, and a reduction step. Although
this scheme is typically adopted to help solve problems
‘at scale’ (e.g., with a large number of communicating
machines), it is also a useful abstraction for many
problems in the SIMD universe.
Take, for example, the dot product. This can be expressed
in R code simply as:
sum(lhs * rhs)
Here, we ‘map’ our vectors by multiplying them together
element-wise, and ‘reduce’ the result through summation.
Of course, behind the scenes, R is doing a bit more than
it has to – it’s computing a new vector,
lhs * rhs,
which is of the same length as
lhs, and then collapsing
(reducing) that vector by adding each element up. It
would be great if we could skip that large temporary
vector allocation. RcppNT2 provides a function,
simdMapReduce(), that makes expressing these kinds of
problems very easy.
To make use of
simdMapReduce(), you need to write a
class that provides a number of templated methods:
U init()— returns the initial (scalar) data state.
T map(const T&... ts)— transforms the values
T combine(const T& lhs, const T& rhs)— describes how results should be combined
U reduce(const T& t)— describes how a SIMD pack should be reduced
You’ll notice that we play a little fast-and-loose with
the terms, but it should still be relatively clear what
each method accomplishes. With this infrastructure, the
dot product could be implemented like so:
If you can ignore the C++ templates, it should hopefully
be fairly clear what’s going on here. We transform
elements by multiplying them together, and we combine +
reduce by adding them up. (Unfortunately, although the
reduce() functions are effectively
doing the same thing, they need to be expressed separately,
reduce() function is effectively our bridge from
SIMD land to scalar land).
Now, let’s show how our map-reducer can be called.
Let’s also export a version that accepts IntegerVector,
just to show that our class is generic enough to accept
other integral types as well.
And let’s execute it from R, just to convince ourselves that it works.
Great! Of course, a large number of problems can be
expressed with a ‘plus’, or ‘sum’ reduction, so RcppNT2
also provides a helper for that, so that you only need to
implement the ‘map’ step. We can do this by writing a class
that inherits from the
And, let’s convince ourselves it works:
And let’s use a quick microbenchmark to see if we’ve truly
gained anything here:
test replications elapsed relative 2 n1 %*% n2 100 0.412 5.282 3 simdDot(n1, n2) 100 0.085 1.090 4 simdDotV2(n1, n2) 100 0.078 1.000 1 sum(n1 * n2) 100 0.307 3.936
test replications elapsed relative 2 i1 %*% i2 100 1.303 42.032 3 simdDotInt(i1, i2) 100 0.031 1.000 1 sum(i1 * i2) 100 0.571 18.419
You might be surprised how profound the speed
improvements accrued by using SIMD instructions are. How
does this happen?
Behind the scenes,
simdMapReduce() is handling a number
of things for us:
Iteration over the sequences used SIMD packs when
possible, and scalars when not,
Optimized SIMD instructions are used to transform and
combine packs of values,
Intermediate results are held in a SIMD register, rather
than materializing a whole vector,
The SIMD register and scalar buffer are not combined
until the very final step.
In the ‘double’ case, we can pack 2 values into a SIMD
pack; in the ‘int’ case, we can pack 4 values (assuming
32bit ‘int’ and 128bit SSE registers, which is the common
case on Intel processors at the time of this post).
Assuming that it takes the number of clock cycles to
execute a SIMD instruction as it does for the scalar
equivalent, this should translate into ~2x and ~4x
speedups – and that’s not even accounting for gains in
efficient register use, cache efficiency, and the ability
to avoid the large temporary allocation! That said, we
are playing it a little fast and loose in the ‘int’ case:
with larger numbers, we could easily overflow; depending
on the type of data expected it may be more appropriate
to accumulate values into a different data type.
In short – if you’re implementing an algorithm, or part
of an algorithm, that can be expressed as:
simdMapReduce() is worth looking at.
This article provides just a taste of how RcppNT2 can be used.
If you’re interested in learning more, please check out the