Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics

In the last article, we showed two separate R implementations of the Collatz conjecture: 'nonvec_collatz' and 'vec_collatz', with the latter being more efficient than the former because of the way it takes advantage of vectorization in R. Let's once again take a look at 'vec_collatz':

```vec_collatz <- function(ints) {
+ niter <- integer(length(ints))
+ while (abs(sum(ints - 1)) > .01) {
+ niter <- niter + ifelse(ints == 1, 0, 1)
+ ints <- ifelse(ints == 1, ints, ifelse(ints %% 2 == 0, ints / 2, 3*ints + 1))
+ }
+
+ niter
+ }```

Today we will learn a thrid, and far more efficient way of implementing the Collatz conjecture. It involves rewriting the function in C++ and using the 'Rcpp' package in R to compile and run the function without ever leaving the R environment.

`library("Rcpp")`

One important difference between R and C++ is that when you write a C++ function, you need to declare your variable types. The C++ code chunk shown below creates a function called 'cpp_collatz' which takes an input of type 'IntegerVector' and whose output is of type 'IntegerVector'. Unlike in R, where explicit loops can slow your code down, loops in C++ are usually very efficient, even though they are tedious to write.

```cpptxt <- '
+ IntegerVector cpp_collatz(IntegerVector ints) {
+ IntegerVector iters(ints.size());
+ for (int i=0; i<ints.size(); i++) {
+ int nn = ints(i);
+ while (nn != 1) {
+ if (nn % 2 == 0) nn /= 2;
+ else nn = 3 * nn + 1;
+ iters(i) += 1;
+ }
+ }
+ return iters;
+ }'

cpp_collatz <- cppFunction(cpptxt)
set.seed(20)
cpp_collatz(sample(20))
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9```
```set.seed(20)
a <- vec_collatz(sample(2000))
set.seed(20)
b <- cpp_collatz(sample(2000))
all.equal(a, b) # should return TRUE
[1] TRUE```

Let's now redo our C++ implementation in a slightly different way. We would rather not have C++ code interspersed with R code: Not only does it make it hard to read, but we also won't be able to take advantage of syntax-hightlighting specific to C++ (among other annoyances). So let's store the C++ code in a file we call 'collatz.cpp' and use the 'sourceCpp' function in R to call it. Here is the content of 'collatz.cpp':

```cat(paste(readLines(file("collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
int ii = 0;
while (nn != 1) {
if (nn % 2 == 0) nn /= 2;
else nn = 3 * nn + 1;
ii += 1;
}
return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
IntegerVector iters(ints.size());
for (int i=0; i<ints.size(); i++) {
iters(i) = collatz(ints(i));
}
return iters;
}

// [[Rcpp::export]]
IntegerVector sug_collatz(IntegerVector ints) {
return sapply(ints, collatz);
}```

There are three things worth mentioning about the above code chunk:

1. We broke up the function in two: one function 'collatz' that operates on a single integer and returns its stopping time, and another function 'cpp_collatz' that runs the 'collatz' function on a vector of integers and returns a vector of stopping times. This makes the code easier to read (no more nested loops) and more modular.
2. There are two extra lines at the top, which tell C++ what namespace to use, as well as an extra line '// [[Rcpp::export]]' before each function definition. The latter is used to let Rcpp know that theses functions should be made available to R once they have been compiled.
3. The third function, called 'sug_collatz', does the same thing as 'cpp_collatz' but uses the Rcpp-sugar extension, which is an attempt to take common R functions and rewrite them in C++ so the C++ code can look like its R counterpart when possible. This can sometimes come at a very small cost but it saves us a lot of hassle, as you can see.

To compile the C++ code, just type the following in R:

sourceCpp("collatz.cpp")

Assuming that we have a C++ compiler istalled, it will take a few seconds to run. We can now type the following into the R console:

```cpp_collatz
function (ints)
.Primitive(".Call")(<pointer: 0x000000006e042180>, ints)

cpp_collatz(1:10) # seems to be working fine.
[1] 0 1 7 2 5 8 16 3 19 6

sug_collatz(1:10) # same as above.
[1] 0 1 7 2 5 8 16 3 19 6```

And now let's get to the reason we bothered with C++ in the first place: efficiency. There are four comparisons we're interested in:

1. the 'cpp_collatz' function
2. the 'sug_collatz' function, which uses 'sapply' in C++ instead of an explicit loop
3. the 'vec_collatz' function we wrote earlier which is our R implementation
4. the 'collatz' function we wrote earlier in C++, which works only on a single integer, but we can vectorize it by wrapping it in 'sapply' in R.

We can think of the last case as being a hybrid approach. One reason to include it is because someone may share a complicated piece of C++ code that works but is not vectorized and vectorizing it may turn out to be a non-trivial task (programmers are supposed to be lazy after all!).

```collatz_benchmark <- function(nums, ...) {
+ require(rbenchmark)
+ benchmark(
+ cpp_collatz(nums), # runs completely in C++
+ sug_collatz(nums), # runs completely in C++
+ vec_collatz(nums), # runs completely in R in a vectorized fashion
+ sapply(nums, collatz), # runs collatz in C++ but sapply in R
+ columns = c("test", "replications", "elapsed", "relative"),
+ ...
+ )
+ }```

Let's compare the two functions for all integers from 1 to 10^4:

```collatz_benchmark(1:10^4, replications = 20)
test replications elapsed relative
1 cpp_collatz(nums) 20 0.03 1.000
4 sapply(nums, collatz) 20 0.53 17.667
2 sug_collatz(nums) 20 0.03 1.000
3 vec_collatz(nums) 20 51.36 1712.000```

And for all integers from 1 to 10^5:

```collatz_benchmark(1:10^5, replications = 20)
test replications elapsed relative
1 cpp_collatz(nums) 20 0.45 1.071
4 sapply(nums, collatz) 20 5.76 13.714
2 sug_collatz(nums) 20 0.42 1.000
3 vec_collatz(nums) 20 753.08 1793.048
```

As we can see 'cpp_collatz' and 'sug_collatz' are almost identical when it comes to efficiency and both are far more efficient than 'vec_collatz', and increasingly so for larger sequences of integers. Also notice the relative efficiency of the hybrid approach compared to 'vec_collatz'.

Let's benchmark on a sample of 1000 integers from 1 to 10^6 for a more realistic comparison of four approaches:

```set.seed(20)
collatz_benchmark(sample(1:10^6, 1000), replications = 100) # will take a while to run!
test replications elapsed relative
1 cpp_collatz(nums) 100 0.05 1.667
4 sapply(nums, collatz) 100 0.23 7.667
2 sug_collatz(nums) 100 0.03 1.000
3 vec_collatz(nums) 100 31.15 1038.333```

The results are remarkable: the C++ functions 'sug_collatz' is the winner with 'cpp_collatz' a close second. The slight advantage of 'sug_collatz' may be due to 'sapply' in C++ using a more efficient method for running the loop (such as iterators). Moreover, both functions are about 1000 times faster than 'vec_collatz'! Even though 'vec_collatz' was specifically written to take advantage of vectorization in R, it still pales in comparison to the might of C++. Even more surprising is that the hybrid approach is only about 8 times slower than C++ approach, which is not bad at all. But if your "vectorization" consists of wrapping a function in 'sapply', then it's best to do it in C++ as we did with 'sug_collatz' than to do it in R.

One take-away lesson for us is that istead of spending a lot of time and effort making our R code as efficient as possible, it may be worth investing that time in learning how to code in a language like C++. Especially since packages like 'Rcpp' and the 'Rcpp-sugar' extension give us the best of both worlds. Another take away is that hybrid solutions are a good alternative in cases when our C++ code is efficient but not vectorized.