R and the Collatz Conjecture: Part 2

May 6, 2014
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics


In the last article, we showed two separate R implementations of the Collatz conjecture: 'nonvec_collatz' and 'vec_collatz', with the latter being more efficient than the former because of the way it takes advantage of vectorization in R. Let's once again take a look at 'vec_collatz':

vec_collatz <- function(ints) {
+ niter <- integer(length(ints))
+ while (abs(sum(ints - 1)) > .01) {
+ niter <- niter + ifelse(ints == 1, 0, 1)
+ ints <- ifelse(ints == 1, ints, ifelse(ints %% 2 == 0, ints / 2, 3*ints + 1))
+ }
+ 
+ niter
+ }

Today we will learn a thrid, and far more efficient way of implementing the Collatz conjecture. It involves rewriting the function in C++ and using the 'Rcpp' package in R to compile and run the function without ever leaving the R environment.

library("Rcpp")

One important difference between R and C++ is that when you write a C++ function, you need to declare your variable types. The C++ code chunk shown below creates a function called 'cpp_collatz' which takes an input of type 'IntegerVector' and whose output is of type 'IntegerVector'. Unlike in R, where explicit loops can slow your code down, loops in C++ are usually very efficient, even though they are tedious to write.

cpptxt <- '
+ IntegerVector cpp_collatz(IntegerVector ints) {
+ IntegerVector iters(ints.size());
+ for (int i=0; i<ints.size(); i++) {
+ int nn = ints(i);
+ while (nn != 1) {
+ if (nn % 2 == 0) nn /= 2;
+ else nn = 3 * nn + 1;
+ iters(i) += 1;
+ }
+ }
+ return iters;
+ }'
 
cpp_collatz <- cppFunction(cpptxt)
set.seed(20)
cpp_collatz(sample(20))
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9
set.seed(20)
a <- vec_collatz(sample(2000))
set.seed(20)
b <- cpp_collatz(sample(2000))
all.equal(a, b) # should return TRUE
[1] TRUE

Let's now redo our C++ implementation in a slightly different way. We would rather not have C++ code interspersed with R code: Not only does it make it hard to read, but we also won't be able to take advantage of syntax-hightlighting specific to C++ (among other annoyances). So let's store the C++ code in a file we call 'collatz.cpp' and use the 'sourceCpp' function in R to call it. Here is the content of 'collatz.cpp':

cat(paste(readLines(file("collatz.cpp")), collapse = "\n"))
 
#include <Rcpp.h>
using namespace Rcpp;
 
// [[Rcpp::export]]
int collatz(int nn) {
int ii = 0;
while (nn != 1) {
if (nn % 2 == 0) nn /= 2;
else nn = 3 * nn + 1;
ii += 1;
}
return ii;
}
 
// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
IntegerVector iters(ints.size());
for (int i=0; i<ints.size(); i++) {
iters(i) = collatz(ints(i));
}
return iters;
}
 
// [[Rcpp::export]]
IntegerVector sug_collatz(IntegerVector ints) {
return sapply(ints, collatz);
}

There are three things worth mentioning about the above code chunk:

  1. We broke up the function in two: one function 'collatz' that operates on a single integer and returns its stopping time, and another function 'cpp_collatz' that runs the 'collatz' function on a vector of integers and returns a vector of stopping times. This makes the code easier to read (no more nested loops) and more modular.
  2. There are two extra lines at the top, which tell C++ what namespace to use, as well as an extra line '// [[Rcpp::export]]' before each function definition. The latter is used to let Rcpp know that theses functions should be made available to R once they have been compiled.
  3. The third function, called 'sug_collatz', does the same thing as 'cpp_collatz' but uses the Rcpp-sugar extension, which is an attempt to take common R functions and rewrite them in C++ so the C++ code can look like its R counterpart when possible. This can sometimes come at a very small cost but it saves us a lot of hassle, as you can see.

To compile the C++ code, just type the following in R:

sourceCpp("collatz.cpp")

Assuming that we have a C++ compiler istalled, it will take a few seconds to run. We can now type the following into the R console:

cpp_collatz
function (ints) 
.Primitive(".Call")(<pointer: 0x000000006e042180>, ints)
 
cpp_collatz(1:10) # seems to be working fine.
[1] 0 1 7 2 5 8 16 3 19 6
 
sug_collatz(1:10) # same as above.
[1] 0 1 7 2 5 8 16 3 19 6

And now let's get to the reason we bothered with C++ in the first place: efficiency. There are four comparisons we're interested in:

  1. the 'cpp_collatz' function
  2. the 'sug_collatz' function, which uses 'sapply' in C++ instead of an explicit loop
  3. the 'vec_collatz' function we wrote earlier which is our R implementation
  4. the 'collatz' function we wrote earlier in C++, which works only on a single integer, but we can vectorize it by wrapping it in 'sapply' in R.

We can think of the last case as being a hybrid approach. One reason to include it is because someone may share a complicated piece of C++ code that works but is not vectorized and vectorizing it may turn out to be a non-trivial task (programmers are supposed to be lazy after all!).

collatz_benchmark <- function(nums, ...) {
+ require(rbenchmark)
+ benchmark(
+ cpp_collatz(nums), # runs completely in C++
+ sug_collatz(nums), # runs completely in C++
+ vec_collatz(nums), # runs completely in R in a vectorized fashion
+ sapply(nums, collatz), # runs collatz in C++ but sapply in R
+ columns = c("test", "replications", "elapsed", "relative"),
+ ...
+ )
+ }

Let's compare the two functions for all integers from 1 to 10^4:

collatz_benchmark(1:10^4, replications = 20)
test replications elapsed relative
1 cpp_collatz(nums) 20 0.03 1.000
4 sapply(nums, collatz) 20 0.53 17.667
2 sug_collatz(nums) 20 0.03 1.000
3 vec_collatz(nums) 20 51.36 1712.000

And for all integers from 1 to 10^5:

collatz_benchmark(1:10^5, replications = 20)
test replications elapsed relative
1 cpp_collatz(nums) 20 0.45 1.071
4 sapply(nums, collatz) 20 5.76 13.714
2 sug_collatz(nums) 20 0.42 1.000
3 vec_collatz(nums) 20 753.08 1793.048
 

As we can see 'cpp_collatz' and 'sug_collatz' are almost identical when it comes to efficiency and both are far more efficient than 'vec_collatz', and increasingly so for larger sequences of integers. Also notice the relative efficiency of the hybrid approach compared to 'vec_collatz'.

Let's benchmark on a sample of 1000 integers from 1 to 10^6 for a more realistic comparison of four approaches:

set.seed(20)
collatz_benchmark(sample(1:10^6, 1000), replications = 100) # will take a while to run!
test replications elapsed relative
1 cpp_collatz(nums) 100 0.05 1.667
4 sapply(nums, collatz) 100 0.23 7.667
2 sug_collatz(nums) 100 0.03 1.000
3 vec_collatz(nums) 100 31.15 1038.333

The results are remarkable: the C++ functions 'sug_collatz' is the winner with 'cpp_collatz' a close second. The slight advantage of 'sug_collatz' may be due to 'sapply' in C++ using a more efficient method for running the loop (such as iterators). Moreover, both functions are about 1000 times faster than 'vec_collatz'! Even though 'vec_collatz' was specifically written to take advantage of vectorization in R, it still pales in comparison to the might of C++. Even more surprising is that the hybrid approach is only about 8 times slower than C++ approach, which is not bad at all. But if your "vectorization" consists of wrapping a function in 'sapply', then it's best to do it in C++ as we did with 'sug_collatz' than to do it in R.

One take-away lesson for us is that istead of spending a lot of time and effort making our R code as efficient as possible, it may be worth investing that time in learning how to code in a language like C++. Especially since packages like 'Rcpp' and the 'Rcpp-sugar' extension give us the best of both worlds. Another take away is that hybrid solutions are a good alternative in cases when our C++ code is efficient but not vectorized.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.