A look at R vectorization through the Collatz Conjecture

April 1, 2014
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics

You may have heard before that R is a vectorized language, but what do we mean by that? One way to read that is to say that many functions in R can operate efficiently on vectors (in addition to singletons). Here are some examples:

> log(1) # input and output are singletons
[1] 0

> log(1:10) # input and output are vectors
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851

> paste("a", "1") # join two strings together into a single string
[1] "a 1"

> paste(letters[1:10], 1:10) # same as above, but vector-wise
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"

Being aware of which functions are vectorized in R and using them can make a big difference when it comes to writing succinct and efficient R code. For example, had you not known that 'paste' is vectorized, instead of the above line of code you would need to loop through each value of your vector, or alternatively use one of the 'apply' family of functions, both of which are shown below.

> vv <- character(10) # initialize an empty character vector of length 10

> for (i in 1:10) vv[i] <- paste(letters[i], i) # fill in the values

> print(vv)
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"

> sapply(1:10, function(i) paste(letters[i], i)) # use sapply instead of a loop
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"

You can check for yourself that neither of the above two approaches are as efficient as using the 'paste' function in a vectorized fashon.

For a more serious demonstration of what a difference vectorization can make, we will look at an R implementation of the Collatz conjecture. (Check out xkcd for an entertaining presentation.) The Collatz conjecture states that if you start with any positive integer n, and recursively apply the following algorithm, you will eventually reach 1:

• if n is even, then divide it by 2
• if n is odd, then multiply it by 3 and add 1

The number of iterations it takes for n to reach 1 is called its stopping time. To restate, the Collatz conjecture states that any integer n has a finite stopping time.

Let's try a naive implementation of the Collatz conjecture in R (a better implementation would use memoization, which we will not cover). There are two ways to implement the above in R. Here's the first way:

```nonvec_collatz <- function(ints){
collatz <- function(n) {
# n is a single integer
# recusively applies the Collatz conjecture to n
# returns the number of iterations it takes to reach 1
counter <- 0
while (n != 1){
counter <- counter + 1
n <- ifelse(n %% 2  == 0, n / 2, 3*n + 1)
}
counter
}
# we use sapply to apply the above function to a vector of integers
sapply(ints, collatz)
}```

> set.seed(20)
> nonvec_collatz
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9

In the above approach, we write a function 'collatz' which, given a single integer n, will determine its stopping time. Since collatz can only take a single integer as input, we can pass it to 'sapply' to get it to work on a vector on integers. The above implementation is wrapped around in a function which we call 'nonvec_collatz'. Although 'nonvec_collatz' is vectorized in the sense that given a vector of integers as input you get a vector of integers as output, vectorization was achieved through the use of 'sapply' and so it is only aestetically vectorized and not necessarily efficient.

In our second approach, we write a function 'vec_collatz' which is truely vectorized:

```vec_collatz <- function(ints){
# we store the number of iterations for each number into niter
niter <- integer(length(ints))

# while there remains a number that has not yet converged to 1, run the loop
while (abs(sum(ints - 1)) > .01){
niter <- niter + ifelse(ints == 1, 0, 1)
ints <- ifelse(ints == 1, ints, ifelse(ints %% 2  == 0, ints / 2, 3*ints + 1))
}

niter
}                                                                                                                             ```

> set.seed(20)

> vec_collatz(sample(20))
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9

Notice any diffences between 'nonvec_collatz' and 'vec_collatz'? Both functions run a while loop that stops once the number reaches 1. But in 'vec_collatz' the while loop takes advantage of the fact that 'ifelse' is a vectorized function to run the recursive process on the whole vector of integers, instead of one at a time, as is done by 'nonvec_collatz'.

Let's look how much more efficient 'vec_collatz' is compared to 'nonvec_collatz':

```> library(microbenchmark)

> microbenchmark(vec_collatz(1:10^2), nonvec_collatz(1:10^2), times = 20)
Unit: milliseconds
expr                   min      lq       median   uq       max      neval
vec_collatz(1:10^2)    20.63166 21.82983 23.13596 24.33374 25.24081 20
nonvec_collatz(1:10^2) 42.82601 47.40102 49.23785 50.68589 54.94907 20

> microbenchmark(vec_collatz(1:10^3), nonvec_collatz(1:10^3), times = 20)
Unit: milliseconds
expr                   min      lq       median   uq       max      neval
vec_collatz(1:10^3)    213.4263 221.0164 228.4409 234.2387  260.2224 20
nonvec_collatz(1:10^3) 778.5368 920.8171 957.5107 982.2117 1062.2317 20

> microbenchmark(vec_collatz(1:10^4), nonvec_collatz(1:10^4), times = 20)
Unit: seconds
expr                   min        lq         median    uq        max     neval
vec_collatz(1:10^4)     3.138519  3.183883   3.191266  3.276266  3.35356 20
nonvec_collatz(1:10^4) 12.997234 13.349863  13.396626 13.482648 13.87802 20                                                                                               ```

The results above indicate that 'vec_collatz' is much more efficient than 'nonvec_collatz', and increasingly so for larger sets of numbers.

Now let's use 'vec_collatz' to draw a density plot for the stopping times for all integers from 1 to n = 10^i where we vary i from 1 to 6 (we create a separate plot for each i).

```res <- sapply(1:6, function(x) vec_collatz((10^(x-1)):(10^x-1))) # this will take a while
par(mfrow = c(3, 2))
for(i in 1:6) {
plot(density(table(Reduce(c, res[1:i])), kernel = "rectangular"),
main = sprintf("collatz(1:10^%s)", i))
}```

An interesting pattern emerges when we plot each number against its stopping time.

par(mfrow = c(1, 1))
plot(1:(10^4-1), Reduce(c, res[1:4]), xlab = "n", ylab = "stopping time", lwd = 2, col = "blue")

Whether there is a significance to the observed pattern is the subject of a great many dissertations, and we will take a stab at this problem using a statistical approach in a later post.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...