
Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

1. if n is 1, stop; otherwise, apply the following and repeat
2. if n is even, divide it by 2
3. if n is odd, multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz' which, given an integer vector, returns an integer vector of the corresponding stopping times (the number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.

```options(scipen = 999) # turn off scientific notation
require("Rcpp")
sourceCpp(file.path(directory, "collatz.cpp"))
cpp_collatz(1:10) # the stopping times of the first 10 integers
 0 1 7 2 5 8 16 3 19 6```

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

• Is the integer a prime number?
• What are its proper divisors?
• What is the remainder upon dividing the integer by some other number?
• What is the sum of its digits?
• Is the integer a triangular number?
• Is the integer a square number?
• Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0 and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 or n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we discard the negative solution, leaving us with n = (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

```cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
#include <vector>
#include <algorithm>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
  int ii = 0;
  while (nn != 1) {
    if (nn % 2 == 0) nn /= 2;
    else nn = 3 * nn + 1;
    ii += 1;
  }
  return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
  return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
  if (nn < 1) stop("int must be greater than 0.");
  else if (nn == 1) return false;
  else if (nn % 2 == 0) return (nn == 2);
  for (int ii = 3; ii * ii <= nn; ii += 2) {
    if (nn % ii == 0) return false;
  }
  return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
  return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> primes;
  primes.push_back(2);
  int i = 3;
  while (primes.size() < unsigned(n)) {
    if (is_int_prime(i)) primes.push_back(i);
    i += 2;
  }
  return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> divisors;
  divisors.push_back(1);
  for (int i = 2; i <= sqrt(n); i++) {
    if (n % i == 0) {
      divisors.push_back(i);
      divisors.push_back(n / i);
    }
  }
  std::sort(divisors.begin(), divisors.end());
  divisors.erase(std::unique(divisors.begin(), divisors.end()), divisors.end());
  return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
  if (nn < 1) stop("int must be greater than 0.");
  return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
  return sapply(ints, is_int_perfect);
}```

Here is a list of other helper functions in collatz.cpp:

• 'is_prime': given an integer vector, returns a logical vector indicating which elements are prime
• 'gen_primes': given an integer input, n, generates the first n prime numbers
• 'gen_divisors': given an integer n, returns a vector of all its proper divisors
• 'is_perfect': given an integer vector, returns a logical vector indicating which elements are perfect numbers
```sum_digits <- function(x) {
  # returns the sum of the individual digits that make up x
  # x must be an integer vector
  f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
  sapply(x, f)
}```

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computation as possible to C++, leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

```maxint <- 10^5
df <- data.frame(int = 1:maxint) # the original number

df <- transform(df, st = cpp_collatz(int)) # stopping times

df <- transform(df,
  sum_digits = sum_digits(int),
  inv_triangular = (sqrt(8*int + 1) - 1)/2, # inverse triangular number
  inv_square = sqrt(int), # inverse square number
  inv_pentagonal = (sqrt(24*int + 1) + 1)/6 # inverse pentagonal number
)```

To determine whether a numeric value is a whole number, we need to avoid the '==' operator in R: it is not guaranteed to work, because of minute floating-point rounding errors. Here's an example:

```.3 == .4 - .1 # we expect to get TRUE but get FALSE instead
 FALSE```

The solution is to check whether the absolute difference between the two numbers is smaller than some tolerance threshold:

```eps <- 1e-9
abs(.3 - (.4 - .1)) < eps # returns TRUE
 TRUE```

```df <- transform(df,
  is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
  is_square = abs(inv_square - as.integer(inv_square)) < eps,
  is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
  is_prime = is_prime(int),
  is_perfect = is_perfect(int)
)

df <- df[ , names(df)[-grep("^inv_", names(df))]]```

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the 'unique factorization theorem'. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Finally, because the return object must be a single value, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.

```all_prime_divs <- function(x) {
  all_div <- gen_divisors(x)
  all_div <- all_div[is_prime(all_div)]
  paste(all_div, collapse = ",")
}

# all_prime_divs(12) returns "2,3"

df$all_prime_divs <- sapply(df$int, all_prime_divs)
df <- transform(df, all_prime_divs = ifelse(is_prime, int, all_prime_divs))```

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we can generate multiple flag variables out of it, each indicating whether or not a specific prime number divides the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.

```library("stringr")
for (p in gen_primes(25)) {
  df[ , sprintf("is_div_by_%d", p)] <- str_detect(df$all_prime_divs, sprintf("\\b%d\\b", p))
}```

There are many more features we could extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to produce rigorous mathematical work, but to show how the tools of data analysis can be brought to bear on a problem of this nature.

Here's a sample of 10 rows from the data:

```df[sample.int(nrow(df), 10), ]
int st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162 13 FALSE FALSE FALSE FALSE
36084 36084 142 21 FALSE FALSE FALSE FALSE
40793 40793 119 23 FALSE FALSE FALSE FALSE
3374 3374 43 17 FALSE FALSE FALSE FALSE
48257 48257 44 26 FALSE FALSE FALSE FALSE
42906 42906 49 21 FALSE FALSE FALSE FALSE
37283 37283 62 23 FALSE FALSE FALSE FALSE
55156 55156 60 22 FALSE FALSE FALSE FALSE
6169 6169 111 22 FALSE FALSE FALSE FALSE
77694 77694 231 33 FALSE FALSE FALSE FALSE
is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721 FALSE 7,29,107 FALSE FALSE FALSE TRUE
36084 FALSE 2,3,31,97 TRUE TRUE FALSE FALSE
40793 FALSE 19,113 FALSE FALSE FALSE FALSE
3374 FALSE 2,7,241 TRUE FALSE FALSE TRUE
48257 FALSE 11,41,107 FALSE FALSE FALSE FALSE
42906 FALSE 2,3,7151 TRUE TRUE FALSE FALSE
37283 FALSE 23,1621 FALSE FALSE FALSE FALSE
55156 FALSE 2,13789 TRUE FALSE FALSE FALSE
6169 FALSE 31,199 FALSE FALSE FALSE FALSE
77694 FALSE 2,3,23,563 TRUE TRUE FALSE FALSE
is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE TRUE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 TRUE FALSE FALSE FALSE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE TRUE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE FALSE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE TRUE
is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721 TRUE FALSE FALSE FALSE FALSE
36084 FALSE TRUE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE TRUE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE FALSE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE TRUE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE FALSE
is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE FALSE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE FALSE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE FALSE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE FALSE
is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE FALSE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE FALSE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE FALSE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE FALSE
is_div_by_97
21721 FALSE
36084 TRUE
40793 FALSE
3374 FALSE
48257 FALSE
42906 FALSE
37283 FALSE
55156 FALSE
6169 FALSE
77694 FALSE```

We can now move on to looking at various statistical summaries to see if we notice any differences in stopping times (our response variable) when we break up the data in different ways. We will look at the count, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to explore next.

```my_summary <- function(df) {
  with(df, data.frame(
    count = length(st),
    mean_st = mean(st),
    median_st = median(st),
    tmean_st = mean(st[st < quantile(st, .9)]),
    sd_st = sd(st),
    cor_st_int = cor(st, int, method = "spearman")
  ))
}```

To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function from the 'plyr' package. For example, the following gives us the summaries computed in 'my_summary', grouped by 'is_prime'.

```library("plyr")
ddply(df, ~ is_prime, my_summary)
is_prime count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 90408 107.1227 99 95.95392 51.29145 0.1693013
2 TRUE 9592 111.4569 103 100.39643 51.90186 0.1953000```

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the string that makes up each formula, wrap it in 'as.formula', and pass it to 'ddply'.

```flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res
$is_triangular
is_triangular count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 99554 107.58511 99 96.23178 51.36153 0.1700491
2 TRUE 446 97.11211 94 85.97243 51.34051 0.3035063

$is_square
is_square count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 99684 107.59743 99 96.25033 51.34206 0.1696582
2 TRUE 316 88.91772 71 77.29577 55.43725 0.4274504

$is_pentagonal
is_pentagonal count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 99742 107.56948 99.0 96.22466 51.35580 0.1702560
2 TRUE 258 95.52326 83.5 83.86638 53.91514 0.3336478

$is_prime
is_prime count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 90408 107.1227 99 95.95392 51.29145 0.1693013
2 TRUE 9592 111.4569 103 100.39643 51.90186 0.1953000

$is_perfect
is_perfect count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 99995 107.5419 99 96.20092 51.36403 0.1707839
2 TRUE 5 37.6000 18 19.50000 45.06440 0.9000000

$is_div_by_2
is_div_by_2 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 50000 113.5745 106 102.58324 52.25180 0.1720108
2 TRUE 50000 101.5023 94 90.68234 49.73778 0.1737786

$is_div_by_3
is_div_by_3 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 66667 107.4415 99 96.15832 51.30838 0.1705745
2 TRUE 33333 107.7322 99 96.94426 51.48104 0.1714432

$is_div_by_5
is_div_by_5 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 80000 107.5676 99 96.21239 51.36466 0.1713552
2 TRUE 20000 107.4214 99 96.13873 51.37206 0.1688929

.
.
.

$is_div_by_89
is_div_by_89 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 98877 107.5381 99 96.19965 51.37062 0.1707754
2 TRUE 1123 107.5628 97 96.02096 50.97273 0.1803568

$is_div_by_97
is_div_by_97 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE 98970 107.4944 99 96.15365 51.36880 0.17187453
2 TRUE 1030 111.7660 106 101.35853 50.93593 0.07843996```

As we can see from the above results, most comparisons appear to be non-significant (although, in mathematics, even tiny differences can be meaningful, so we will avoid relying on statistical significance here). Here is a summary of the trends that stand out as we go over the results:

1. On average, stopping times are slightly higher for prime numbers compared to composite numbers.
2. On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
3. Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
4. On average, odd numbers have a higher stopping time than even numbers.

This last item could just be a restatement of item 1, however. Let's take a closer look:

```ddply(df, ~ is_prime + is_div_by_2, my_summary)
is_prime is_div_by_2 count mean_st median_st tmean_st sd_st cor_st_int
1 FALSE FALSE 40409 114.0744 106 102.91178 52.32495 0.1658988
2 FALSE TRUE 49999 101.5043 94 90.68434 49.73624 0.1737291
3 TRUE FALSE 9591 111.4685 103 100.40796 51.89231 0.1950482
4 TRUE TRUE 1 1.0000 1 NaN NA NA```

When we limit the analysis to odd numbers only, prime numbers now have a lower average stopping time than composite numbers, which reverses the trend seen in item 1. (The single integer that is both prime and even is, of course, n = 2.)

And one final point: since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. In any case, no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence in the results.

The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:

• We could think of other features that we could add to the data.
• We could think of other statistical summaries to include in the analysis.
• We could try to scale up by looking at all integers from 1 to, say, 1 billion instead of 1 to 10^5.

In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (which is memory-bound) to an eXternal Data Frame, or XDF file for short. In doing so, we will achieve the following improvements:

1. as the data will no longer be bound by available memory, we will scale with the size of the data,
2. since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.