**The Pith of Performance**, and kindly contributed to R-bloggers)

Popping Part III off the stack—where I ended up unexpectedly discovering that the `primes` and `primlist` functions are broken in the **schoolmath** package on CRAN—let’s see what prime numbers look like when computed correctly in R. To do this, I’ve had to roll my own prime number generating function.

**Personalizing primes in R**

For what I want to show, I mostly need a list of primes in some arbitrary range. Here is a recursive function to do just that:

# "To iterate is human, to recurse, divine." --Peter Deutsch

primegen=function(v){

return(sapply(v,function(z){sum(z/1:z==z%/%1:z)})==2)

}

Prime numbers only have themselves and 1 as factors. All other numbers are called *composites*. Here are the first 10 primes:

> nmax<-30

> primelist <- (1:nmax)[primegen(1:nmax)]

> primelist

[1] 2 3 5 7 11 13 17 19 23 29

> length(primelist)

[1] 10

Not everything in **schoolmath** is broken, so I don’t have to abandon it altogether. I can use it to do things like check that this list of numbers are all primes:

library(schoolmath)

> is.prim(primelist)

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

**The point of primes**

It was once thought by mathematicians that prime numbers could never have any application to the real world. As usual, that view has turned out to be quite mistaken—especially since the advent of computers. Today, prime numbers are used in cryptography, hash tables, pseudo-random number generation and distributed computing.

My interest, however, is in none of these things. I want to show a connection between prime numbers and scalability in highly parallel computing, which I will get to in Part I. But before I do that, I need to review some of the properties of prime numbers that I will be referring to in Part I. Remember, I’m going backwards because I’m now popping this stack.

**Mersenne primes**

Since all primes (except 2) are *odd* numbers, any number M having the form 2^{p} − 1 will be odd, but will it also be prime? Here, you have to be a bit careful. What if we choose the exponent p to be prime, will that guarantee that M is prime? Sounds like a good idea. Let’s try it for p = 17, since that’s one of the first ten primes we generated:

> p<-17

> M17<-2^p-1

> M17

[1] 131071

> is.prim(M17)

[1] TRUE

Seems to work. What about p = 29; the last of the primes on our list of ten?

> p<-29

> M29<-2^p-1

> M29

[1] 536870911

> is.prim(M29)

[1] FALSE

> prime.factor(M29)

[1] 233 1103 2089

> 233 * 1103 * 2089

[1] 536870911

M29 has factors other than itself and unity, so it’s not prime, even though p is prime. There goes that idea.

On the other hand, if you find an M that is prime, then it’s guaranteed that p will be prime. This makes it difficult to find Ms that are prime, but computers can help. These particular prime numbers are known as Mersenne primes, which you may have heard of because computers are now used collaboratively to discover and validate their existence using software like GIMPS (the SETI@home of mathematics). [Who said math isn’t experimental?] The largest validated prime number of this type contains more than 12 million digits.

**A paucity of provable patterns**

Mathematicians like to find patterns in order to write down equations and theorems and, more importantly, *prove* them. When it comes to the primes, however, it is exceedingly difficult to find patterns that can be expressed as equations. But sometimes you can get lucky, as did the eminent mathematician Stan Ulam while doodling at a conference. He discovered a grid pattern amongst the primes when laid out on a logarithmic spiral. See! Daydreaming and doodling are a good thing, despite what your boss might think.

Three fundamental results about primes are:

- Goldbach’s conjecture (1742)
- Gauss’ prime number theorem (1792, extended in 1849 and published in 1863)
- Riemann’s hypothesis (1859)

The first two are easily understood by non-mathematicians and I’ll use R to discuss them here. The third (aka RH) is the deepest result and would take us too far afield to discuss here. RH would also explain Gauss’ result if it could be proven. To this day, all three results remain unproven. BTW, proving RH has $1 million attached to it, if you have nothing better to do some weekend.

**Numbers are polyatomic primes**

We can think of every prime number as being *atomic*, because it’s not decomposable into any other factors (by definition). The fundamental theorem of arithmetic states that **all** numbers are *molecules* comprised of prime atoms. To see this, let’s take a random set of integers between 20 an 30:

> smpl <- sample(20:30)

> cat(smpl,"\n")

25 21 20 26 28 29 23 27 30 22 24

Now, let’s apply the **schoolmath** functions to see which atoms comprise the composite numbers:

> for(i in 1:length(smpl)) {

+ if(!is.prim(smpl[i])) {

+ pf<-prime.factor(smpl[i])

+ cat(paste(smpl[i],":",sep=""),pf,"\n")

+ }

+ }

25: 5 5

21: 3 7

20: 2 2 5

26: 2 13

28: 2 2 7

27: 3 3 3

30: 2 3 5

22: 2 11

24: 2 2 2 3

Every composite number in the list has been expressed as a molecule composed of only prime numbers as factors. The numbers 23 and 29 in the original list do not appear because they are atomic primes. This is a *multiplicative* property of prime numbers. In the section called **Goldbach’s Comet**, we’ll look at an *additivity* property of primes.

**Gauss’ counter**

What is both engrossing and exasperating about the primes, is summed up in this quote:

“There are two facts about the distribution of prime numbers which I hope to convince you so overwhelmingly that they will be permanently engraved in your hearts.The first is that despite their simple definition and role as the building blocks of the natural numbers, the prime numbers… grow like weeds among the natural numbers, seeming to obey no other law than that of chance, and nobody can predict where the next one will sprout.

The second fact is even more astonishing, for it states just the opposite: that the prime numbers exhibit stunning regularity, that there are laws governing their behaviour, and that they obey these laws with almost military precision.”—Don Zagier (1975)

To get a sense of the way the prime numbers are distributed amongst the composite numbers, we can count the number of primes up to an arbitrary number (n), for successively greater values of n. We saw earlier that there were 10 primes below n = 30. I wrote the following R code to count the primes up to n = 200. Since I’m going to the trouble of doing that, I’ve also computed some related interesting things in R, which I’ll explain shortly.

nmax <- 200

intgnd <- function(x) {1/log(x)} # logarithmic integral

np <- c(0) # number of primes up to n

gp <- rep(0,9) # Gauss' early estimate

li <- c(NULL) # Gauss' later estimate

for (n in 1:nmax) {

if(n > 1) {

i <- which(primelist[1:length(primelist)] <= n)

np <- append(np,max(i))

li <- append(li,integrate(intgnd,2,n)$value)

}

if(n >= 10) gp <- append(gp, n/log(n))

}

plot(np,type="s",lwd=2,xlab="Number (n)",ylab="Count of primes < n")

lines(10:nmax,gp[10:nmax],col="red")

lines(li,col="blue")

The black step-curve in Figure 1 is the count of primes (y-axis) up to n = 200 (x-axis). It belongs to the values of `np` in the R code. The reason it has a step structure arises from the fact that the same number of primes may exist below successive values of n. For example, the same 4 primes (i.e., 2, 3, 5, 7) are counted when going from n = 7 to n = 10. It’s a *discrete* function. But it should be clear, even from this relatively small range of n that the **width** of the steps is highly *irregular* or random looking.

The red curve is Gauss’ original estimate about the distribution of the primes. It belongs to the values of `gp` in the R code. The first thing that is remarkable about this curve is just the idea that a *discrete* counting function could be represented by a *continuous* function: `n/log(n)` in R. Granted it is not an *exact* fit but rather a lower bound (i.e., the count of primes can never be *less* than the red line). But that’s the second remarkable thing. Gauss came up with this lower bound when he was a teenager. What did you do when you were a teenager? So, let’s not be too critical.

Years later, around 1849, Gauss managed to come up with a vastly improved approximation, shown in Figure 1 as the blue curve. It belongs to the values of `li` in the R code. This time it’s an *upper* bound and it’s an even tighter bound than the red curve.

In 1859, one of Gauss’ former students, Georg Bernhard Riemann (from whom Einstein later borrowed heavily), extended the study of prime numbers from 1-dimension to 2-dimensions, viz., the complex plane; another simple idea that turned out to be revolutionary. It leads to something called the Riemann zeta-function, which we are most definitely not going into here. A consequence of Riemann’s function is that it can be used to generate even better approximations to the prime counting function. In Figure 1, I’ve shown it as the green curve for comparison. You may need to click on the figure to enlarge it. Actually, this is only the leading order approximation. Higher order terms in the Riemann zeta-function make the curve actually take on the step structure of prime counter with more and more accuracy. This is truly incredible! We can see that it works, we just can’t *prove* that it works … yet. That’s why there’s a million dollars attached to a proof.

**Skewes in the data**

Gauss’ blue curve is an upper bound on the prime count. It looks like the black and blue curves would never collide, and Gauss thought they didn’t. But Figure 1 only goes up to n = 200, which is very small potatoes to a mathematician. In 1914 it was shown that not only do the two curves cross, they cross *infinitely* many times! OK, so at what value of n does the first intersection occur?

You know what a google is. You know what a googol is, right? Well, that’s numerical peanuts. An initial estimate for the first intersection was given by

n = 10 ^{101034}

a special number known as Skewes number. More recent estimates (2005) now put the first crossing at around 10^{316}; big, but much smaller than originally thought.

Previously, I mentioned that we can see that RH works, we just haven’t managed to prove that it works. But seeing depends on where you are looking. In Figure 1, there’s no hint that that black and blue curve intersect, but we know for sure that there are an infinite number of such intersections. That’s why proofs are important in mathematics.

**Goldbach’s comet**

Earlier, I mentioned that the fundamental theorem of arithmetic relies on the multiplicative property of primes. Here, we look at the *additivity* properties of primes. In 1742, the Prussian amateur mathematician Christian Goldbach wrote a letter to the great Leonhard Euler and conjectured that every even number greater than 2 can be expressed as the sum of two primes.

Let’s check that out in R:

library(schoolmath)

gbp<-function(x) {

i <- 1

parts <- NULL

while (i <= x/2) {

if (is.prim(i) && is.prim(x-i)) {

if (i > 1) {

parts <- append(parts,sprintf("%d+%d",i,x-i))

}

}

i <- ifelse(x == 4,i+1,i+2)

}

return(parts)

}

for (k in seq(4,14,2)) cat(sprintf("%2d: ", k), gbp(k), "\n")

for (k in seq(52,60,2)) cat(sprintf("%2d: ", k), gbp(k), "\n")

which produces:

4: 2+2

6: 3+3

8: 3+5

10: 3+7 5+5

12: 5+7

14: 3+11 7+7

52: 5+47 11+41 23+29

54: 7+47 11+43 13+41 17+37 23+31

56: 3+53 13+43 19+37

58: 5+53 11+47 17+41 29+29

60: 7+53 13+47 17+43 19+41 23+37 29+31

Notice that there can be more than one way to write the sum of primes as the even numbers get bigger. If we count the number of sums (partitions), much like counting the primes themselves:

i <- 1

ge <- NULL

ne <- NULL

for (k in seq(4,2000,2)) {

ne <- append(ne,k)

ge <- append(ge,length(gbp(k)))

}

plot(ne,ge,pch=4,col="blue",xlab="Number (n)",ylab="Distinct partitions")

and plot them in Figure 2, we see something called Goldbach’s comet.

The trend in the distribution of these points (partitions) is very similar the trends in Figure 1; a kind of regularity in the mess:

“One of the remarkable aspects of the distribution of prime numbers is their tendency to exhibit global regularity and local irregularity. The prime numbers behave like the ‘ideal gases’ which physicists are so fond of. Considered from an external point of view, the distribution is – in broad terms – deterministic, but as soon as we try to describe the situation at a given point, statistical fluctuations occur as in a game of chance where it is known that on average the heads will match the tail but where, at any one moment, the next throw cannot be predicted. Prime numbers try to occupy all the room available (meaning that they behave as randomly as possible), given that they need to be compatible with the drastic constraint imposed on them, namely to generate the ultra-regular sequence of integers.

This idea underpins the majority of conjectures concerning prime numbers: everything which is not trivially forbidden should actually happen…”—G. Tenenbaum and M. Mendèfs France (2000)

If you look back over the shapes of the data in Figures 1 and 2, and especially the continuous curves in Figure 1, you will notice that they resemble the shape of throughput curves in performance analysis. And that will bring the next post to the top of the stack: how all this relates to Amdahl’s law.

**leave a comment**for the author, please follow the link and comment on his blog:

**The Pith of Performance**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...