(This article was first published on **The Pith of Performance**, and kindly contributed to R-bloggers)

Having finally popped the stack on computing **prime numbers** with R in Part II and Part III, we are now in a position to discuss their relevance for computational scalability.

My original intent was to show how poor partitioning of a workload can defeat the linear scalability expected when full parallelism is otherwise attainable, i.e., zero contention (σ) and zero coherency (κ) delays in the USL formulation. Remarkably, this **load-balance bound** is identical to Gauss' original lower bound on the distribution of the **prime numbers**. In fact, the subtitle of this post could be: *Mr. Amdahl meets Mr. Gauss*.

**Universal scalability**

Let's begin by reviewing the USL model, using the same notation as Chapter 4 of my Guerrilla Capacity Planning book:

C(p) = p / [1 + σ(p − 1) + κp(p − 1)]   (1)

C(p) is the relative or normalized throughput X_p/X_1. The three terms in the denominator correspond respectively to the *Three Cs* of scalability degradation: the available degree of *concurrency*, *contention* delays (with amount σ), and *coherency* delays (with amount κ). It's the coherency latency that is responsible for the retrograde roll-off in throughput scalability often seen in typical application load testing or performance testing measurements.

For the purposes of this discussion, I'm going to refer to the USL model for p-processors, but exactly the same argument can be made for the USL with N-processes.
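As a concrete (if minimal) illustration, eqn.(1) can be coded directly in R; the function name `usl` and the parameter values below are mine, chosen only for demonstration:

```r
# Relative capacity C(p) of eqn.(1)
usl <- function(p, sigma, kappa) {
  p / (1 + sigma*(p - 1) + kappa*p*(p - 1))
}
usl(1, 0.02, 1e-4)              # normalized: C(1) = 1
usl(c(8, 32, 128), 0.02, 1e-4)  # sub-linear, eventually retrograde
```

With κ > 0, C(p) reaches a maximum and then declines, which is exactly the retrograde roll-off described above.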

**Amdahl scalability**

In the case where we can achieve zero coherency degradation (i.e., make κ = 0), we are left with only two terms in the denominator, which makes eqn.(1) identical to Amdahl's law:

A(p) = p / [1 + σ(p − 1)]   (2)

As the number of physical processors becomes very large, for a fixed value of σ, we find:

A(p → ∞) = 1/σ   (3)

which tells us that Amdahl scaling approaches a ceiling (horizontal y-asymptote) given by the *inverse* of the contention parameter σ. For example, if σ = 0.02 (i.e., the execution time is 2% serialized), the maximum relative Amdahl scaling will be 50 times the uniprocessor throughput.
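A quick numerical check of that ceiling (a sketch of my own; `amdahl` is just eqn.(2) coded directly):

```r
# Amdahl's law, eqn.(2)
amdahl <- function(p, sigma) p / (1 + sigma*(p - 1))
sigma <- 0.02
amdahl(10^(2:6), sigma)  # creeps up toward the asymptote as p grows
1/sigma                  # the ceiling of eqn.(3): 50
```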

**Linear scalability**

If we can make σ → 0 (i.e., zero contention), then eqn.(2) says that A(p) will approach linear scaling:

L(p) = p   (4)

This case corresponds to best-case parallelism or concurrency. Correspondingly, the ceiling in eqn.(3) rises to infinity because the scalability becomes unbounded from above.

**Load balance bound**

The preceding assumes that all the physical processors are kept maximally busy during execution of the concurrent workload. Suppose, however, that only a subset 1 ≤ m ≤ p of the processors is active during any execution cycle and that a succession of such cycles is required in order to complete the work. How would that impact scalability?

Assuming that each possible processor subset (m) has a *uniform* probability of being active, we write this probability as:

π(m) = 1/p   (5)

The elapsed time on p processors is then given by the weighted sum of all possible execution times on each processor subset between 1 and p:

T_p = π(1) T_1 + π(2) T_1/2 + π(3) T_1/3 + ... + π(p) T_1/p   (6)

From eqn.(5), all the π(m) probabilities are identical and can be factored out such that eqn.(6) simplifies to:

T_p = (T_1/p) [1 + 1/2 + 1/3 + ... + 1/p]   (7)

The terms in the square brackets form a harmonic series which, up to a constant, is equal to the natural logarithm of p:

Log(p) ≈ 1 + 1/2 + 1/3 + ... + 1/p   (8)
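That constant is the Euler–Mascheroni constant γ ≈ 0.5772, and the quality of the approximation is easy to check numerically (my own sketch):

```r
p <- 1000
Hp <- sum(1/(1:p))  # harmonic number: the bracketed sum in eqn.(7)
log(p)              # the approximation of eqn.(8)
Hp - log(p)         # residual: close to gamma = 0.5772...
```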

Hence, we can write eqn.(7) as:

T_p = T_1 Log(p)/p   (9)

As defined by eqn.(4.13) in Chapter 4 of GCaP, the **speedup** is T_1/T_p. It is the temporal counterpart of C(p) in the USL. Substituting eqn.(9) for T_p into the speedup definition produces:

G(p) = p / Log(p)   (10)

G(p) wants to scale linearly with p, like L(p), but it is thwarted by the slowly increasing logarithm in the denominator. See Figure 2 below. The form of scalability in G(p) is a bound situated between the best-case linear scaling of L(p) and the worst-case Amdahl scaling of eqn.(2): A(p) << G(p) < L(p). In other words, the linear scalability of eqn.(4) can be defeated by poor **load-balancing** of the workload, even though σ = κ = 0 in the USL model.
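The ordering A(p) << G(p) < L(p) can be confirmed numerically; this little comparison is my own, using σ = 0.02 for the Amdahl curve:

```r
p <- 500
p / (1 + 0.02*(p - 1))  # A(p), Amdahl scaling, eqn.(2)
p / log(p)              # G(p), load-balance bound, eqn.(10)
p                       # L(p), linear scaling, eqn.(4)
```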

Remarkably, this *upper* bound due to workload **partitioning** is identical to Gauss' original estimate of the *lower* bound on the distribution of the prime numbers. As far as I'm aware, nobody has spotted this analogy before. I've known about it since 1994.

**Simulations in R**

To get a better idea of what all this means, let's look at the preceding conclusions numerically and graphically, using R.

Here's how the first 10 terms of a harmonic series (e.g., p = 10 in eqn.(7)) can be constructed very simply in R.

> a <- c(1:10)
> a
[1] 1 2 3 4 5 6 7 8 9 10
> 1/a
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000
[9] 0.1111111 0.1000000
> sum(1/a)
[1] 2.928968

We can check that the series summation is correct by comparing it to the **harmonic mean**, which is the *inverse of the mean of the inverses*:

> 1/mean(1/a)
[1] 3.414172
> length(a)/sum(1/a)
[1] 3.414172

Next, we can simulate the elapsed time using all possible processor-subsets by choosing the m-configurations as random variates from a uniform probability distribution.

p <- 500
T1 <- 100      # uniprocessor execution time
Tp <- 0        # reduced multiprocessor time
runs <- 100*p  # must be greater than p
set.seed(1)
m <- round(runif(runs,1,p))  # choose random subsets of p processors
hist(m, xlab="Subsets (m) of processors (p)", ylab="Activity distribution")

The resulting histogram shows the activity of processor subsets.

**Figure 1. Histogram on processor activity**

Finally, we calculate the weighted harmonic sum corresponding to eqn.(6), repeating the simulation over a range of p values and collecting the speedups in the xsim and ysim vectors used by plot:

xsim <- seq(50, 1000, by=50)        # range of processor counts
ysim <- sapply(xsim, function(p) {
    runs <- 100*p
    m <- round(runif(runs,1,p))     # random subsets, as above
    mf <- table(m)                  # does frequency counts
    mm <- as.matrix(mf)             # access counts by index
    a <- as.numeric(rownames(mm))   # subset sizes actually observed
    Tp <- sum((T1/a)*(mm/runs))     # weighted harmonic sum
    T1/Tp                           # speedup
})
plot(xsim, ysim, xlab="Processors (p)", ylab="Effective speedup")
lines(x<-c(2:1000), x/log(x))       # Gaussian bound G(p) of eqn.(10)
lines(x<-c(2:200), x, lty="dashed") # linear bound L(p) of eqn.(4)

The following plot shows some simulation results (circles) compared with the Gaussian bound G(p) in eqn.(10). The dashed line corresponds to L(p) in eqn.(4).

**Figure 2. Load balance bound**

In principle, this load-balancing degradation could be added as a *non-polynomial* term in the denominator of the USL, but then it would no longer belong to the class of rational functions. Moreover, because the logarithm is a slowly diverging function (mathematically speaking), **load imbalance** degradation will generally either be masked by any remaining weak contention and coherency effects or be difficult to distinguish from near-linear growth in actual performance measurements.
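One way to see this (an illustration of my own): over a typical measurement range, G(p) is numerically almost indistinguishable from a straight line.

```r
p <- 2:32
g <- p/log(p)  # load-balance bound over a typical test range
cor(p, g)      # correlation with a straight line is nearly 1
```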
