**The Pith of Performance**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This question arose while addressing Comments on a previous blog post about exponentially distributed delays. One of my ongoing complaints is that many, if not most, popular load-test generation tools do not provide exponential variates as part of a library of time delays or think-time distributions. Not only is this situation bizarre, given that all load tests are actually performance models (and who doesn’t love an exponential distribution in their performance models?), but without the exponential distribution you are less likely to observe such things as buffer overflow conditons due to larger than normal (or uniform) queueing fluctuations. Exponential delays are both simple and useful for that purpose, but we are often left to roll our own code and then debug it.

Listing 2.2 on p. 35 of my Perl::PDQ book shows you how to generate exponential variates in **Perl**. Here, however, I want to use R to compare exponential delays with both the *uniform distribution* (the **default** distribution available in all load-test simulators) and the *normal distribution* (the familiar “**bell curve**“).

The R function that generates exponential variates directly is `rexp(n, rate = 1)` where, for example, the parameter called `rate` might correspond to the arrival rate of requests going into your test rig or *system under test* (SUT). The R programming language uses the same notation as p. 57 of my Perl::PDQ book. In my books and classes, I usually write that rate as $\lambda$ to match conventional queueing theory symbology. The probability density function (PDF), or `dexp()` in R, is usually written as:

\begin{equation} f(t) = \lambda e^{-\lambda t} \end{equation}

and shown in Figure 1.

The rate is $\lambda$, but the average or statistical mean of (1) is given by the inverse rate or $1/\lambda$. Since $\lambda$ is the average arrival **rate**, $1/\lambda$ is the average interarrival **time** as would be seen by the SUT. The view from the load-test client corresponds to a think-time delay of $Z = 1/\lambda$ in your script. In either case, the delay is the time interval between requests, whether departing the client or arriving at the SUT. If you could apply the R function `rexp()` directly to produce 10 exponentially distributed delays with a mean time of $Z=30$ seconds, you would write `rexp(10,1/30)` with the result:

> rexp(10,1/30)

[1] 126.841780 78.137417 12.989666 31.544992 9.598437 89.299508 2.063628 4.597275

[9] 24.792890 11.950121

Note that some delays are much smaller than the mean while other delays are much greater. Unfortunately, this R function is not available to you in load-test scripts so, you have to code your own. Here’s how that works.

**Inverse Transformation**

In eqn. (1), we have the output $f(t)$ on the left and the corresponding delay $(t)$ on the right side (in the exponent). However, we would really prefer to have things the other way around: flip a coin to get an *input* on the right and find out what delay that corresponds to as an *output* on the left.

We can use the *inverse transform* to do precisely that. The inverse function does not necessarily exist for an arbitrary probability distribution but, thankfully, the exponential distribution has a very simple form which allows it. The inverse of the exponential function is the **natural logarithm** function. The PDF in (1) lies in the range $0 \le f < \lambda$ on the $y$-axis, but we need to work with *probabilities*. The function which does this is the *cumulative distribution function* $F(t)$ in Figure 2:

\begin{equation} F(t) = 1 – e^{-\lambda t} \end{equation}

which is strictly bounded by the range $0 \le F < 1$. $F(t)$ is the corresponding **area** under $f(t)$ and corresponds to `pexp(q, rate = 1)` in R.

Typically, we would look along the $t$-axis (horizontal) for a particular time $(t)$ and then look up (to the curve) and across to the y-axis $(F)$ to find out the probability of that time occurring. Here, instead, we pick a random point on y-axis interval corresponding to $F$ (e.g., by flipping a coin). The corresponding delay is read off from the t-axis by following the dashed arrow in Figure 2, which shows this *inversion* process for probability values $0.90$, $0.80$ and $0.30$. BTW, those probability values also correspond respectively to $90$th, $80$th, and $30$th percentiles, if you prefer to think of them that way. We can simulate the coin flip by using a variate $u \sim U(0,1)$ chosen from a **uniform** distribution $0 \le u < 1$.

Based on Figure 2, how can we calculate the corresponding interarrival delay $(t)$ that the load generator should use? Letting $u$ represent $F$ in (2) and transposing produces:

\begin{equation} e^{-\lambda t} = 1 – u \end{equation}

Next, we solve (3) for $t$ by taking natural logs of both sides—the **inverse function**:

\begin{equation} \lambda t = – \ln(1 – u) \end{equation}

But the value of $u$ lies in the same interval as $(1-u)$, since they have the same uniform distribution. Hence, we can use the slightly simpler form:

\begin{equation} t = – \frac{\ln(u)}{\lambda} \end{equation}

Finally, we have arrived at the place where we wanted to be: flip a coin to get a **random** *input* on the right hand side of (5) and find out what delay the **client script** should use as an *output* on the left. For load testing, the random delay $(t)$ is associated with a mean think time $Z = 1/\lambda$ and is therefore computed using:

\begin{equation} t = -Z \ln(u) \end{equation}

Equation (6) is what `rexp()` uses under the covers, and it’s what you need to code in your client test scripts. A rather simple formula which, again, underscores the lunacy of not having it integrated into the load-test simulator.

**Examples in R**

Using R, we first generate $10$ random variates (coin tosses) from a **uniform** distribution:

> uVariates <- runif(10) # create $10$ uniform variates

> uVariates

[1] 0.13737362 0.23143301 0.59315715 0.56576222 0.48420383 0.66065404 0.77418521 0.07673259

[9] 0.39985820 0.99247847

Next, we use these values as inputs into eqn. (6), with a mean time of 30 seconds, to calculate 10 **exponentially distributed delays**:

> thinkZ <- 30 # specify the average desired delay

> tDelays <- -thinkZ*log(uVariates) # see eqn. (6)

> tDelays

[1] 59.5515274 43.9039444 15.6688773 17.0874419 21.7574800 12.4357489 7.6783242 77.0228628

[9] 27.4993587 0.2264988

Note the spread of delay times, which would also create significant fluctuations in queue depth as seen by buffers on the SUT side.

For comparison, here are $10$ delay samples produced by a *uniform distribution* with the same mean as used for the exponential samples, i.e., the arithmetic mean $\frac{0+60}{2}=30$ seconds:

> runif(10,0,60)

[1] 25.90247 51.94069 30.43872 10.11337 29.17908 25.90095 31.91967 42.46816 49.94629 34.62311

Similarly, here are $10$ delay samples produced by a *normal distribution* with a mean of $30$ seconds:

> rnorm(10,30)

[1] 30.57623 30.63393 30.45814 29.40551 29.59376 30.38025 29.91399 30.12652 29.67296 29.00941

Clearly, the exponential distribution produces a greater spread of delay times.

**Related Posts**

**leave a comment**for the author, please follow the link and comment on their blog:

**The Pith of Performance**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.