
NOTE: Annoyingly, the remote MathJax server often takes its sweet time rendering LaTeX equations (up to a minute or more!). I don’t know whether this is deliberate on the part of Google or a bug; it used to be faster. If anyone knows, I’d be interested to hear, especially if there is a way to speed it up. And no, I’m not planning to move to WordPress.

### The 2-parameter USL model

The original USL model, presented in my GCAP book and updated in the blog post How to Quantify Scalability, is defined in terms of two fitting parameters $\alpha$ (contention) and $\beta$ (coherency). $$X(N) = \frac{N \, X(1)}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl2}$$
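As a minimal sketch, equation (\ref{eqn: usl2}) can be evaluated directly in a few lines of Python. The function name `usl2` and the parameter values below are illustrative assumptions, not fitted values from any data set.

```python
def usl2(n, x1, alpha, beta):
    """Throughput X(N) predicted by the 2-parameter USL.

    n     -- load (users, threads, processors, ...)
    x1    -- measured single-user throughput X(1)
    alpha -- contention parameter
    beta  -- coherency parameter
    """
    return n * x1 / (1 + alpha * (n - 1) + beta * n * (n - 1))


# Hypothetical parameter values, chosen only for illustration.
X1, ALPHA, BETA = 100.0, 0.05, 0.001

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, round(usl2(n, X1, ALPHA, BETA), 1))
```

At $N = 1$ the denominator equals 1, so the function returns $X(1)$ itself. With a non-zero $\beta$, the predicted throughput eventually falls as $N$ keeps growing.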

Fitting this nonlinear model to data requires several steps:

1. Normalize the throughput data, $X(N)$, to determine the relative capacity, $C(N) = X(N) / X(1)$.
2. Fit the normalized data, using the fact that equation (\ref{eqn: usl2}) is equivalent to $X(N) = C(N) \, X(1)$.
3. If the $X(1)$ measurement is missing or simply not available, as is often the case with data collected from production systems, interpolate its value. The GCAP book describes an elaborate technique for doing this.

The motivation for a 2-parameter model arose out of a desire to meet the twin goals of:

1. providing each term of the USL with a proper physical meaning, i.e., not treating the USL like a conventional multivariate statistical model (statistics is not math)
2. satisfying the von Neumann criterion: a minimal number of modeling parameters

Last year, I realized that the 2-parameter constraint is actually overly severe. Introducing a third parameter would make the statistical fitting process even more universal, as well as simplify the overall procedure. For the USL particularly, the von Neumann criterion should not be taken too literally. It’s really more of a guideline: fewer is generally better. Additionally, Baron Schwartz told me that he’d had better luck fitting production RDBMS data in Excel by substituting a third parameter into the numerator of the USL. As ever, the question remained: How could this actually work?

### The 3-parameter USL model

Going back to equation (\ref{eqn: usl2}), let’s just consider the simplest case where scaling is linear-rising, as would be the case for ideal parallelism. In the linear region, where $\alpha = \beta = 0$, equation (\ref{eqn: usl2}) simplifies to $$X(N) = N \, X(1) \label{eqn: usl1}$$

In other words, the overall throughput $X(N)$ increases in simple proportion to $N$. The “single-user” throughput, $X(1)$, doesn’t change and therefore acts like a constant of proportionality.

But what happens when we don’t know the value of $X(1)$? That means the $X(1)$ factor in equations (\ref{eqn: usl2}) and (\ref{eqn: usl1}) is undefined. We might denote this situation by writing

$$X(N) = N \, ? \label{eqn: uslx}$$

Of course, that makes no sense, mathematically speaking. As already mentioned, the conventional way out of this situation is to estimate the value of $X(1)$ using mathematical interpolation. But now, the epiphany.

Rather than using the more complicated interpolation procedure, we can simply appeal to statistical regression! Yes, that’s right, we can treat the USL equation purely as a conventional multivariate statistical model. After all, we’re already using nonlinear statistical regression to determine the $\alpha$ and $\beta$ parameters. More importantly, since statistics is not math, we can replace equation (\ref{eqn: uslx}) with a statement about correlation, rather than strict equality. In statistical models, that’s accomplished by introducing another parameter (I’ll call it $\gamma$, since that’s the third letter of the Greek alphabet) to replace the question mark in equation (\ref{eqn: uslx}), namely

$$X(N) = N \, \gamma \label{eqn: uslg}$$

The new parameter $\gamma$ is just a constant of proportionality that represents the slope of the line associated with ideal parallel scaling. See the plots below.

And here’s a little piece of magic. If we choose $N = 1$ in equation (\ref{eqn: uslg}), it becomes $X(1) = \gamma$. So, when the $\gamma$ parameter is determined by statistical regression, it also tells us the estimated value of $X(1)$, whether it was measured or missing. In other words, we don’t need to do any explicit interpolation because the nonlinear regression procedure does it automatically by fitting the third parameter.

Equation (\ref{eqn: usl2}) is now replaced by a 3-parameter version of the USL model: $$X(N) = \frac{N \, \gamma}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl3}$$

Unlike the 2-parameter USL, equation (\ref{eqn: usl3}) can be fitted directly to your measured throughput data without the need to do data normalization or interpolation. The following examples show the results of fitting the 3-parameter USL model.
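To illustrate how all three parameters can be estimated from raw throughput alone, here is a minimal Python sketch. It is not the nonlinear regression used for the fits in this post. Instead, it swaps in a simpler linearizing transform: dividing $N$ by equation (\ref{eqn: usl3}) gives

$$\frac{N}{X(N)} = \frac{1}{\gamma} + \frac{\alpha}{\gamma} (N - 1) + \frac{\beta}{\gamma} N (N - 1)$$

which is linear in the three coefficients and can be solved by ordinary least squares. The function names (`usl3`, `solve3`, `fit_usl3`) and the data are assumptions made for this example: the data are synthetic, noise-free, and generated from made-up parameter values. With noisy measurements this transform weights the errors differently from a direct nonlinear fit, so the estimates would generally differ somewhat.

```python
def usl3(n, gamma, alpha, beta):
    """Throughput X(N) predicted by the 3-parameter USL."""
    return n * gamma / (1 + alpha * (n - 1) + beta * n * (n - 1))


def solve3(m, v):
    """Solve the 3x3 linear system m x = v by Gaussian elimination
    with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        s = sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (a[r][3] - s) / a[r][r]
    return x


def fit_usl3(loads, throughputs):
    """Estimate (gamma, alpha, beta) by least squares on N / X(N)."""
    # Regressors 1, (N - 1), N (N - 1); response N / X(N).
    rows = [(1.0, n - 1.0, n * (n - 1.0)) for n in loads]
    ys = [n / x for n, x in zip(loads, throughputs)]
    # Normal equations: (A^T A) p = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a, b, c = solve3(ata, aty)  # a = 1/gamma, b = alpha/gamma, c = beta/gamma
    return 1.0 / a, b / a, c / a


# Synthetic data from made-up parameters; note there is no N = 1 point.
TRUE_PARAMS = (1000.0, 0.02, 0.0001)
loads = [2, 4, 8, 16, 32, 64, 128]
data = [usl3(n, *TRUE_PARAMS) for n in loads]

gamma, alpha, beta = fit_usl3(loads, data)
print(f"gamma={gamma:.2f} alpha={alpha:.4f} beta={beta:.6f}")
```

Because the synthetic data contain no $N = 1$ measurement, the recovered $\gamma$ plays the role of the missing $X(1)$, with no separate normalization or interpolation step.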

These are load-test data and the “single-user” throughput was measured as $X(1) = 955.16$ per unit time. The 3-parameter USL fit is summarized in the following plot.

The fitted value is $\gamma = 995.65$, which is the estimated value of $X(1)$. It can also be regarded as the slope of the linear-rising throughput indicated by the sloping red line on the left of the plot.

### Production data

These data are from a continuously running production system, so no $X(1)$ value was ever measured.

The fitted value of $\gamma = 3.22$ is also equivalent to the estimated value of $X(1)$. Similarly, it can be regarded as the slope of the linear-rising throughput on the left of the plot. Interestingly, in these data, $\alpha = 0$, while $\beta$ is non-zero. That suggests there is no significant contention in the workload but there is some data exchange coherency at play.

One word of caution. Fitting the 3-parameter USL can be more sensitive to the actual data, especially with a large number of production data scatter points. I’ll go into all this, and more, in the upcoming Guerrilla training classes.