# Relative error distributions, without the heavy tail theatrics

September 19, 2016

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic idea you bring to analysis, it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.

I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.

## The theory

In analytics, data science, and statistics we often assume we are dealing with nice or tightly concentrated distributions such as the normal or Gaussian distribution. Analysis tends to be very easy in these situations and does not require much data. However, for many quantities of interest (wealth, company sizes, sales, and many more) it becomes obvious that we cannot be dealing with such a distribution. The telltale sign is usually when relative error is more plausible than absolute error. For example it is much more plausible we know our net worth to within plus or minus 10% than to within plus or minus \$10.

In such cases you have to deal with the consequences of somewhat wilder distributions, at a minimum the log-normal distribution. In fact this is the important point, and I suggest you read Nina’s article for motivation, explanation, and methods. We have found this article useful both in working with data scientists and in working with executives and other business decision makers. The article formalizes ideas all of these people already “get” or anticipate into concrete examples.

In addition to trying to use mathematics to make things more clear, there is a mystic sub-population of mathematicians that try to use mathematics to make things more esoteric. They are literally disappointed when things make sense. For this population it isn’t enough to see if switching from a normal to log-normal distribution will fix the issues in their analysis. They want to move on to even more exotic distributions such as Pareto (which has even more consequences) with or without any evidence of such a need.

The issue is: in a log-normal distribution we see rare large events much more often than in a standard normal distribution. Modeling this can be crucial as it tells us not to be lulled into too strong a sense of security by small samples. This concern can be axiomatized into “heavy tailed” or “fat tailed” distributions, but be aware: these distributions tend to be more extreme than what is implied by a relative error model. The usual heavy tail examples are Zipf-style distributions or Pareto distributions (people tend to ignore the truly nasty example, the Cauchy distribution, possibly because it dates back to the 17th century and thus doesn’t seem hip).
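To get a feel for the size of the effect, here is a minimal sketch in R. The parameters (`meanlog = 0`, `sdlog = 1`) and the "three standard deviations above the mean" threshold are illustrative assumptions of mine, not values taken from any analysis above. It compares how often each distribution lands more than three of its own standard deviations above its own mean.

```r
# Illustrative assumption: a log-normal with meanlog = 0 and sdlog = 1.
meanlog <- 0
sdlog <- 1

# Closed-form mean and standard deviation of that log-normal.
ln_mean <- exp(meanlog + sdlog^2 / 2)
ln_sd <- sqrt((exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2))

# Probability of landing more than 3 standard deviations above the mean.
p_normal <- pnorm(3, lower.tail = FALSE)
p_lognormal <- plnorm(ln_mean + 3 * ln_sd, meanlog, sdlog, lower.tail = FALSE)

print(c(normal = p_normal,
        lognormal = p_lognormal,
        ratio = p_lognormal / p_normal))
```

Under these assumptions the log-normal exceeds its threshold more than ten times as often as the normal does. The exact ratio depends heavily on `sdlog`, so treat it as a demonstration of direction rather than a general constant.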

The hope seems to be that one is saving the day by bringing in new esoteric or exotic knowledge such as fractal dimension or Zipf’s law. The actual fact is this sort of power-law structure has been known for a very long time under many names. Here are some more references:

Reading these we see that the relevant statistical issues have been well known since at least the 1920’s (so were not a new discovery by the later loud and famous popularizers). The usual defense against the charge of old wine in new bottles is that some small detail (and mathematics is a detailed field) is now set differently. To this I put forward a quote from Banach (from Adventures of a Mathematician, S.M. Ulam, University of California Press, 1991, page 203):

Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies.

Drowning in removable differences and distinctions is the world of the tyro, not the master.

Piantadosi's criticism of the standard rank-frequency plot is worth quoting:

The apparent simplicity of the distribution is an artifact of how the distribution is plotted. The standard method for visualizing the word frequency distribution is to count how often each word occurs in a corpus, and sort the word frequency counts by decreasing magnitude. The frequency f(r) of the r’th most frequent word is then plotted against the frequency rank r, yielding typically a mostly linear curve on a log-log plot (Zipf, 1936), corresponding to roughly a power law distribution. This approach— though essentially universal since Zipf—commits a serious error of data visualization. In estimating the frequency-rank relationship this way, the frequency f(r) and frequency rank r of a word are estimated on the same corpus, leading to correlated errors between the x-location r and y-location f(r) of points in the plot.

## An Example

Let us work through this one detailed criticism using R (all synthetic data/graphs found here). We start with the problem and a couple of observations.

Suppose we are running a business and organize our sales data as follows. We compute what fraction of our sales each item is (be it a count, or be it in dollars) and then rank them (item 1 is top selling, item 2 is next, and so on).
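The bookkeeping described above can be sketched in a few lines of R. The item names and counts are made up purely for illustration.

```r
# Hypothetical unit sales for five items (made-up numbers for illustration).
sales <- c(widget = 50, gadget = 200, gizmo = 25,
           doohickey = 100, thingamajig = 125)

# Fraction of total sales accounted for by each item.
frac <- sales / sum(sales)

# Rank 1 is the top seller; ties.method = "first" breaks ties by input order.
sales_rank <- rank(-sales, ties.method = "first")

d <- data.frame(item = names(sales),
                frac = as.numeric(frac),
                rank = as.integer(sales_rank))
d <- d[order(d$rank), ]
print(d)
```

The same two columns (`rank` and `frac`) are all that the plots discussed below need, whether the sales are measured as counts or in dollars.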

The insight of the Pareto-ists and Zipfians is if we plot sales intensity (probability or frequency) as a function of sales rank we are in fact very likely to get a graph that looks like the following:

Instead of all items selling at the same rate we see the top selling item can often make up a significant fraction of the sales (such as 20%). There are a lot of 80/20 rules based on this empirical observation.

Notice also that the graph is fairly illegible: the curve hugs the axes and most of the visual space is wasted. The next suggestion is to plot on “log-log paper” or plot the logarithm of frequency as a function of logarithm of rank. That gives us a graph that looks like the following:

If the original data is Zipfian distributed (as it is in the artificial example) the graph becomes a very legible straight line. The slope of the line is the important feature of the distribution and is (in a very loose sense) the “fractal dimension” of this data. The mystics think that by identifying the slope you have identified some key esoteric fact about the data and can then somehow “make hay” with this knowledge (though they never go on to explain how).
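As a check on the "straight line" claim, here is a small sketch that builds an exactly Zipfian set of frequencies and fits a line on log-log axes. The number of items and the exponent `s = 1` are assumptions chosen for illustration; they are not the parameters used for the graphs in this article.

```r
# Illustrative assumptions: 1000 items and Zipf exponent s = 1.
n_items <- 1000
s <- 1
item_rank <- seq_len(n_items)

# Zipf: frequency proportional to 1 / rank^s, normalized to sum to 1.
freq <- item_rank^(-s)
freq <- freq / sum(freq)

# On log-log axes this is exactly a line with slope -s.
fit <- lm(log(freq) ~ log(item_rank))
zipf_slope <- unname(coef(fit)[2])
print(zipf_slope)
```

Calling `plot(log(item_rank), log(freq))` followed by `abline(fit)` draws the corresponding picture. The fit is perfect here only because the data were generated from the model being fitted.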

Chris Anderson in his writings on the “long tail” (including his book) clearly described a very practical use of such graphs. Suppose the line on log-log plots is not a consequence of something special, but a consequence of something mundane. Maybe graphs tend to look like this for catalogs, sales, wealth, company sizes, and so on. So instead of saying the perfect fit is telling us something, look at defects in fit. Perhaps they indicate something. For example: suppose we are selling products online and something is wrong with a large part of our online catalogue. Perhaps many of the products don’t have pictures, don’t have good descriptions, or share some other common defect. We might expect our rank/frequency graph to look more like the following:

What happened is that after product 20 something went wrong. In this case (because the problem happened early, at an important low rank) we can see it, but it is even more legible on the log-log plot.

The business advice is: look for that jump, sample items above and below the jump, and look for a difference. As we said the difference could be no images on such items, no free shipping, or some other sensible business impediment. The reason we care is this large population of low-volume items could represent a non-negligible fraction of sales. Below is the theoretical graph if we fixed whatever is wrong with the rarer items and plotted sales:

From this graph we can calculate that the missing sales represent a loss of about 32% of revenue. If we could service these sales cheaply we would want them.
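The calculation behind a number like this is short. The sketch below uses assumptions of my own for illustration (1000 items, Zipf exponent 1, and every item past rank 20 selling at a quarter of its expected rate), so the figure it prints depends entirely on those choices and need not match the 32% quoted above.

```r
# Illustrative assumptions (not the parameters behind the original graphs):
# 1000 items, Zipf exponent 1, and every item past rank 20 selling at only
# a quarter of its expected rate.
n_items <- 1000
item_rank <- seq_len(n_items)
expected <- 1 / item_rank
defect_factor <- ifelse(item_rank > 20, 0.25, 1)
observed <- expected * defect_factor

# Share of potential sales that is missing because of the defect.
lost_fraction <- 1 - sum(observed) / sum(expected)
print(lost_fraction)
```

The point of the exercise is that the low-volume items carry a large share of the total in a Zipfian world, so even a partial defect across them adds up.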

## The flaw in analysis

In the above I used a theoretical Zipfian world to generate my example. But suppose the world isn’t Zipfian (there are many situations where log-normal is a much more plausible situation). Just because the analyst wishes things were exotic (requiring their unique heroic contribution) doesn’t mean they are in fact exotic. Log-log paper is legible because it reprocesses the data fairly violently. As Piantadosi said: we may see patterns in such plots that are features of the analysis technique, and not features of the world.

Suppose the underlying sales data is log-normal distributed instead of Zipfian distributed (a plausible assumption until eliminated). If we had full knowledge of every possible sale for all time we could make a log-log plot over all time and get the following graph.

What we want to point out is: this is not a line. The hook down at the right side means that rare items have far fewer sales than a Zipfian model would imply. It isn’t just a bit of noise to be ignored. This means when one assumes a Zipfian model one is assuming the rare items as a group are in fact very important. This may be true or may be false, which is why you want to measure such a property and not assume it one way or the other.
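The hook can be reproduced without any sampling noise. The sketch below assumes, for illustration only, that the true sales intensities of 1000 items are the evenly spaced quantiles of a log-normal with `meanlog = 0` and `sdlog = 1`, and then compares the local log-log slope near the top ranks with the slope near the bottom ranks.

```r
# Illustrative assumptions: 1000 items whose true sales intensities are the
# evenly spaced quantiles of a log-normal with meanlog = 0 and sdlog = 1.
n_items <- 1000
item_rank <- seq_len(n_items)
intensity <- qlnorm(1 - (item_rank - 0.5) / n_items, meanlog = 0, sdlog = 1)
freq <- intensity / sum(intensity)

# Local slope of log(frequency) against log(rank), averaged over the
# first ten and the last ten gaps between consecutive ranks.
local_slope <- diff(log(freq)) / diff(log(item_rank))
head_slope <- mean(local_slope[1:10])
tail_slope <- mean(local_slope[(n_items - 10):(n_items - 1)])
print(c(head = head_slope, tail = tail_slope))
```

A Zipfian population would give the same slope at both ends. Here the slope at the rare end is far steeper than at the popular end, which is the downward hook described above.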

The above graph doesn’t look so bad. The honest empiricist may catch the defect and say it doesn’t look like a line (though obviously a quantitative test of distributions would also be called for). But this graph was plotting all sales over all time. We would never see that. Statistically we usually model observed sales as a sample drawn from this larger ideal sampling population. Let’s take a look at what that graph may look like. An example is given below.

I’ll confess, I’d have a hard time arguing this wasn’t a line. It may or may not be a line, but it is certainly not strong evidence of a non-line. This data did not come from a Zipfian distribution (I know, because I drew it from a log-normal distribution), yet I would have a hard time convincing a Zipfian that it wasn’t from a Zipfian source.

And this brings us back to Piantadosi’s point. We used the same sample to estimate both sales frequencies and sales ranks. Neither of those is actually known to us (we can only estimate them from samples). And when we use the same sample to estimate both, they necessarily come out very related due to the sampling procedure. Some of the biases seem harmless, such as frequency being monotone decreasing in rank (which is true for the unknown true values). But remember: relations that are true in the full population are not always true in the sample. Suppose we had a peek at the answers and instead of estimating the ranks took them from the theoretical source. In this case we could plot true rank versus estimated frequency:

This graph is much less orderly because we have eliminated some of the plotting bias which was introducing its own order. There are still analysis artifacts visible, but that is better than hidden artifacts. For example the horizontal stripes are items that occurred with the same frequency in our sample, but had different theoretical ranks. In fact our sample is of size 1000, so the rarest frequency we can measure is 1/1000, which creates the lowest horizontal stripe. The neatness of the previous graph came from dots standing on top of each other as we estimated frequency as a function of rank.
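The experiment just described can be sketched as follows. This is a reconstruction under stated assumptions, not the code behind the graphs above: the seed, the 1000-item population, and the log-normal parameters are all choices made for illustration.

```r
set.seed(2016)  # arbitrary seed, only so the sketch is repeatable

# Illustrative assumptions: 1000 items with log-normal true intensities,
# and an observed sample of 1000 sales.
n_items <- 1000
n_sales <- 1000
intensity <- sort(rlnorm(n_items, meanlog = 0, sdlog = 1), decreasing = TRUE)
true_rank <- seq_len(n_items)  # known only because we built the population

# Draw the sales and count how often each item was sold.
sold <- sample(true_rank, size = n_sales, replace = TRUE, prob = intensity)
counts <- tabulate(sold, nbins = n_items)

seen <- counts > 0
obs <- data.frame(true_rank = true_rank[seen],
                  est_freq = counts[seen] / n_sales)
# Estimated rank comes from the same sample as the estimated frequency.
obs$est_rank <- rank(-obs$est_freq, ties.method = "first")

# Frequency is monotone in estimated rank by construction, but need not be
# monotone in true rank.
by_est <- obs[order(obs$est_rank), ]
by_true <- obs[order(obs$true_rank), ]
print(all(diff(by_est$est_freq) <= 0))
print(all(diff(by_true$est_freq) <= 0))
print(min(obs$est_freq))
```

Plotting `log(est_freq)` against `log(est_rank)` gives the tidy picture, while plotting it against `log(true_rank)` gives the messier one with horizontal stripes. The first check is true no matter what population the sample came from, which is exactly why the tidy picture is weak evidence. The second check will typically fail, and no observed frequency can fall below `1 / n_sales`.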

We are not advocating specific changes, we are just saying the log-log plot is a fairly refined view, and as such many of its features are details of processing, not all of them correctly inferred or estimated features of the world. Again, for a more useful applied view we suggest Nina Zumel’s “Living in A Lognormal World.”
