# The Long Tail of the Pareto Distribution

(This article was first published on ExploringDataBlog, and kindly contributed to R-bloggers)

In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.  One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:

$$p(x) = \frac{a k^a}{x^{a+1}}$$

defined for all x > k, where k > 0 and a > 0 are the parameters that define the distribution: k is the smallest value x can take, and a controls how quickly the tail decays.  In the example I discussed last time, I considered the case where a = 1.5, which exhibits a finite mean (specifically, the mean is 3 for this case), but an infinite variance.  As the results I presented last time demonstrated, the extreme data variability of this distribution renders the computed mean too variable to be useful.  Another reason this distribution is particularly interesting is that it exhibits essentially the same tail behavior as the discrete Zipf distribution; there, the probability that a discrete random variable x takes its ith value is:

$$p_i = \frac{A}{i^c},$$

where A is a normalization constant and c is a parameter that determines how slowly the tail decays.  This distribution was originally proposed to characterize the frequency of words in long documents (the Zipf-Estoup law).  It was investigated further by Zipf in the mid-twentieth century in a wide range of applications (e.g., the distributions of city sizes), and it has recently attracted considerable attention as a model for “long-tailed” business phenomena (for a non-technical introduction to some of these, see the book by Chris Anderson, The Long Tail).  I will discuss the Zipf distribution further in a later post, but one reason for discussing the Pareto type I distribution first is that, because it is a continuous distribution, the math is easier, and so more characterization results are available for it.

The mean of the Pareto type I distribution is:

$$\text{Mean} = \frac{a k}{a - 1},$$

provided a > 1, and the variance of the distribution is finite only if a > 2.  Plots of this probability density are shown above, for k = 1 in all cases, and with a taking the values 0.5, 1.0, 1.5, and 2.0.  (This is essentially the same plot as Figure 4.17 in Exploring Data in Engineering, the Sciences, and Medicine, where I give a brief description of the Pareto type I distribution.)  Note that all of the cases considered here are characterized by infinite variance, while the first two (a = 0.5 and 1.0) are also characterized by infinite means.  As the results presented below emphasize, the mean is a very poor characterization in practice for data drawn from any of these distributions, but there are alternatives.  These include the familiar median that I have discussed previously, along with two others that are more specific to the Pareto type I distribution: the geometric mean and the harmonic mean.
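For readers who want to experiment with these definitions, here is a minimal sketch in Python (rather than R), using only the standard library.  The function names `pareto_pdf`, `pareto_mean`, and `pareto_sample` are my own illustrative choices, not part of any package.  The sampler assumes the standard inverse-CDF construction: since the cumulative distribution function is F(x) = 1 - (k/x)^a for x > k, a uniform random number u on (0, 1] gives the Pareto-distributed value k / u^(1/a).

```python
import math
import random


def pareto_pdf(x, k=1.0, a=1.5):
    """Density a * k**a / x**(a + 1) for x > k, and 0 otherwise."""
    return a * k**a / x ** (a + 1) if x > k else 0.0


def pareto_mean(k=1.0, a=1.5):
    """Population mean a*k/(a - 1); infinite when a <= 1."""
    return a * k / (a - 1) if a > 1 else math.inf


def pareto_sample(n, k=1.0, a=1.5, rng=random):
    """Draw n values by inverting the CDF F(x) = 1 - (k/x)**a."""
    # 1 - rng.random() lies in (0, 1], which avoids dividing by zero.
    return [k / (1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]


# Density at x = 2 and population mean for the four cases plotted above.
for a in (0.5, 1.0, 1.5, 2.0):
    print(a, pareto_pdf(2.0, k=1.0, a=a), pareto_mean(k=1.0, a=a))
```

Returning infinity for a ≤ 1 is just a convenient way of flagging the cases where the population mean does not exist.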

The plot below emphasizes the point made above about the extremely limited utility of the mean as a characterization of Pareto type I data, even in cases where it is theoretically well-defined.  Specifically, this plot compares the four characterizations I discuss here – the mean (more precisely known as the “arithmetic mean” to distinguish it from the other means considered here), the median, the geometric mean, and the harmonic mean – for 1000 statistically independent Pareto type I data sequences, each of length N = 400, with parameters k = 1 and a = 2.0.  For this example, the mean is well-defined (specifically, it is equal to 2), but compared with the other data characterizations, its variability is much greater, reflecting the more serious impact of this distribution’s infinite variance on the mean than on these other data characterizations.
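The comparison just described can be sketched roughly as follows.  This is not the code used to produce the plot: it is a Python stand-in that assumes the standard library's `random.paretovariate`, which samples the k = 1 case, and it summarizes the spread of each characterization by its interquartile range (IQR) instead of drawing boxplots.  The seed is arbitrary and is there only for repeatability.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, for repeatability only
N, REPS, A = 400, 1000, 2.0

results = {"mean": [], "median": [], "geometric": [], "harmonic": []}
for _ in range(REPS):
    # random.paretovariate(alpha) draws from the Pareto type I case with k = 1
    x = [random.paretovariate(A) for _ in range(N)]
    results["mean"].append(statistics.fmean(x))
    results["median"].append(statistics.median(x))
    results["geometric"].append(statistics.geometric_mean(x))
    results["harmonic"].append(statistics.harmonic_mean(x))


def iqr(values):
    """Interquartile range: distance between the upper and lower quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


spread = {name: iqr(vals) for name, vals in results.items()}
for name, value in spread.items():
    print(f"{name:10s} IQR = {value:.4f}")
```

The IQR is used here because, like the boxplot itself, it describes the central half of the computed values and is not dominated by a handful of extreme samples.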

To give a more complete view of the extreme variability of the arithmetic mean, the boxplots below summarize the means computed from 1000 statistically independent samples drawn from each of the four Pareto type I distribution examples plotted above.  As before, each sample is of size N = 400 and the parameter k has the value 1, but here the computed arithmetic means are shown for the parameter values a = 0.5, 1.0, 1.5, and 2.0; note the log scale used here because the range of computed means is so large.  For the first two of these examples, the population mean does not exist, so it is not surprising that the computed values span such an enormous range, but even when the mean is well-defined, the influence of the infinite variance is clearly evident.  It may be argued that infinite variance is an extreme phenomenon, but it is worth emphasizing that for the specific “long tail” distributions popular in many applications, the decay rate is slow enough for the variance, and sometimes even the mean, to be infinite, as in these examples.
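To get a feel for why a log scale is needed, the following Python sketch repeats this experiment and reports, for each value of a, how many powers of ten separate the largest computed mean from the smallest.  It again assumes `random.paretovariate` (the k = 1 case) from the standard library, and the variable name `log_range` is purely illustrative.  The exact numbers depend on the arbitrary seed.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed, for repeatability only
N, REPS = 400, 1000

log_range = {}
for a in (0.5, 1.0, 1.5, 2.0):
    # One arithmetic mean per simulated sample of size N
    means = [
        statistics.fmean([random.paretovariate(a) for _ in range(N)])
        for _ in range(REPS)
    ]
    # Number of decades spanned by the computed means
    log_range[a] = math.log10(max(means) / min(means))
    print(f"a = {a}: computed means span {log_range[a]:.2f} decades")
```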

As I have noted several times in previous posts, the median is much better behaved than the mean, so much so that it is well-defined for any proper distribution.  One of the advantages of the Pareto type I distribution is that the form of the density function is simple enough that the median of the distribution can be computed explicitly from the distribution parameters.  This result is given in the fabulous book by Johnson, Kotz and Balakrishnan that I have mentioned previously, which devotes an entire chapter (Chapter 20) to the Pareto family of distributions.  Specifically, the median of the Pareto type I distribution with parameters k and a is given by:

$$\text{Median} = 2^{1/a} k$$

Thus, for the four examples considered here, the median values are 4.0 (for a = 0.5), 2.0 (for a = 1.0), 1.587 (for a = 1.5), and 1.414 (for a = 2.0).  Boxplot summaries for the same 1000 random samples considered above are shown in the plot below, which also includes horizontal dotted lines at these theoretical median values for the four distributions.  The fact that these lines correspond closely with the median lines in the boxplots gives an indication that the computed median is, on average, in good agreement with the correct values it is attempting to estimate.  As in the case of the arithmetic means, the variability of these estimates decreases monotonically as a increases, corresponding to the fact that the distribution becomes generally better-behaved as the a parameter increases.
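The median formula follows from setting the cumulative distribution function F(x) = 1 - (k/x)^a equal to 1/2 and solving for x.  As a rough check, the Python sketch below evaluates the formula for the four cases and compares each value with the middle of the 1000 simulated sample medians.  The helper name `pareto_median` and the dictionary `typical_median` are illustrative, and the simulation assumes `random.paretovariate`, which covers the k = 1 case.

```python
import random
import statistics


def pareto_median(k, a):
    """Theoretical median of the Pareto type I distribution: 2**(1/a) * k."""
    return 2.0 ** (1.0 / a) * k


random.seed(3)  # arbitrary seed, for repeatability only
N, REPS = 400, 1000

typical_median = {}
for a in (0.5, 1.0, 1.5, 2.0):
    medians = [
        statistics.median([random.paretovariate(a) for _ in range(N)])
        for _ in range(REPS)
    ]
    # Middle of the simulated medians, i.e. the centre line of the boxplot
    typical_median[a] = statistics.median(medians)
    print(a, round(pareto_median(1.0, a), 3), round(typical_median[a], 3))
```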

The geometric mean is an alternative characterization to the more familiar arithmetic mean, one that is well-defined for any sequence of positive numbers.  Specifically, the geometric mean of N positive numbers is defined as the Nth root of their product.  Equivalently, the geometric mean may be computed by exponentiating the arithmetic average of the log-transformed values.  In the case of the Pareto type I distribution, the utility of the geometric mean is closely related to the fact that the log transformation converts a Pareto-distributed random variable into an exponentially distributed one, a point that I will discuss further in a later post on data transformations.  (These transformations are the topic of Chapter 12 of Exploring Data, where I briefly discuss both the logarithmic transformation on which the geometric mean is based and the reciprocal transformation on which the harmonic mean is based, described next.)   The key point here is that the following simple expression is available for the geometric mean of the Pareto type I distribution (Johnson, Kotz, and Balakrishnan, page 577):

$$\text{Geometric mean} = k \exp(1/a)$$

For the four specific examples considered here, these geometric mean values are approximately 7.389 (for a = 0.5), 2.718 (for a = 1.0), 1.948 (for a = 1.5), and 1.649 (for a = 2.0).  The boxplots shown below summarize the range of variation seen in the computed geometric means for the same 1000 statistically independent samples considered above.  Again, the horizontal dotted lines indicate the correct values for each distribution, and it may be seen that the computed values are in good agreement, on average.  As before, the variability of these computed values decreases with increasing a values as the distribution becomes better-behaved.
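The geometric mean expression can be read off from the log transformation: log(x/k) is exponentially distributed with mean 1/a, so the average of log x is log k + 1/a, and exponentiating gives k exp(1/a).  The Python sketch below computes the geometric mean as the exponential of the average log and compares the simulated values with the formula.  The function names and the dictionary `avg_gm` are illustrative, and the simulation again assumes `random.paretovariate` for the k = 1 case.

```python
import math
import random
import statistics


def pareto_geometric_mean(k, a):
    """Theoretical geometric mean of the Pareto type I distribution."""
    return k * math.exp(1.0 / a)


def geometric_mean(values):
    """Exponential of the arithmetic average of the log-transformed values."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))


random.seed(4)  # arbitrary seed, for repeatability only
N, REPS = 400, 1000

avg_gm = {}
for a in (0.5, 1.0, 1.5, 2.0):
    gms = [
        geometric_mean([random.paretovariate(a) for _ in range(N)])
        for _ in range(REPS)
    ]
    avg_gm[a] = statistics.fmean(gms)
    print(a, round(pareto_geometric_mean(1.0, a), 3), round(avg_gm[a], 3))
```

Working with the average of the logs instead of the Nth root of the product also sidesteps floating-point overflow, which the product of 400 heavy-tailed values could easily cause.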

The fourth characterization considered here is the harmonic mean, again appropriate to positive values, and defined as the reciprocal of the average of the reciprocal data values.  The log transformation on which the geometric mean is based is often useful in improving the distributional character of data values that span a wide range.  For the Pareto type I distribution, and a number of others, the reciprocal transformation on which the harmonic mean is based also improves the behavior of the data distribution, but this is far from universal.  In fact, reciprocal transformations often make the character of a data distribution much worse: applied to the extremely well-behaved standard uniform distribution, the reciprocal transformation yields the Pareto type I distribution with k = 1 and a = 1, for which none of the integer moments exist; similarly, applied to the Gaussian distribution, it yields a distribution that is bimodal and has infinite variance.  (A little thought suggests that the reciprocal transformation is inappropriate for the Gaussian distribution because it is not strictly positive, but normality is a favorite working assumption, sometimes applied to the denominators of ratios, leading to a number of theoretical difficulties.  I will have more to say about that in a future post.)  For the case of the Pareto type I distribution, the reciprocal transformation converts it into the extremely well-behaved beta distribution, and the harmonic mean has the following simple expression:

$$\text{Harmonic mean} = k (1 + a^{-1})$$

For the four examples considered here, this expression yields harmonic mean values of 3 (for a = 0.5), 2 (for a = 1.0), 1.667 (for a = 1.5), and 1.5 (for a = 2.0).  Boxplot summaries of the computed harmonic means for the 1000 simulations of each case considered previously are shown below, again with dotted horizontal lines at the theoretical values for each case.  As with both the median and the geometric mean, it is clear from these plots that the computed values are correct on average, and their variability decreases with increasing values of the a parameter.
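The harmonic mean expression can be checked directly from the density: the expected value of 1/x works out to a/((a + 1)k), and its reciprocal is k(1 + 1/a).  The Python sketch below computes the harmonic mean from its definition and compares the simulated values with the formula.  As with the earlier sketches, the function names and the dictionary `avg_hm` are illustrative, and `random.paretovariate` is assumed for the k = 1 case.

```python
import random
import statistics


def pareto_harmonic_mean(k, a):
    """Theoretical harmonic mean of the Pareto type I distribution."""
    return k * (1.0 + 1.0 / a)


def harmonic_mean(values):
    """Reciprocal of the arithmetic average of the reciprocal values."""
    return 1.0 / statistics.fmean([1.0 / v for v in values])


random.seed(5)  # arbitrary seed, for repeatability only
N, REPS = 400, 1000

avg_hm = {}
for a in (0.5, 1.0, 1.5, 2.0):
    hms = [
        harmonic_mean([random.paretovariate(a) for _ in range(N)])
        for _ in range(REPS)
    ]
    avg_hm[a] = statistics.fmean(hms)
    print(a, round(pareto_harmonic_mean(1.0, a), 3), round(avg_hm[a], 3))
```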

The key point of this post has been to show that, while the arithmetic mean is not a suitable characterization for the “long tailed” phenomena that are attracting increasing interest in many different fields, useful alternatives do exist.  For the case of the Pareto type I distribution considered here, these alternatives include the popular median, along with the somewhat less well-known geometric and harmonic means.  In an upcoming post, I will examine the utility of these characterizations for the Zipf distribution.