“Count” data occur frequently in economics. These are simply data where the observations are non-negative integers, usually 0, 1, 2, … . However, the range of values may be truncated (e.g., 1, 2, 3, …).
To model data of this form we typically resort to distributions such as the Poisson, negative binomial, or variations of these. These variations may account for truncation or censoring of the data, or the over-representation of certain count values (e.g., the “zero-inflated” Poisson distribution).
Covariates (explanatory variables) can be included in the model by making the mean of the distribution a function of these variables. After all, that’s exactly what we do in a linear regression model.
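To make this concrete, here is a minimal sketch in Python (standard library only) of a Poisson log-likelihood in which each observation's mean depends on covariates. The log link, mean = exp(x'β), is an assumption added for illustration because it keeps the mean positive; the function names and the tiny data set are made up.

```python
import math

def poisson_mean(x, beta):
    """Conditional mean exp(x'beta); the log link keeps the mean positive."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def poisson_loglik(y, X, beta):
    """Poisson log-likelihood with an observation-specific mean."""
    total = 0.0
    for yi, xi in zip(y, X):
        mu = poisson_mean(xi, beta)
        # log of exp(-mu) * mu**yi / yi!
        total += yi * math.log(mu) - mu - math.lgamma(yi + 1)
    return total

# Made-up data: a constant term and one covariate.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [1, 2, 6]

print(poisson_loglik(y, X, beta=(0.0, 0.0)))
print(poisson_loglik(y, X, beta=(0.0, 0.8)))
```

Maximum likelihood estimation amounts to searching for the β that makes this function as large as possible; in practice a numerical optimizer or a packaged GLM routine would do that search.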
If the “count” data form a time-series, then there are other issues that have to be taken into account.
However, the discrete distributions that we typically use have a number of limitations. The fact that the Poisson distribution is, of necessity, “equi-dispersed” (its variance equals its mean) is a big limitation. This leads us to consider distributions such as the negative binomial, in which the variance exceeds the mean. This enables us to model “over-dispersed” data, which are encountered frequently in practice.
The standard distributions are also limited in the distributional shapes that they can capture. In particular, there are limitations on the modal values in the data.
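A quick, informal way to look for over-dispersion is to compare the sample variance with the sample mean. The sketch below, in Python, computes that ratio; the helper name `dispersion_index` and the counts are made up for illustration, and the ratio is only a descriptive check, not a formal test.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean (roughly 1 for Poisson data)."""
    return variance(counts) / mean(counts)

# Made-up counts with many zeros and a few large values.
counts = [0, 0, 1, 0, 5, 2, 0, 7, 1, 0]

print(dispersion_index(counts))
```

A ratio well above 1 points towards over-dispersion, and so away from the Poisson model.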
For instance, in the case of the Poisson distribution, these limitations are the following. If the parameter (λ) of the Poisson distribution is an integer, then there are two adjacent modes with equal modal height, at x = λ and x = λ − 1. If λ is not an integer, then there is a single mode at int(λ), the integer part of λ.
In the case of the negative binomial distribution, the situation is similar: there is a single mode or, for particular parameter values, two adjacent modes of equal height.
This suggests that standard discrete distributions of the type that we typically use to model “count” data will not be very satisfactory if our data exhibit multi-modality.
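The Poisson rule is easy to check numerically. The following Python sketch (standard library only; the function names are illustrative) evaluates the Poisson probabilities directly and returns the mode or modes implied by the rule.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability P(X = x), computed on the log scale."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def poisson_modes(lam):
    """Mode(s) of a Poisson(lam) distribution, following the rule in the text."""
    if float(lam).is_integer():
        return [int(lam) - 1, int(lam)]   # two adjacent modes of equal height
    return [math.floor(lam)]              # single mode at the integer part

print(poisson_modes(3))     # [2, 3]
print(poisson_modes(3.7))   # [3]

# With an integer parameter the two modal probabilities coincide
# (up to floating-point rounding).
print(poisson_pmf(2, 3), poisson_pmf(3, 3))
```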
We need to look to alternative distributions.
Here’s an example of what I mean.
In an earlier post, I discussed some of my work involving the use of the so-called Hermite distribution, introduced by Kemp and Kemp (1965). As an example, I showed the distribution of data relating to the number of financial crises in various countries, as reproduced here:
You can see that, apart from being multi-modal, this empirical distribution is over-dispersed (its variance is approximately twice its mean).
In Giles (2010) I used the Hermite distribution, and various covariates, to model these data using maximum likelihood estimation.
The Hermite distribution can be generalized in various ways. Recently, Moriña et al. (2015) have released a terrific R package, called hermite, that makes it really easy to model “count data” using the Generalized Hermite distribution. We now have a convenient way of dealing with data that exhibit both over-dispersion and multi-modality.
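To see how a Hermite distribution can be both multi-modal and over-dispersed, here is a small Python sketch of the basic (non-generalized) case. It is not the interface of the hermite package. It relies on the standard representation of a Hermite variable as Y1 + 2·Y2, with Y1 and Y2 independent Poisson variables with means a1 and a2, which gives a simple recursion for the probabilities. The parameter values and function names are illustrative assumptions.

```python
import math

def hermite_pmf(n_max, a1, a2):
    """P(X = 0), ..., P(X = n_max) for X = Y1 + 2*Y2, Y1~Poisson(a1), Y2~Poisson(a2)."""
    p = [math.exp(-(a1 + a2))]
    for n in range(1, n_max + 1):
        prev2 = p[n - 2] if n >= 2 else 0.0
        # Recursion: n * p_n = a1 * p_{n-1} + 2 * a2 * p_{n-2}
        p.append((a1 * p[n - 1] + 2.0 * a2 * prev2) / n)
    return p

def local_modes(p):
    """Values whose probability exceeds both neighbours (last value skipped)."""
    modes = []
    for i in range(len(p) - 1):
        left = p[i - 1] if i > 0 else 0.0
        if p[i] > left and p[i] > p[i + 1]:
            modes.append(i)
    return modes

def mean_and_variance(p):
    """Mean and variance implied by a (truncated) probability list."""
    m = sum(n * pn for n, pn in enumerate(p))
    v = sum(n * n * pn for n, pn in enumerate(p)) - m * m
    return m, v

a1, a2 = 0.5, 1.5            # illustrative values only
p = hermite_pmf(60, a1, a2)  # truncation at 60 leaves a negligible tail

print(local_modes(p))
print(mean_and_variance(p))  # theory: mean a1 + 2*a2, variance a1 + 4*a2
```

With these values the probabilities pile up on the even counts, giving several local modes, and the variance (a1 + 4·a2) exceeds the mean (a1 + 2·a2), so the distribution is over-dispersed.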
I strongly recommend this new addition to R.
Giles, D. E., 2010. Hermite regression analysis of multi-modal count data. Economics Bulletin, 30(4), 2936–2945.
Kemp, C. D. and A. W. Kemp, 1965. Some properties of the ‘Hermite’ distribution. Biometrika, 52, 381–394.
Moriña, D., M. Higueras, P. Puig, and M. Oliveira, 2015. Generalized Hermite distribution modelling with the R package hermite. The R Journal, 7(2), 263–274.