[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Caution / who should read this – although this blog post uses some data on reporting times for Covid-19 deaths in Florida, it’s not really about that issue at all. It’s a quick dip into how to model data that comes from a mixture of two or more different probability distributions.
Mixture distributions
I have thoughts about mixture distributions. Sometimes we get data that could be modelled simply with classical techniques if we make assumptions about its distribution which are mostly right. But the distribution is ‘contaminated’ with another distribution. Very small departures from your assumptions can cause major challenges which are not always visually obvious.
Consider this example, drawn from Rand Wilcox’s excllent Modern Statistics for the Social and Bahvaioral Sciences. A standard normal distribution has been contanimated with 10% of its data coming from a second distribution, also normal but with a much higher variance. Without very large samples and/or a particularly alert approach to the modelling, this sort of data can cause serious pitfalls for inference.
Exactly what the pitfalls are, I don’t have time to talk about right now, but they are serious and involve long tails and difficulties in estimating variance.
That example was drawn with this code, which also sets us up for application next in this post to a real life example that crossed my path a week ago:
Distribution of reporting delays for deaths in Florida
The real life example came from a Twitter thread by Jason Salemi on 28 August 2020, about the distribution of delays in reporting Covid-19 deaths in Florida. I can’t find the actual Twitter thread now, but the discussion in passing mentioned the interesting fact that the mean delay was noticeably smaller than the median. This is unusual for data that has a strict lower bound (report isn’t possible before the actual death) and no upper bound, because typically a long rightwards tail and a few outliers will drag the mean to noticeably higher value than the median.
The example has extra interest for me in thinking about the data generating process for this sort of reporting on deaths. Recently in Victoria it has become clear that there are (at least) two different ways Covid-19 deaths are being reported – via aged care, and everything else. The aged care deaths are characterised by appearing in the official statistics only at some delay (up to a month or more), and in big clumps. This looks like the sort of process that needs to be modelled by a mixture distribution.
Dr Salemi kindly forwarded me the actual Florida data, which looks like this:
The mean delay is 23.25 days and the median is 31. I’ve converted one freqency of minus one in the original data into zero to make my life simpler. That chart was produced with this R code:
It seemed to me from this visualisation that this was likely to be from a mixture of two distributions, but I was still surprised that the median and mean worked out this way. This prompted me to consider “is it easy to construct a mixture distribution, given just the overall median and mean?”
Brute force – fitting a mixture of two Poisson distributions based on just mean and median
It wasn’t hard to do this if I was prepared to assume both distributions contributing to the mixture are Poisson distributions. This means I have just three parameters to estimate – the lambda (mean and variance) of each Poisson distribution, and the proportion of the data that comes from the first of the two distributions.
If you have the mean of the first distribution and the proportion of data from that distribution, you can calculate the mean of the second that is needed to provide the given total mean. Because:
So:
So really we only have two degrees of freedom to worry about. By brute force we can make a grid of possible values of and , calculate the as above in a way that the overall mean is 23.25, and then use the probability mass functions of Poisson distributions to calculate the median of the resulting mixture directly. Then we just need to see which combinations of the parameters give the correct median (31.0).
It turns out there are not one but many mixtures of two Poisson distributions that will give you this combination of mean and median. Here is a summary of which combinations of parameters can give you this:
… and here are four examples of the actual fitted distributions. These all have the same mean and median, but differ in other characteristics
To do all this, I created two functions in R:
dpois2() calculates the density of a mixture of two Poisson distributions
median_pois2() returns the median of a mixture of two Poisson distributions
And then just calculated the median for all combinations of a grid of possible parameters as described above. Here’s the code that does that and produces those two charts:
Better approach - fitting mixtures in Stan using all the data
There are two serious shortfalls with the approach to modelling above:
Because I am using only the mean and the median summary statistics, I am discarding a lot of extra information in the original data
While Poisson distributions are easy to use because they are fully charaterised by a single parameter, there is no particular reason to think that they are a good model of the number of days death by which reporting is delayed.
Shortfall 1 suggests I need to find a way to estimate a good mixture of distributions that uses the full original dataset. Shortfall 2 suggests using a different distribution, but still one that works with integers; negative binomial is my usual go-to here.
To address my challenges in easy steps, I started by fitting a mixture of two Poisson distributions to my delays data, using the full data not just the summary statistics. Here’s the simple program in Stan that sets out this model
Running this from R is a one-liner, and it quickly comes back with an estimate that the data could come from two Poisson distributions with means of 3.4 and 35.4, with 38% of the data coming from the first of these. This isn’t hugely different from my cruder estimates based on just the mean and median, but noticeably the mean of the first distribution is now quite a bit higher. Results from Stan are just pasted into the code below as comments, for reference.
What about addressing shortfall 2? It’s straightforward to adapt my Stan program to work with a mixture of K negative binomial distributions. Note that there are two different popular ways to characterise a negative binomial distribution, I am choosing the method that specifies the mean and a dispersion parameter:
Fitting this gives me a mixture with 40% of the data from a negative binomial distribution with (mean, size) parameters of (4.8, 0.77); and 60% from a distribution with (mean, size) of (35.8, 13.7). Closing the circle, if I simulate data in R from that mixture, this is what it looks like:
It’s noticeably not a perfect fit to the original Florida data, suggesting that even this model is perhaps under-fit and more complex things are going on. Also that more data (as always) would be helpful. But I think I’m onto a good approach. As an aside, I find it extraordinary how well Stan works in this situation.
Here’s the final R code for controlling the Stan program with the negative binomial mixture, and simulating data from the result. Again, results from Stan are just pasted into the code below as comments, for reference.
Related
To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.