Data that’s been only partially observed
I’ve been updating my skills in fitting models to truncated data and was pleased to find that, like so much else in statistics, it’s easier than it used to be.
First, some definitions:
- censored data is where some observations are cut off at some maximum or minimum level; those data points are “off the scale” but at least we know they exist and we know which direction they are off the scale. For example, if we were analysing the life span of people born in 1980, anyone who has survived to the end of 2017 has their age at death recorded as “37 or higher”. We know they’re in the data, and their value is at least some minimum amount, but we don’t know with precision what it will end up being.
- truncated data is where data beyond some maximum or minimum level is just missing. Typically this is because of some feature of the measurement process, e.g. anything smaller than X just doesn’t show up.
I’ve got some future blog posts on a more substantive real life issue where I have count data for which, in some situations, I only see the observations with a count higher than some threshold. Let’s imagine, for example, we are looking at deaths per vehicle crash, and are dependent for measurement on newspapers that only report crashes with at least two deaths, even though many crashes have one or zero deaths.
Here’s a greatly simplified example. I generate 1,000 observations of counts, with an average value of 1.3. Then I compare that original distribution with what I’d get if only those of two or higher were observed. It looks like this:
…generated by this code:
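A minimal sketch of such a simulation, using only base R. The seed is an arbitrary assumption, so the numbers it produces will differ slightly from those quoted in this post; the object names `a` and `b` match the ones used in the text, and the base-graphics bar plot is a stand-in for the comparison plot rather than a reproduction of it.

```r
# Simulate the full data and the truncated version of it.
# The seed is arbitrary, not the one behind the figures quoted in the text.
set.seed(321)
a <- rpois(1000, lambda = 1.3)   # all 1,000 counts
b <- a[a > 1]                    # only counts of two or more are "observed"

# Frequency tables for both versions, on a common set of count values
count_values <- 0:max(a)
comparison <- rbind(
  full      = table(factor(a, levels = count_values)),
  truncated = table(factor(b, levels = count_values))
)
print(comparison)

# Side-by-side bars: the truncated data has nothing at 0 or 1
barplot(comparison, beside = TRUE, legend.text = TRUE,
        xlab = "Count", ylab = "Number of observations")

# The naive mean of the truncated data is much higher than that of the full data
c(mean_full = mean(a), mean_truncated = mean(b), n_truncated = length(b))
```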
Estimating the key parameter `lambda` for the full data (`a`) works well, giving an estimate of 1.347 that is just over one standard error from the true value of 1.3. The `fitdistr` function from the `MASS` package distributed with base R does a nice job in such circumstances.
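As a rough illustration of that fit on the full data, assuming an arbitrary seed (so the estimate will not be exactly 1.347):

```r
# Fit a Poisson distribution to the full, untruncated data
set.seed(321)                      # arbitrary seed, an assumption
a <- rpois(1000, lambda = 1.3)

fit_full <- MASS::fitdistr(a, densfun = "Poisson")
fit_full                           # estimate of lambda, standard error in brackets

# For the Poisson, the maximum likelihood estimate is simply the sample mean
c(estimate = unname(fit_full$estimate), sample_mean = mean(a))
```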
But the mean value of `b` is badly biased upwards if used to estimate `lambda`; at 2.6 the mean of `b` is roughly twice the correct value of the mean of the underlying distribution. Obviously, removing a whole bunch of data at one end of the distribution is going to make naive estimation methods biased. So we need specialist methods that try to estimate lambda on the assumption that the data come from a Poisson distribution, but only the right-most part of it.
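The size of that bias can be checked against theory. For a Poisson variable X, the mean conditional on X being at least 2 is lambda * (1 - exp(-lambda)) / P(X >= 2). A small sketch of that arithmetic for the true value of 1.3:

```r
lambda <- 1.3

# Probability that a count is two or more, i.e. that it gets observed at all
p_observed <- ppois(1, lambda, lower.tail = FALSE)

# Expected value of the counts that survive truncation:
# E[X | X >= 2] = (E[X] - 1 * P(X = 1)) / P(X >= 2)
truncated_mean <- lambda * (1 - exp(-lambda)) / p_observed

round(c(p_observed = p_observed, truncated_mean = truncated_mean), 3)
# about 0.373 and 2.534
```

So only a bit over a third of the observations are expected to survive, and their mean is expected to be about 2.5, in line with what the simulated data show.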
The `fitdistrplus` package by Aurélie Siberchicot, Marie Laure Delignette-Muller and Christophe Dutang in combination with `truncdist` by Frederick Novomestky and Saralees Nadarajah gives a straightforward way to implement maximum likelihood estimation of a truncated distribution. Methods other than maximum likelihood are also available if required.
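To make the idea concrete, here is a hand-rolled sketch in base R of maximum likelihood for a Poisson truncated below at 2. It is not the code of either package, just the same principle written out: each observation's density is divided by the probability of being observed at all. The seed and the helper name `neg_loglik` are assumptions for illustration.

```r
set.seed(321)                      # arbitrary seed, an assumption
a <- rpois(1000, lambda = 1.3)
b <- a[a > 1]                      # only counts of two or more are seen

# Negative log-likelihood of a Poisson conditional on X >= 2:
# log f(x) - log P(X >= 2), summed over the observed counts
neg_loglik <- function(lambda, x) {
  log_p_observed <- ppois(1, lambda, lower.tail = FALSE, log.p = TRUE)
  -sum(dpois(x, lambda, log = TRUE) - log_p_observed)
}

# One-dimensional search for the lambda that minimises it
opt <- optimize(neg_loglik, interval = c(0.01, 10), x = b)
opt$minimum
```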
You need to make truncated versions of the `dpois` and `ppois` functions (or their equivalents for whatever distribution you are modelling) and use these within `fitdistrplus::fitdist`, which has some added functionality over `MASS::fitdistr` used in the previous chunk of code.
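A sketch of how that can look, assuming both packages are installed. The wrapper names `dtruncated_poisson` and `ptruncated_poisson`, the seed and the starting value are my choices for illustration; `fitdist` finds the wrappers by pasting `d` and `p` in front of the distribution name it is given.

```r
library(fitdistrplus)
library(truncdist)

set.seed(321)                      # arbitrary seed, an assumption
a <- rpois(1000, lambda = 1.3)
b <- a[a > 1]

# Truncated versions of dpois and ppois, cut off below at 1.5
dtruncated_poisson <- function(x, lambda) {
  dtrunc(x, "pois", a = 1.5, b = Inf, lambda = lambda)
}
ptruncated_poisson <- function(q, lambda) {
  ptrunc(q, "pois", a = 1.5, b = Inf, lambda = lambda)
}

# Maximum likelihood fit of the truncated distribution to the truncated data
fit_trunc <- fitdist(b, "truncated_poisson", start = list(lambda = 0.5))
summary(fit_trunc)
```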
Note that to do this I specified the lower threshold as 1.5; as all the data are integers this effectively means we only see observations of 2 or more, as is the case. We also needed to specify a reasonable starting value for the estimate of `lambda` - getting this too far out will lead to errors.
This method gives us an estimate of 1.34 with a standard error of 0.08, which is pretty good given we’ve only got 398 observations now. Of course, we’ve got the luxury of knowing for sure the true data generating process was Poisson.
For an alternative Bayesian method, Stan makes it easy to describe data and probability distributions as truncated. The Stan manual has an entire chapter on truncated or censored data. Here’s an example Stan program to estimate the mean of the original Poisson distribution from our truncated data. As well as the original data, which I call `x` in this program, we need to tell it how many observations there are, the `lower_limit` that it was truncated by, and whatever is needed to characterise a prior distribution for the parameter we’re estimating.
The key bits of the program below are:
- In the `data` chunk, specify that the data for `x` has a lower limit of `lower_limit`
- In the `model` chunk, specify that the distribution of `x` is truncated, using Stan’s `T[ , ]` truncation syntax
With a little more effort it’s possible to extend this by making Stan estimate `lower_limit` from the data; not necessary in my hypothetical example because I know where the minimum cut-off point of observed data lies.
Here’s how the data are fed to Stan from R:
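A minimal sketch of both pieces together, with the Stan program held in an R string and passed to `rstan::stan`. The prior (a normal, cut off at zero by the bound on `lambda`), the names `prior_mu` and `prior_sigma`, their values and the seed are all assumptions for illustration rather than the settings behind the results quoted below.

```r
set.seed(321)                      # arbitrary seed, an assumption
a <- rpois(1000, lambda = 1.3)
b <- a[a > 1]

stan_program <- "
data {
  int<lower=0> n;
  int lower_limit;
  // Stan 2.33 and later needs: array[n] int<lower=lower_limit> x;
  int<lower=lower_limit> x[n];
  real prior_mu;
  real<lower=0> prior_sigma;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ normal(prior_mu, prior_sigma);
  for (i in 1:n) {
    x[i] ~ poisson(lambda) T[lower_limit, ];   // truncated below
  }
}
"

# Everything named in the data chunk has to be supplied in this list
stan_data <- list(
  x = b,
  n = length(b),
  lower_limit = 2L,
  prior_mu = 1,       # hypothetical prior settings
  prior_sigma = 5
)

# Only fit if rstan is installed; compiling the model takes a minute or so
if (requireNamespace("rstan", quietly = TRUE)) {
  fit_stan <- rstan::stan(model_code = stan_program, data = stan_data,
                          chains = 4, iter = 2000, seed = 321)
  print(fit_stan, pars = "lambda")
  print(rstan::stan_plot(fit_stan, pars = "lambda"))
}
```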
This gives us a posterior distribution for `lambda` that matches that from the `fitdistrplus` method: 1.35 with a standard deviation of 0.08. The `rstan` package automatically turns this into a `ggplot2` image of a credibility interval:
So, nice. Two simple ways to estimate the original distribution from truncated data.