**Peter's stats stuff - R**, and kindly contributed to R-bloggers)

## Data that’s been only partially observed

I’ve been updating my skills in fitting models to truncated data and was pleased to find that, like so much else in statistics, it’s easier than it used to be.

First, some definitions:

*censored*data is where some observations are cut off at some maximum or minimum level; those data points are “off the scale” but at least we know they exist and we know which direction they are off the scale. For example, if we were analysing the life span of people born in 1980, anyone who has survived to the end of 2017 has their age at death recorded as “37 or higher”. We know they’re in the data, and their value is at least some minimum amount, but we don’t know with precision what it will end up being.*truncated*data is where data beyond some maximum or minimum level is just missing. Typically this is because of some feature of the measurement process eg anything smaller than X just doesn’t show up.

I’ve got some future blog posts on a more substantive real life issue where I have count data for which, in some situations, I only see the observations with a count higher than some threshold. Let’s imagine, for example, we are looking at deaths per vehicle crash, and are dependent for measurement on newspapers that only report crashes with at least two deaths, even though many crashes have one or zero deaths.

Here’s a greatly simplified example. I generate 1,000 observations of counts, with an average value of 1.3. Then I compare that original distribution with what I’d get if only those of two or higher were observed. It looks like this:

…generated by this code:

Estimating the key parameter `lambda`

for the full data (`a`

) works well, giving an estimate of 1.347 that is just over one standard error from the true value of 1.3. The `fitdistr`

function from the `MASS`

package distributed with base R does a nice job in such circumstances.

But the mean value of `b`

is badly biased upwards if used to estimate `lambda`

; at 2.6 the mean of `b`

is roughly twice the correct value of the mean of the underlying distribution. Obviously, removing a whole bunch of data at one end of the distribution is going to make naive estimation methods biased. So we need specialist methods that try to estimate lambda on the assumption that the data come from a Poisson distribution, but only the right-most part of it.

## Maximum likelihood

The `fitdistrplus`

package by Aurélie Siberchicot, Marie Laure Delignette-Muller and Christophe Dutang in combination with `truncdist`

by Frederick Novomestky and Saralees Nadarajah gives a straightforward way to implement maximum likelihood estimation of a truncated distribution. Methods other than maximum likelihood are also available if required.

You need to make truncated versions of the `dpois`

and `ppois`

functions (or their equivalents for whatever distribution you are modelling) and use these within `fitdistrplus::fitdist`

, which has some added functionality over `MASS::fitdistr`

used in the previous chunk of code.

Note that to do this I specified the lower threshold as 1.5; as all the data are integers this effectively means we only observe the observations of 2 or more, as is the case. We also needed to specify a reasonably starting value for the estimate of `lambda`

– getting this too far out will lead to errors.

This method gives us an estimate of 1.34 with a standard error of 0.08, which is pretty good given we’ve only got 398 observations now. Of course, we’ve got the luxury of knowing for sure the true data generating process was Poisson.

## Bayesian

For an alternative Bayesian method, Stan makes it easy to describe data and probability distributions as truncated. The Stan manual has an entire chapter on truncated or censored data. Here’s an example Stan program to estimate the mean of the original Poisson distribution from our truncated data. As well as the original data, which I call `x`

in this program, we need to tell it how many observations (`n`

), the `lower_limit`

that it was truncated by, and whatever is needed to characterise a prior distribution for the parameter we’re estimating.

The key bits of the program below are:

- In the
`data`

chunk, specify that the data for`x`

has a lower limit of`lower_limit`

- In the
`model`

chunk, specify that distribution of`x`

is truncated via`T[lower_limit, ]`

With a little more effort it’s possible to extend this by making Stan estimate `lower_limit`

from the data; not necessary in my hypothetical example because I know where the minimum cut-off point of observed data lies.

Here’s how the data are fed to Stan from R:

This gives us a posterior distribution for `lambda`

that matches that from the `fitdistrplus`

method: 1.35 with a standard deviation of 0.08. The `rstan`

package automatically turns this into a `ggplot2`

image of a credibility interval:

So, nice. Two simple ways to estimate the original distribution from truncated data.

**leave a comment**for the author, please follow the link and comment on their blog:

**Peter's stats stuff - R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...