Data are censored when samples from one or both ends of a continuous distribution are cut off and assigned a single value. Environmental chemical exposure datasets are often left-censored: instruments can detect levels of chemicals down to a limit, below which you can’t say for sure how much of the chemical was present. The chemical may not have been present at all, or it may have been present at a level too low for the instrument to pick up.
There are a number of methods for dealing with censored data in statistics. The crudest approach is to substitute a constant for every censored value: zero, the detection limit, the detection limit divided by the square root of two, or some other value. This method, however, biases the statistical results, and the size and direction of the bias depend on what you choose to substitute.
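As a quick illustration of that bias, consider lognormally distributed data censored at a detection limit (the same setup used later in this post). The snippet below is a sketch; the seed and variable names are my own, and the exact numbers depend on the random draw:

```r
set.seed(1)  # assumed seed, for reproducibility
L <- 50                                      # detection limit
y <- rlnorm(10000, meanlog = 5, sdlog = 1)   # true, uncensored values
cens <- y < L                                # which points fall below the limit

mean(ifelse(cens, 0, y))           # substitute zero: mean biased low
mean(ifelse(cens, L, y))           # substitute the limit: mean biased high
mean(ifelse(cens, L / sqrt(2), y)) # a common compromise, still biased
mean(y)                            # true mean, for comparison
```

Each substitution rule lands on a different estimate of the mean, and none is guaranteed to match the truth.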
A more nuanced alternative is to assume the data follow a particular distribution. In the case of environmental data, it often makes sense to assume a lognormal distribution. When we do this, while we may not know the precise level of each nondetect, we can at least assume that the nondetects follow a lognormal distribution.
In this article, we will walk through fitting a lognormal distribution to a censored dataset using the method of maximum likelihood.
Fitting a censored lognormal distribution
First, we need to generate a contrived dataset. Here we take 10000 samples from a lognormal distribution with a log-mean of 5 and a log-standard deviation of 1; call those values \( y \). Any value below \( L = 50 \) will be censored. Let \( y^{\ast} \) be the censored dataset in which every censored point is fixed to the censoring limit. Formally, \( y^{\ast} \) is defined as:

\[ y_i^{\ast} = \begin{cases} y_i & \text{if } y_i \geq L \\ L & \text{if } y_i < L \end{cases} \]
Here we generate a dataset in R:
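One way this dataset could be generated (the seed is an assumption, so the exact detection frequency will differ slightly between draws):

```r
set.seed(1)  # assumed seed, for reproducibility
n <- 10000
L <- 50  # censoring (detection) limit
y <- rlnorm(n, meanlog = 5, sdlog = 1)  # true, uncensored values
censored <- y < L                       # TRUE where below the limit
y_star <- ifelse(censored, L, y)        # censored points fixed to the limit
mean(!censored)  # detection frequency: roughly 0.87 for this distribution
hist(log(y_star), breaks = 50, xlab = "log(y*)", main = "Censored samples")
```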
The spike on the left represents the censored values. The detection frequency, or the proportion of detected values, is 0.8669.
First, let’s do a better job of defining the problem we’re trying to solve. We set out to fit a lognormal distribution to some censored data. Our fundamental problem is that we want to find the right parameters (the mean and standard deviation) for a lognormal distribution that “fits” the observed data. But how do you tell what the “right” parameters are, or what the best “fit” is? The maximum likelihood approach is to say “the best distribution is the one that has the highest likelihood of generating the data we’ve observed.” To actually find those parameters, you use an optimization algorithm that tries many different parameter values and homes in on the set that defines the distribution with the highest likelihood of generating the observed data.^{1}
Let’s develop some intuition about how we are going to handle this censored data. With censored data, we don’t know the exact value of each datapoint that is below a censoring limit. The true value could be anywhere below the limit. To deal with this, we assume that the entire dataset is lognormally distributed, so even if we don’t know the exact value of the censored data, we know something about how those points are distributed.
Specifically, let \( P(x \vert \mu, \sigma) \) and \( D(x \vert \mu, \sigma) \) be the lognormal probability density function (PDF) and cumulative distribution function (CDF), respectively, taking as parameters a mean \( \mu \) and standard deviation \( \sigma \). We seek to find values for \( \mu \) and \( \sigma \) that maximize the likelihood function \( L(y^{\ast} \vert \mu, \sigma) \). A key assumption is that the measurements in \( y^{\ast} \) are independent.
If a point is above the censoring limit, its likelihood is given by the lognormal PDF. If it is below the censoring limit, we don’t know exactly what the real value was; all we know is that it’s less than \( L \). As such, its likelihood is the cumulative probability of a value falling below \( L \) in a lognormal distribution with mean \( \mu \) and standard deviation \( \sigma \).
The total likelihood is found by multiplying all the individual likelihoods.^{2} Formally:

\[ L(y^{\ast} \vert \mu, \sigma) = \prod_{i \,:\, y_i^{\ast} > L} P(y_i^{\ast} \vert \mu, \sigma) \prod_{i \,:\, y_i^{\ast} = L} D(L \vert \mu, \sigma) \]
It’s often more convenient, especially for computers, to work with the log-likelihood function, which we get by taking the log of both sides:

\[ \log L(y^{\ast} \vert \mu, \sigma) = \sum_{i \,:\, y_i^{\ast} > L} \log P(y_i^{\ast} \vert \mu, \sigma) + \sum_{i \,:\, y_i^{\ast} = L} \log D(L \vert \mu, \sigma) \]
The R translation uses the built-in dlnorm and plnorm functions for the lognormal PDF and CDF, choosing between them based on whether each data point is censored. It also checks that the parameters make sense: a negative standard deviation is invalid, so the function rejects that parameter set.
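A sketch of such a log-likelihood function, written to return a vector of per-observation contributions as the maxLik package accepts (variable and function names are my own; the data setup is repeated so the snippet runs on its own):

```r
# Setup, repeated from the data-generation sketch (seed is assumed)
set.seed(1)
L <- 50
y <- rlnorm(10000, meanlog = 5, sdlog = 1)
censored <- y < L
y_star <- ifelse(censored, L, y)

# Per-observation log-likelihood of the censored lognormal model
loglik <- function(params) {
  mu <- params[1]
  sigma <- params[2]
  if (sigma <= 0) return(NA)  # reject invalid parameter sets
  ifelse(censored,
         # nondetect: cumulative probability of falling below the limit
         plnorm(L, meanlog = mu, sdlog = sigma, log.p = TRUE),
         # detect: lognormal density at the observed value
         dlnorm(y_star, meanlog = mu, sdlog = sigma, log = TRUE))
}
```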
We use the maxLik function from the maxLik^{3} package to perform the estimation. We need to supply starting values for the parameters, so let’s have it start with a mean of 0 and standard deviation of 1.
The fitted distribution gets pretty close to recovering the correct mean and standard deviation. Censored data loses information, so in general you shouldn’t expect to get perfect estimates here.
Let’s plot the original, uncensored data \( y \) against the distribution fitted to the censored data to see how they compare visually:
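One way to draw this comparison (again a sketch; the fit is recomputed from the earlier snippets so the block runs on its own):

```r
library(maxLik)

# Setup and fit repeated from the earlier sketches (seed is assumed)
set.seed(1)
L <- 50
y <- rlnorm(10000, meanlog = 5, sdlog = 1)
censored <- y < L
y_star <- ifelse(censored, L, y)
loglik <- function(params) {
  mu <- params[1]; sigma <- params[2]
  if (sigma <= 0) return(NA)
  ifelse(censored,
         plnorm(L, meanlog = mu, sdlog = sigma, log.p = TRUE),
         dlnorm(y_star, meanlog = mu, sdlog = sigma, log = TRUE))
}
fit <- maxLik(loglik, start = c(mu = 0, sigma = 1))

# Histogram of the uncensored data with the fitted density overlaid
hist(y, breaks = 100, freq = FALSE,
     main = "Uncensored data vs. fitted lognormal", xlab = "y")
curve(dlnorm(x, meanlog = coef(fit)["mu"], sdlog = coef(fit)["sigma"]),
      add = TRUE, col = "red", lwd = 2)
```

If the fit is good, the red curve should track the shape of the histogram closely, even though the curve was estimated from the censored data alone.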
The core idea with this overall approach is that we assume something about the distribution of the censored data points so that we can get more nuanced parameter estimates than if we replaced them with a constant, like zero or the detection limit.
Future posts will go into regression techniques with censored datasets.

In “generative” statistical modelling, you assume that the data you have was generated by some definable process. The task of the modeller is to write down how they think the data was generated; this is your “model” of how the world works. Everything that follows (parameter fitting, inference) is based on this core assumption. In our case, we assume the data was generated from a lognormal distribution; that’s our “model” of how the data came to be. As it happens, this is exactly where the data came from! ↩

This comes from the assumption that each measurement is independent. Recall that the probability of two independent events occurring is the product of the individual probabilities: \( P(A \, \mathrm{and} \, B) = P(A) \cdot P(B) \). ↩

Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443–458. DOI 10.1007/s00180-010-0217-1. ↩