[This article was first published on Back Side Smack » R Stuff, and kindly contributed to R-bloggers].

Here’s a post generated from my own ignorance of statistics (as opposed to just being marred by it)! In Labor Economics we walked through something called the truncated normal distribution. Truncated distributions come up a lot in the sciences because you may have a sample from a large population which is normally distributed, but the sample itself is selected only from a certain range. If you have a sample of college students, you shouldn’t expect them to reflect the population of 18-24 year olds, simply because some 18 year olds choose not to attend college. Or if you are sampling apples at the grocery store, you are looking at a non-randomly selected subset of all apples in the world because the tiny ones get turned into applesauce. The problems are (to use a graduate school phrase) non-trivial. As we will see below, if you treat a truncated distribution as a well-behaved normal distribution you will get the mean wrong and underestimate the variance.
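A quick way to see the problem is to simulate it. The sketch below uses made-up values: `mu`, `sigma`, the cutoff `a`, and the sample size are all assumptions for illustration, not numbers from this post. It draws a normal "population", keeps only the observations above the cutoff, and compares the naive mean and variance of the two.

```r
set.seed(42)

# Assumed illustration values (not taken from the post)
mu    <- 0
sigma <- 4
a     <- 1   # truncation point: we only observe x > a

population <- rnorm(1e5, mean = mu, sd = sigma)
truncated  <- population[population > a]

# Naive estimates from the truncated sample versus the full population
c(mean_full = mean(population), mean_trunc = mean(truncated))
c(var_full  = var(population),  var_trunc  = var(truncated))
```

With truncation from below, the truncated mean should sit well above the population mean and the truncated variance well below the population variance.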

But I’m actually interested in a much narrower issue. There are ways around dealing with truncated distributions and they begin with estimating just how much of the distribution was cut off. If we know certain things about the original distribution and we know where the truncation point was, we can compute what the new mean and variance ought to be. What do we mean by certain things? First we want a scale-free measure of the truncation point.

• $\alpha = (a-\mu)/\sigma$

Where $a$ is the actual truncation point and $\mu$, $\sigma$ are the mean and standard deviation of the normal distribution. $\alpha$ then becomes the scaled (standardized, if you will) truncation point. Once we have a scale-free truncation point we can begin to work at an estimate of how much of the distribution has been cut off. In order to do this we need to compute something called the inverse Mills ratio. The inverse Mills ratio is commonly but not exclusively associated with truncated distributions. We are going to follow Heckman and denote the inverse Mills ratio with $\lambda(\alpha)$. For our one-sided truncation:

• $\lambda(\alpha) = \phi(\alpha)/[1-\Phi(\alpha)]$
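In R, the two quantities defined so far can be sketched in a few lines. This assumes the pdf and CDF in the ratio are those of the standard normal, which is what `dnorm()` and `pnorm()` return by default. The helper name `inv_mills` and the parameter values are hypothetical, chosen only for illustration.

```r
# Inverse Mills ratio for truncation from below at the standardized point alpha.
# dnorm() and pnorm() default to mean = 0, sd = 1 (the standard normal).
# lower.tail = FALSE returns 1 - Phi(alpha) directly.
inv_mills <- function(alpha) {
  dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)
}

# Assumed illustration values (not taken from the post)
mu    <- 0
sigma <- 4
a     <- 1

alpha  <- (a - mu) / sigma   # standardized truncation point
lambda <- inv_mills(alpha)   # inverse Mills ratio at alpha
```

Using `lower.tail = FALSE` rather than `1 - pnorm(alpha)` is a small numerical nicety: it avoids losing precision when the truncation point sits far out in the upper tail.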

Where $\phi(\alpha)$ and $\Phi(\alpha)$ are the pdf and CDF of the normal distribution, respectively. Compute this guy and the expected value of the truncated distribution is just one short step away. Specifically,

• $E[x \mid x > a] = \mu + \sigma\lambda(\alpha)$

So I cheated. Specifically, I bootstrapped a hundred normally distributed samples and compared the computed conditional mean to the actual conditional mean. What did this look like in practice? It’s actually kind of pretty:

Because I created each sample and each truncated subset of those samples, I know their means. I also know the parameters of the normal distribution used to generate the sample. So I can test my theory. Are the $\phi(\alpha)$ and $\Phi(\alpha)$ functions supposed to use the original standard deviation or a standard deviation of 1? Let’s see.

What we see above are all of the expected and actual means of the truncated distribution. The two distributions centered around 2 are the true means and the conditional mean computed with a pdf assuming a standard deviation of 1. The distribution lagging around 0.5 was created assuming a standard deviation of 4 for the inverse Mills ratio. We have a clear winner!
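A comparison along these lines can be sketched as follows. This is a minimal reconstruction under stated assumptions, not necessarily the code used for the plot: the parameters (`mu`, `sigma`, `a`), the per-sample size of 1000, and the helper name `cond_mean` are all made up for illustration. It relies on the standard result that, for truncation from below, the conditional mean is $\mu + \sigma\lambda(\alpha)$, and it computes $\lambda(\alpha)$ twice: once with a standard deviation of 1 and once with the original standard deviation.

```r
set.seed(1)

# Assumed illustration values (not taken from the post)
mu    <- 0
sigma <- 4
a     <- 1
alpha <- (a - mu) / sigma

# Conditional mean E[x | x > a] = mu + sigma * lambda(alpha), where the
# inverse Mills ratio uses a normal pdf/CDF with standard deviation `s`
cond_mean <- function(s) {
  lambda <- dnorm(alpha, sd = s) / pnorm(alpha, sd = s, lower.tail = FALSE)
  mu + sigma * lambda
}

# Draw 100 normal samples and record the mean of the truncated part of each
actual <- replicate(100, {
  x <- rnorm(1000, mean = mu, sd = sigma)
  mean(x[x > a])
})

c(actual = mean(actual), sd_1 = cond_mean(1), sd_sigma = cond_mean(sigma))
```

With these assumed values, the `sd_1` figure should land close to the simulated average while the `sd_sigma` figure falls far short of it, which is the same pattern the plot shows.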

Knowing the method to calculate a truncated distribution at first seems like quite a feat. But there are immediate practical problems to estimating truncated samples in the wild. In my bootstrapping example above I already knew the parameters of the data generating process for the population. In the real world we often don’t observe all of the features of the population. Perhaps for college students versus young adults we can rely on cross sectional surveys but for other examples we have no such out. Imagine attempting to estimate a labor supply problem (a la Heckman). People might respond to wage changes by choosing to work more or less, but how do you deal with people who choose to work 0 hours? Students, retirees, spouses, hippies, they all work 0 hours, but assuming that their decision to work 0 hours represents a choice of 0 given some wage rather than a choice not to work given some wage would be the same as assuming that all the apples in the grocery store are randomly selected on the basis of size. Those people are staying out of the labor force because (in a very reductionist manner) their reservation wages are higher than the offered wage, but you have no idea what that reservation wage actually is. So you don’t know the true mean or the truncation point. There are a few methods to recover an estimate of the reservation wage and therefore compute the characteristics of the truncated distribution of earnings but they all are more complex than a few lines of code linking the parameters of a distribution to an answer. But if it were easy they wouldn’t call it baseball.

R code is below: