Generating data from a truncated distribution

[This article was first published on ouR data generation, and kindly contributed to R-bloggers.]

A researcher reached out to me the other day to see if the simstudy package provides a quick and easy way to generate data from a truncated distribution. Other than the noZeroPoisson distribution option (which is a very specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy. I’ve actually had it in the back of my mind to add something like this to simstudy, but have hesitated because it might mean changing (or at least adding to) the defData table structure.

However, it may be time to go for it. The process and coding are actually relatively straightforward, so there is no real reason not to. I have developed a simple prototype for several probability distributions (though the concept can easily be applied to any distribution where the cumulative distribution function, or CDF, is readily accessible), and am sharing it here in case you need to do this before it is available in the package, or if you just want to implement it yourself.

What is a truncated distribution?

A truncated probability distribution is one derived from limiting the domain of an existing distribution. A picture is worth a thousand words. On the left, we have a histogram for 10,000 observations drawn from a full (non-truncated) Gaussian or normal distribution with mean 0 and standard deviation 3. In the middle, the histogram represents data drawn from the positive portion of the same distribution (i.e., it is truncated on the left at 0). And on the far right, the truncation is defined by the boundaries \((-3, 3.5)\):

Leveraging the uniform distribution and a CDF

A while back, I described a copula approach to generating correlated data from different distributions (ultimately implemented in functions genCorGen and addCorGen). I wrote about combining a draw from a uniform distribution with the CDF of any target distribution to facilitate random number generation from the target distribution. This approach also works well for truncated distributions, where the truncated distribution is the target.

Again – visuals help to explain how this works. To start, here are several CDFs of normal distributions with different means and variances:

The CDF of a distribution (usually written as \(F(x)\)) effectively defines that distribution: \(F(x) = P(X \le x)\). Since probabilities by definition range from \(0\) to \(1\), we know that \(F(x)\) also ranges from \(0\) to \(1\). It is also the case that \(F(x)\) is monotonically increasing (or at least non-decreasing) from \(0\) to \(1\).

Let’s say we want to generate a draw from \(N(\mu =0, \sigma = 3)\) using the CDF. We can first generate a draw \(u\) from \(Uniform(0,1)\). We then treat \(u\) as a value of the CDF, and map it back to \(x\) to get our draw from the target distribution. So, \(x = F^{-1}(u)\). In R, the inverse CDF (also called the quantile function) for the normal distribution is available as the qnorm function, where the first argument is a probability value between \(0\) and \(1\). This would be the R code to generate a single draw from \(N(0, 3)\) using a random draw from \(Uniform(0, 1)\):

(u <- runif(1))

## [1] 0.9

qnorm(u, mean = 0, sd = 3)

## [1] 3.9

This is how \(u = 0.9\) relates to the draw of \(x=3.9\):

To generate a random sample of 10,000 draws from \(N(0, 3)\), this process is replicated 10,000 times:

library(ggplot2)

u <- runif(10000)
x <- qnorm(u, mean = 0, sd = 3)

ggplot(data = data.frame(x), aes(x = x)) +
  geom_histogram(fill = "#CCC591", alpha = 1, binwidth = .2, boundary = 0) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank())
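As a quick sanity check on the inverse-CDF idea, the sketch below compares the sample moments and the empirical CDF of the draws with the target \(N(0, 3)\). The seed is arbitrary and only there for reproducibility; none of this is needed for the method itself.

```r
set.seed(2020)  # arbitrary seed, for reproducibility only

n <- 10000
u <- runif(n)
x <- qnorm(u, mean = 0, sd = 3)

# Sample mean and sd should be close to the target values 0 and 3
c(mean = mean(x), sd = sd(x))

# Largest gap between the empirical CDF of the draws and the target CDF;
# this should be small and shrink as n grows
max_gap <- max(abs(ecdf(x)(x) - pnorm(x, mean = 0, sd = 3)))
max_gap
```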

Extending the inverse process to generate truncation

Let’s say we are only interested in generating data from the middle portion of the \(N(0,3)\) distribution, between \(a\) and \(b\). The trick is to use the corresponding CDF values, \(F(a)\) and \(F(b)\) as the basis of the randomization.

To generate data within the constraints \(a\) and \(b\), all we would need to do is generate a value from the uniform distribution with minimum equal to \(F(a)\) and maximum \(F(b)\). We then conduct the mapping as we did before when drawing from the full distribution. By constraining \(u\) to be between \(F(a)\) and \(F(b)\), we force the values of the target distribution to lie between \(a\) and \(b\).
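To see why this works, here is a short derivation, assuming \(F\) is continuous and strictly increasing so that \(F^{-1}\) is well defined. Let \(U \sim Uniform(F(a), F(b))\) and \(X = F^{-1}(U)\). Then, for any \(x\) with \(a \le x \le b\):

\[
P(X \le x) = P\left(F^{-1}(U) \le x\right) = P\left(U \le F(x)\right) = \frac{F(x) - F(a)}{F(b) - F(a)}
\]

The right-hand side is exactly the CDF of the original distribution conditional on falling between \(a\) and \(b\), which is the definition of the truncated distribution.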

Now we are ready to create a simple function, rnormt, that implements this. The pnorm function provides the CDF at a particular value, and qnorm maps the uniform draws back to the scale of the target distribution:

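rnormt <- function(n, range, mu, s = 1) {
  
  # range is a vector of two values: the lower and upper truncation points
  
  # CDF values at the truncation points
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  
  # uniform draws restricted to the interval between F.a and F.b
  u <- runif(n, min = F.a, max = F.b)
  
  # map the draws back to the scale of the target distribution
  qnorm(u, mean = mu, sd = s)
  
}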
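One way to check that rnormt behaves as intended is to confirm that every draw lands inside the truncation range and that the sample mean is close to the known mean of a truncated normal, \(\mu + \sigma \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}\) with \(\alpha = (a - \mu)/\sigma\) and \(\beta = (b - \mu)/\sigma\). The sketch below repeats the function so it runs on its own; the seed and the variable names are my own choices.

```r
# Copy of rnormt so that this block is self-contained
rnormt <- function(n, range, mu, s = 1) {
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
}

set.seed(135)  # arbitrary seed, for reproducibility only

a <- -3; b <- 3.5; mu <- 0; s <- 3
x <- rnormt(10000, c(a, b), mu = mu, s = s)

# All draws should fall between a and b
range(x)

# Theoretical mean of a normal truncated to (a, b)
alpha <- (a - mu) / s
beta  <- (b - mu) / s
mean_theory <- mu + s * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))

c(simulated = mean(x), theoretical = mean_theory)
```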

Here, I am generating the data plotted above, showing the code this time around.

library(data.table)
library(simstudy)
library(paletteer)

defC <- defCondition(condition= "tt == 1", 
                     formula = "rnormt(10000, c(-Inf, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 2", 
                     formula = "rnormt(10000, c(0, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 3", 
                     formula = "rnormt(10000, c(-3, 3.5), mu = 0, s = 3)")

dd <- genData(30000)
dd <- trtAssign(dd, nTrt = 3, grpName = "tt")
dd <- addCondition(defC, dd, "x")

dd[, tt := factor(tt, 
     labels = c("No truncation", "Left truncation at 0", "Left and right truncation"))]

ggplot(data = dd, aes(x = x, group = tt)) +
  geom_histogram(aes(fill = tt), alpha = 1, binwidth = .2, boundary = 0) +
  facet_grid(~tt) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank(),
        legend.position = "none") +
  scale_fill_paletteer_d("wesanderson::Moonrise2")

Going beyond the normal distribution

With this simple approach, it is possible to generate a truncated distribution from any distribution for which R provides both the CDF (the p function) and the quantile function (the q function). Here is another example that allows us to generate truncated data from a gamma distribution:

rgammat <- function(n, range, shape, scale = 1) {
  
  F.a <- pgamma(min(range), shape = shape, scale = scale)
  F.b <- pgamma(max(range), shape = shape, scale = scale)
  
  u <- runif(n, min = F.a, max = F.b)

  qgamma(u, shape = shape, scale = scale)

}
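Since rnormt and rgammat differ only in which p and q functions they call, the pattern can be sketched once and reused. The helper below, rtrunc_cdf, is a hypothetical name of my own (it is not part of simstudy or base R); it assumes a continuous distribution and simply passes any extra arguments through to both functions.

```r
# Hypothetical generic helper (not from simstudy or base R):
# pfun is the CDF, qfun is the matching quantile function,
# and ... holds the distribution parameters shared by both
rtrunc_cdf <- function(n, range, pfun, qfun, ...) {
  F.a <- pfun(min(range), ...)
  F.b <- pfun(max(range), ...)
  u <- runif(n, min = F.a, max = F.b)
  qfun(u, ...)
}

set.seed(246)  # arbitrary seed, for reproducibility only

# Beta(2, 5) truncated to (0.2, 0.6)
x_beta <- rtrunc_cdf(1000, c(0.2, 0.6), pbeta, qbeta, shape1 = 2, shape2 = 5)

# Exponential with rate 0.5, truncated on the left at 1
x_exp <- rtrunc_cdf(1000, c(1, Inf), pexp, qexp, rate = 0.5)

range(x_beta)
range(x_exp)
```

The parameters are passed by name so that they reach pfun and qfun unchanged; a distribution whose p and q functions take different arguments would need its own wrapper.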

To conclude, here is a plot of gamma-based distributions using rgammat. And I’ve added similar plots for beta and Poisson distributions – I’ll leave it to you to write the functions. But, if you don’t want to do that, simstudy will be updated at some point soon to help you out.
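For anyone who wants a starting point on the Poisson case, here is one possible sketch; it is my own guess at a reasonable version, not the code behind the plots. Because the Poisson is discrete, the lower bound needs a small adjustment: to keep the value \(a\) itself in the range, the uniform draw starts at \(F(a - 1) = P(X < a)\) rather than at \(F(a)\). The name rpoist just follows the naming pattern used above, and both bounds are treated as inclusive.

```r
# Hypothetical helper: Poisson truncated to the integers a, ..., b (inclusive)
rpoist <- function(n, range, lambda) {
  a <- min(range)
  b <- max(range)
  F.a <- ppois(a - 1, lambda = lambda)  # P(X < a), so that a itself can be drawn
  F.b <- ppois(b, lambda = lambda)      # P(X <= b)
  u <- runif(n, min = F.a, max = F.b)
  qpois(u, lambda = lambda)
}

set.seed(357)  # arbitrary seed, for reproducibility only

x <- rpoist(10000, c(1, 6), lambda = 3)

# Observed proportions next to the renormalized Poisson probabilities
obs <- as.numeric(prop.table(table(factor(x, levels = 1:6))))
expected <- dpois(1:6, lambda = 3) / sum(dpois(1:6, lambda = 3))
round(rbind(obs, expected), 3)
```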
