Site icon R-bloggers

Making random draws from an arbitrarily defined pdf

[This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of Rs fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with this excellent post was that the machine cost was enormous with samples of 1000 draws taking 5 seconds on my machine with repeated samples of 100,000+ draws (which I was after) clearly being unworkable.

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions
1. You can draw pseudo-random uniform variable u
2. You can integrate the pdf to construct a cdf
$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$
3. You can invert the cdf in order to solve for p
$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value.  The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.
Three random sampling procedures for the random normal.
On the graph, Black type=0, is rnorm. Dark blue type=1, is ricdf. Light blue, type=2 is samplepdf. Redrawing the graph several times, visually I could not tell the difference between the three methods.

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange ‘proportional’ pdfs.  I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf


And sample from it.


I also define a multimodal  pdf



And sample from it.

Find the R code below or the gist on github.

gicdf <- function(fun, 
                 min=-3.5, 
                 max=3.5, 
                 bins=1000, 
                 nqratio=10, 
                 grouping=mean, 
                 ...) {
  # Generate an inverse CDF of an arbitrary function
  fun <- match.fun(fun)
  grouping <- match.fun(grouping)
 
  # Number of points to draw
  nq=nqratio*bins
 
  # Draw indexes
  qdraw <- seq(min, max,length.out=nq)
 
  # Calculate proportional probability of each draw
  pdraw <- fun(qdraw,...)
 
  # Rescale probability sum to 1, rescale
  pdraw <- pdraw/sum(pdraw)
 
  # Calculate the cumulative probability at each qdraw
  cpdraw <- cumsum(pdraw)
 
  # Caculate the cumulative probability at each bin
  pbin <- (1:bins)/bins
  xbin <- NA*(1:bins)
 
  for (i in 1:bins) {
    xbin[i] <- grouping(qdraw[cpdraw<pbin[i]&cpdraw>0], na.rm = TRUE)
    cpdraw[cpdraw<pbin[i]] <- 2
  }
 
  (draw.set <- list(digits=floor(log10(bins)), xbin=xbin, pbin=pbin))
}
 
# Draw from acdf
ricdf <- function(N, draw.set) {
  digits <- draw.set$digits
  pdraws <- ceiling(runif(N)*10^digits)/10^digits
  draw.set$xbin[match(pdraws,draw.set$pbin)]
}
 
# Define an arbitrary pdf f: to compare lets start with normal pdf
f <- function(x) dnorm(x)
 
library(ggplot2)
 
# set the number to sample
nsamples <- 1000
 
# Draw from rnorm
sample0 <- cbind(draw=rnorm(nsamples), type=0)
 
# Draw inverted cdf distribution information for range between -4 and 4
# using mean approach
sample1 <- cbind(draw=ricdf(nsamples, gicdf(f,min=-4,max=4)), type=1)
 
# samplepdf and endsign defined on Quantitations blog
# begin Quantitations code
endsign <- function(f, sign = 1) {
  b <- sign
  while (sign * f(b) < 0) b <- 10 * b
  return(b)
}
samplepdf <- function(n, pdf, ..., spdf.lower = -Inf, spdf.upper = Inf) {
  vpdf <- function(v) sapply(v, pdf, ...)  # vectorize
  cdf <- function(x) integrate(vpdf, spdf.lower, x)$value
  invcdf <- function(u) {
    subcdf <- function(t) cdf(t) - u
    if (spdf.lower == -Inf) 
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf) 
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, lower=spdf.lower, upper=spdf.upper)$root)
  }
  sapply(runif(n), invcdf)
}
# end Quantitations code
sample2 <- cbind(draw=samplepdf(nsamples, f), type=2)
 
# join the three samples together
samples <- as.data.frame(rbind(sample0, sample1, sample2))
 
ggplot(samples, aes(x=draw, color=type))+
  geom_density(aes(group=type, size=-type))
 
library(rbenchmark)
# Let us do some speed comparisons
benchmark(rnorm(nsamples), 
          ricdf(nsamples, gicdf(f,min=-4,max=4)),
          samplepdf(nsamples, f),
          replications=10)
# test replications elapsed relative user.self
#2 ricdf(nsamples, gicdf(f, min = -4, max = 4))   10    7.97
#1                              rnorm(nsamples)   10    0.00
#3                       samplepdf(nsamples, f)   10   52.39
 
# This sets the entire process of generating and drawing from ricdf as about 7 times 
# faster than samplepdf. However, if we seperate the gicdf function from the 
# ricdf function which only need be run each time a new pdf is input, then we can
# considerably increase run speeds.
 
myicdf <- gicdf(f,min=-4,max=4)
# Sets up the initial icdf to draw from. 
 
benchmark(rnorm(nsamples), ricdf(nsamples, myicdf), replications=1000)
# elapsed time for rnorm = .10 seconds and ricdf = .11
# thus drawing from ricdf can be extremely fast
 
# Now let's look at an unusual proportional pdf distribution
f <- function(x) {
  p <- rep(0, length(x))
  p[x>0&x<1] <- x[x>0&x<1]^2
  p[x>2&x<3] <- 3-x[x>2&x<3]^.5
  p[x>4&x<5] <- x[x>4&x<5]
  p
}
 
x <- seq(-1,6,.01)
plot(x,f(x), type='l', 
     xlab='Proportional Probability',
     main='Proportional pdf')
# A key thing to note is that the pdf does not need to integrate to 1.
# The gicdf function will rescale it to ensure it does.
# However, it should all be positive
 
# I define bins as equal to 10,000 when normally they are equal to 1000
myicdf <- gicdf(f,min=-1,max=6, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))
 
ggplot(samples, aes(x=draws))+
  geom_histogram(binwidth=.1)+
  aes(y = ..density..)
 
 
# Okay here is one more probability
f <- function(x) dnorm(x)*abs(1+x^4)
 
x <- seq(-5,5,.01)
plot(x,f(x), type='l', 
     xlab='Proportional Probability',
     main='Proportional pdf2')
 
 
myicdf <- gicdf(f,min=-5,max=5, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))
 
ggplot(samples, aes(x=draws))+
  geom_histogram(binwidth=.2)+
  aes(y = ..density..)
 
# I will leave you to define your own pdfs to draw from.
# If you end up using this method, please cite!





To leave a comment for the author, please follow the link and comment on their blog: Econometrics by Simulation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.