Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of Rs fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with this excellent post was that the machine cost was enormous with samples of 1000 draws taking 5 seconds on my machine with repeated samples of 100,000+ draws (which I was after) clearly being unworkable.

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions
1. You can draw pseudo-random uniform variable u
2. You can integrate the pdf to construct a cdf
$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$
3. You can invert the cdf in order to solve for p
$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value.  The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.

 Three random sampling procedures for the random normal.

On the graph, Black type=0, is rnorm. Dark blue type=1, is ricdf. Light blue, type=2 is samplepdf. Redrawing the graph several times, visually I could not tell the difference between the three methods.

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange ‘proportional’ pdfs.  I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf

And sample from it.

I also define a multimodal  pdf

And sample from it.

Find the R code below or the gist on github.

gicdf <- function(fun,
min=-3.5,
max=3.5,
bins=1000,
nqratio=10,
grouping=mean,
...) {
# Generate an inverse CDF of an arbitrary function
fun <- match.fun(fun)
grouping <- match.fun(grouping)

# Number of points to draw
nq=nqratio*bins

# Draw indexes
qdraw <- seq(min, max,length.out=nq)

# Calculate proportional probability of each draw
pdraw <- fun(qdraw,...)

# Rescale probability sum to 1, rescale
pdraw <- pdraw/sum(pdraw)

# Calculate the cumulative probability at each qdraw
cpdraw <- cumsum(pdraw)

# Caculate the cumulative probability at each bin
pbin <- (1:bins)/bins
xbin <- NA*(1:bins)

for (i in 1:bins) {
xbin[i] <- grouping(qdraw[cpdraw0], na.rm = TRUE)
cpdraw[cpdraw0&x<1] <- x[x>0&x<1]^2
p[x>2&x<3] <- 3-x[x>2&x<3]^.5
p[x>4&x<5] <- x[x>4&x<5]
p
}

x <- seq(-1,6,.01)
plot(x,f(x), type='l',
xlab='Proportional Probability',
main='Proportional pdf')
# A key thing to note is that the pdf does not need to integrate to 1.
# The gicdf function will rescale it to ensure it does.
# However, it should all be positive

# I define bins as equal to 10,000 when normally they are equal to 1000
myicdf <- gicdf(f,min=-1,max=6, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))

ggplot(samples, aes(x=draws))+
geom_histogram(binwidth=.1)+
aes(y = ..density..)

# Okay here is one more probability
f <- function(x) dnorm(x)*abs(1+x^4)

x <- seq(-5,5,.01)
plot(x,f(x), type='l',
xlab='Proportional Probability',
main='Proportional pdf2')

myicdf <- gicdf(f,min=-5,max=5, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))

ggplot(samples, aes(x=draws))+
geom_histogram(binwidth=.2)+
aes(y = ..density..)

# I will leave you to define your own pdfs to draw from.
# If you end up using this method, please cite!