# Sampling for Monte Carlo simulations with R

October 31, 2011
By

When doing Monte Carlo simulation, it’s important to pick your parameter values efficiently especially if your model is computationally expensive to run. If the model takes two days to run, and a parameter $x$ ranges from 0 to 10, it doesn’t make much sense to run it once at $x=4.9$ and again at $x=5.1$ if $x \in (0,4), (6,10)$ hasn’t been explored at all.

To get around this problem, one can use quasi-random low-discrepancy sequences which are designed to fill a parameter space efficiently. The R package, randtoolbox, provides implementations of common sequences, like the Halton or Sobol’, but the process involves a couple of steps that beg to be automated. The general process is:

1. Generate a n x p matrix of uniformly distributed quasi-random values, where n is the number of simulations you wish to run and p is the number of parameters.
2. For each column of the matrix, convert the quasi-random value to the parameter’s actual distribution by inverting the cdf curve. So if you have a uniformly distributed value of 0.5, and you want to convert it to a normal distribution with mean $\mu$ you would get a parameter value of $\mu$ for use in your simulation.
3. Run your simulation with these parameter values, and analyse the results

I’ve written a little R function to make this process easier. You simply pass it the number of simulations you want to run, and a list describing each parameter, and it will return the Monte Carlo sample as a data frame. At the moment, it’s pretty rudimentry, and each parameter is described by a name, a distribution name (matching the R abbreviations, e.g. “unif” for the uniform distribution, “norm” for the normal), and two parameters to describe the distribution.

# Generate a Monte Carlo sample
generateMCSample <- function(n, vals) {
# Packages to generate quasi-random sequences
# and rearrange the data
require(randtoolbox)
require(plyr)

# Generate a Sobol' sequence
sob <- sobol(n, length(vals))

# Fill a matrix with the values
# inverted from uniform values to
# distributions of choice
samp <- matrix(rep(0,n*(length(vals)+1)), nrow=n)
samp[,1] <- 1:n
for (i in 1:length(vals)) {
l <- vals[[i]]
dist <- l$dist params <- l$params
samp[,i+1] <- eval(call(paste("q",dist,sep=""),sob[,i],params[1],params[2]))
}

# Convert matrix to data frame and label
samp <- as.data.frame(samp)
names(samp) <- c("n",laply(vals, function(l) l\$var))
return(samp)
}

Created by Pretty R at inside-R.org

Here’s a simple example to show how it can be used.

n <- 1000  # number of simulations to run

# List described the distribution of each variable
vals <- list(list(var="Uniform",
dist="unif",
params=c(0,1)),
list(var="Normal",
dist="norm",
params=c(0,1)),
list(var="Weibull",
dist="weibull",
params=c(2,1)))

# Generate the sample
samp <- generateMCSample(n,vals)

# Plot with ggplot2
library(ggplot2)
samp.mt <- melt(samp,id="n")
gg <- ggplot(samp.mt,aes(x=value)) +
geom_histogram(binwidth=0.1) +
theme_bw() +
facet_wrap(~variable, ncol=3,scale="free")

Created by Pretty R at inside-R.org

And here’s the resulting picture:

Histogram of three parameters in a Monte Carlo sample

Any suggestions on how to improve this function so that it has a more generic description of a distribution would be appreciated (e.g. for distributions with n!=2 parameters).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...