Site icon R-bloggers

Easily generate correlated variables from any distribution

[This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post I will demonstrate in R how to draw correlated random variables from any distribution

The idea is simple. 
1. Draw any number of variables from a joint normal distribution. 2. Apply the univariate normal CDF of variables to derive probabilities for each variable. 3. Finally apply the inverse CDF of any distribution to simulate draws from that distribution.
The results is that the final variables are correlated in a similar manner to that of the original variables.  This is because the rank order of the variables in maintained and thus correlations are approximately the same though not exact. This methods follows a method I presented in a previous post coded in Stata.  I am not aware of anybody else proposing this method previously.

For example:
library(MASS)
 
# We will use the command mvrnorm to draw a matrix of variables
 
# Let's keep it simple, 
mu <- rep(0,4)
Sigma <- matrix(.7, nrow=4, ncol=4) + diag(4)*.3
 
rawvars <- mvrnorm(n=10000, mu=mu, Sigma=Sigma)
 
cov(rawvars); cor(rawvars)
# We can see our normal sample produces results very similar to our 
#specified covariance levels.
 
# No lets transform some variables
pvars <- pnorm(rawvars)
 
# Through this process we already have 
cov(pvars); cor(pvars)
# We can see that while the covariances have dropped significantly, 
# the simply correlations are largely the same.
 
plot(rawvars[,1], pvars[,2], main="Normal of Var 1 with probabilities of Var 2") 
 
# Things are looking pretty well so far.  Let's see what happens when we invert 
#different CDFs.
 
# To generate correlated poisson
poisvars <- qpois(pvars, 5)
cor(poisvars, rawvars) 
# This matrix presents the correlation between the original values generated
# and the tranformed poisson variables.  We can see that the correlation matrix
# is very similar though somewhat "downward biased".  This is because any
# transformation away from the original will reduce the correlation between the
# variables.
 
plot(poisvars,rawvars, main="Poisson Transformation Against Normal Values")

 

 
# Perhaps the poisson count distribution is not exotic enough.
 
# Perhaps a binomial distribution with 3 draws at 25% each
binomvars <- qpois(1-pvars, 3, .25) 
  # Note, I did 1-p because p is defined differently for the qpois for some 
#reason
cor(binomvars, rawvars) 
 
# Or the exponential distribution
expvars <- qexp(pvars)
cor(expvars, rawvars)
# We can see that the correlations after the exponential tranformations are
# significantly weaker (from .7 to .63) but still good representations if
# we are interested in simulating correlations between a normal and exponential
# variables.
 
plot(expvars,rawvars, main="Exponential Transformation Against Normal Values") 
# To make things a little more interesting, let's now transform our probabilities
# into skewed normal distributions.
library(sn)
sknormvars <- qsn(pvars, 5, 2, 5)
cor(expvars, rawvars)
 
hist(sknormvars, breaks=20) 
# Finally in order to demonstrate what we can do let's combine our variables into
# a single matrix.
 
combvar <- dataframe(sknormvars[,1], poisvars[,2], binomvars[,3], expvars[,4])
 
cor(combvar)
#          [,1]      [,2]      [,3]      [,4]
#[1,] 1.0000000 0.6853314 0.6826398 0.6256086
#[2,] 0.6853314 1.0000000 0.6748458 0.6402233
#[3,] 0.6826398 0.6748458 1.0000000 0.6325102
#[4,] 0.6256086 0.6402233 0.6325102 1.0000000
 
# I am going to try to get all of these dirstributions on the same graph
stdcombvar <- t(t(combvar)-apply(combvar,2,min))
stdcombvar <- t(t(stdcombvar)/apply(stdcombvar,2,max))
summary(stdcombvar)
 
plotter <- data.frame(
  values = c(stdcombvar),
  rawnorm = rep(rawvars[,1], 4),
  type = rep(c("skewed normal", 
             "poisson", 
             "binomial", 
             "exponential"), 
             each=10000))
 
library(ggplot2)  
  
ggplot(plotter, aes(x=rawnorm ,y=values, color=type)) +
  geom_point(shape=1) +     # Use hollow circles
  geom_smooth(method=lm,    # Add linear regression line
              se=FALSE)  
 
 
Formatted by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on their blog: Econometrics by Simulation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.