By Joseph Rickert
The ability to generate synthetic data with a specified correlation structure is essential to modeling work. As you might expect, R’s toolbox of packages and functions for generating and visualizing data from multivariate distributions is impressive. The basic function for generating multivariate normal data is mvrnorm() from the MASS package included in base R, although the mvtnorm package also provides functions for simulating both multivariate normal and t distributions. (For tutorial on how to use R to simulate from multivariate normal distributions from first principles using some linear algebra and the Cholesky decomposition see the astrostatistics tutorial on Multivariate Computations.)
The following block of code generates 5,000 draws from a bivariate normal distribution with mean (0,0) and covariance matrix Sigma printed in code. The function kde2d(), also from the Mass package generates a two-dimensional kernel density estimation of the distribution's probability density function.
# SIMULATING MULTIVARIATE DATA # https://stat.ethz.ch/pipermail/r-help/2003-September/038314.html # lets first simulate a bivariate normal sample library(MASS) # Simulate bivariate normal data mu <- c(0,0) # Mean Sigma <- matrix(c(1, .5, .5, 1), 2) # Covariance matrix # > Sigma # [,1] [,2] # [1,] 1.0 0.1 # [2,] 0.1 1.0 # Generate sample from N(mu, Sigma) bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma ) # from Mass package head(bivn) # Calculate kernel density estimate bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50) # from MASS package
R offers several ways of visualizing the distribution. These next two lines of code overlay a contour plot on a "heat Map" that maps the density of points to a gradient of colors.
This plots the irregular contours of the simulated data. The code below which uses the ellipse() function from the ellipse package generates the classical bivariate normal distribution plot that graces many a textbook.
# Classic Bivariate Normal Diagram library(ellipse) rho <- cor(bivn) y_on_x <- lm(bivn[,2] ~ bivn[,1]) # Regressiion Y ~ X x_on_y <- lm(bivn[,1] ~ bivn[,2]) # Regression X ~ Y plot_legend <- c("99% CI green", "95% CI red","90% CI blue", "Y on X black", "X on Y brown") plot(bivn, xlab = "X", ylab = "Y", col = "dark blue", main = "Bivariate Normal with Confidence Intervals") lines(ellipse(rho), col="red") # ellipse() from ellipse package lines(ellipse(rho, level = .99), col="green") lines(ellipse(rho, level = .90), col="blue") abline(y_on_x) abline(x_on_y, col="brown") legend(3,1,legend=plot_legend,cex = .5, bty = "n")
The next bit of code generates a couple of three dimensional surface plots. The second of which is an rgl plot that you will be able to rotate and view from different perspectives on your screen.
The code that follows uses the rtmvt() function from the tmvtnorm package to generate bivariate t distribution. The rgl plot renders the surface kernel density estimate of the surface in impressive detail.
# Draw from multi-t distribution without truncation library (tmvtnorm) Sigma <- matrix(c(1, .1, .1, 1), 2) # Covariance matrix X1 <- rtmvt(n=1000, mean=rep(0, 2), sigma = Sigma, df=2) # from tmvtnorm package t.kde <- kde2d(X1[,1], X1[,2], n = 50) # from MASS package col2 <- heat.colors(length(bivn.kde$z))[rank(bivn.kde$z)] persp3d(x=t.kde, col = col2)
The real value of the multivariate distribution functions from the data science perspective is to simulate data sets with many more than two variables. The functions we have been considering are up to the task, but there are some technical considerations and, of course, we don't have the same options for visualization. The following code snippet generates 10 variables from a multivariate normal distribution with a specified covariance matrix. Note that I've used the genPositiveDefmat() function from the clusterGeneration package to generate the covariance matrix. This is because mvrnorm() will throw an error, as theory says it should, if the covariance matrix is not positive definite, and guessing a combination of matrix elements to make a high dimensional matrix positive definite would require quite a bit of luck along with some serious computation time.
After generating the matrix, I use the corrplot() function from the corrplot package to produce an attractive pairwise correlation plot that is coded both by shape and color. corrplot() scales pretty well with the number of variables and will give a decent chart with 40 to 50 variables. (Note that now ggcorrplot will do this for ggplot2 plots.) Other plotting options would be to generate pairwise scatter plots and R offers many alternatives for these.
Finally, what about going beyond the multivariate normal and t distributions? R does have a few functions like rlnorm() from the compositions package which generates random variates from the multivariate lognormal distribution that are as easy to use as mvrorm(), but you will have to hunt for them. I think a more fruitful approach if you are serious about probability distributions is to get familiar with the copula package.