Flexible correlation generation: an update to genCorMat in simstudy

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been slowly working on some updates to simstudy, focusing mostly on the functionality to generate correlation matrices (which can be used to simulate correlated data). Here, I’m briefly describing the function genCorMat, which has been updated to facilitate the generation of correlation matrices for clusters of different sizes and with potentially different correlation coefficients.

I’ll briefly describe what the existing function can currently do, and then give an idea about what the enhancements will provide.

Simple correlation matrix generation

In its original form, genCorMat could generate a single (square) correlation matrix of a specified dimension. This could a randomly generated (valid) correlation matrix, or a correlation matrix with a set of specified coefficients.

Here is an example of the first, a randomly generated correlation matrix:

library(simstudy)
library(data.table)
set.seed(37265)

genCorMat(4)
##             [,1]        [,2]        [,3]       [,4]
## [1,]  1.00000000 -0.22742403  0.01285282 -0.3201579
## [2,] -0.22742403  1.00000000 -0.04973973 -0.1218070
## [3,]  0.01285282 -0.04973973  1.00000000 -0.2940923
## [4,] -0.32015788 -0.12180695 -0.29409234  1.0000000

And here is a matrix with a specified set of coefficients (and you well get an error message if it is not positive semidefinite!):

R <- genCorMat(4, cors = c(0.6, 0.4, 0.2, 0.5, 0.3, 0.4))
R
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.6  0.4  0.2
## [2,]  0.6  1.0  0.5  0.3
## [3,]  0.4  0.5  1.0  0.4
## [4,]  0.2  0.3  0.4  1.0

This matrix can be used to generate data using functions genCorData or genCorGen:

dd <- genCorGen(n = 1000, nvars = 4, corMatrix = R, params1 = c(3, 5, 8, 9), 
  dist = "poisson", wide = TRUE)

head(dd)
##    id V1 V2 V3 V4
## 1:  1  3  3  5  8
## 2:  2  3  9 12  8
## 3:  3  1  2 14 12
## 4:  4  4  9 13 14
## 5:  5  4  9  7 11
## 6:  6  4  5  6  8

And the correlation from this data set is quite close to the specified matrix R.

round(cor(as.matrix(dd[, -1])), 1)
##     V1  V2  V3  V4
## V1 1.0 0.6 0.4 0.2
## V2 0.6 1.0 0.5 0.3
## V3 0.4 0.5 1.0 0.5
## V4 0.2 0.3 0.5 1.0

Specifying a structure

With the updated version of genCorMat, it is now possible to specify an exchangeable/compound symmetry or auto-regressive structure. Here is the compound symmetry structure:

genCorMat(nvars = 4, rho = 0.6, corstr = "cs")
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.6  0.6  0.6
## [2,]  0.6  1.0  0.6  0.6
## [3,]  0.6  0.6  1.0  0.6
## [4,]  0.6  0.6  0.6  1.0

And here is a matrix with an auto-regressive or decaying structure:

genCorMat(nvars = 4, rho = 0.6, corstr = "ar1")
##       [,1] [,2] [,3]  [,4]
## [1,] 1.000 0.60 0.36 0.216
## [2,] 0.600 1.00 0.60 0.360
## [3,] 0.360 0.60 1.00 0.600
## [4,] 0.216 0.36 0.60 1.000

Cluster-specific correlation matrices

The final major enhancement is the capability to generate a list of correlation matrices, each of which corresponds to a specific cluster. These matrices can be of different sizes (to accommodate different cluster sizes) and have different parameters (if not random). The only constraints are that the overall structure of matrices need to be the same (i.e. random, cs, or ar1), and it is not possible to use the cors argument (since the number of correlation parameters would be different depending on the cluster size).

In this example, I am generating matrices with a cs structure for four clusters with sizes 2, 3, 4, and 3, respectively, and within-cluster correlations of \(\rho_1 = 0.6\), \(\rho_2 = 0.7\), \(\rho_3 = 0.5\), and \(\rho_4 = 0.4\). This reflects an overall block correlation matrix that looks like this:

R=(1.00.60.61.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.01.00.70.70.71.00.70.70.71.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.01.00.50.50.50.51.00.50.50.50.51.00.50.50.50.51.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.01.00.40.40.41.00.40.40.41.0)\scriptsize{ R = \left ( \begin{array}{c|c|c|c} \begin{matrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.7 & 0.7 \\ 0.7 & 1.0 & 0.7 \\ 0.7 & 0.7 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.5 & 0.5 & 0.5 \\ 0.5 & 1.0 & 0.5 & 0.5 \\ 0.5 & 0.5 & 1.0 & 0.5 \\ 0.5 & 0.5 & 0.5 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.4 & 0.4 \\ 0.4 & 1.0 & 0.4 \\ 0.4 & 0.4 & 1.0 \end{matrix} \\ \end{array} \right ) }

Each column represents an individual unit (and so does each row). Reading down a column (or across a row) gives the correlations with the other individual units. The clusters are represented by the grids drawn over the matrix. In this case, individuals are correlated only with other individuals in the same cluster.

To generate this system of matrices, we just need to specify the number of observations per cluster (\(nvars\)), the correlation coefficients for each cluster (\(rho\), which in this case is a vector), and the number of clusters. The \(nvars\) argument needs to match the numbers of individuals in each cluster in the data set, and the lengths of \(nvars\) and \(rho\) must be the same as the number of clusters (though either or both can be scalars, in which case the values are shared across the clusters). The output is a list of correlation matrices, one for each cluster.

RC <- genCorMat(nvars = c(2, 3, 4, 3), rho = c(0.6, 0.7, 0.5, 0.4), 
  corstr = "cs", nclusters = 4)

RC
## $`1`
##      [,1] [,2]
## [1,]  1.0  0.6
## [2,]  0.6  1.0
## 
## $`2`
##      [,1] [,2] [,3]
## [1,]  1.0  0.7  0.7
## [2,]  0.7  1.0  0.7
## [3,]  0.7  0.7  1.0
## 
## $`3`
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.5  0.5  0.5
## [2,]  0.5  1.0  0.5  0.5
## [3,]  0.5  0.5  1.0  0.5
## [4,]  0.5  0.5  0.5  1.0
## 
## $`4`
##      [,1] [,2] [,3]
## [1,]  1.0  0.4  0.4
## [2,]  0.4  1.0  0.4
## [3,]  0.4  0.4  1.0

To create these correlated data, first we can generate a data set of individuals that are clustered in groups. The outcome will be Poisson distributed, so we are specifying mean \(\lambda\) for each cluster:

d1 <- defData(varname = "n", formula = "c(2, 3, 4, 3)", dist = "nonrandom")
d1 <- defData(d1, varname = "lambda", formula = "c(6, 7, 9, 8)", dist = "nonrandom")

ds <- genData(4, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")

Now, we can generate some data using the correlation matrix RC:

dd <- addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = RC,
          dist = "poisson", cnames = "y", method = "copula")

dd
##     site n lambda id  y
##  1:    1 2      6  1 11
##  2:    1 2      6  2  7
##  3:    2 3      7  3  4
##  4:    2 3      7  4  3
##  5:    2 3      7  5  5
##  6:    3 4      9  6  8
##  7:    3 4      9  7  7
##  8:    3 4      9  8 10
##  9:    3 4      9  9 11
## 10:    4 3      8 10  2
## 11:    4 3      8 11  6
## 12:    4 3      8 12  4

If we want to confirm that everything is working as expected, we can recover the overall correlation matrix by generating a large number of data sets (in this case 5000):

replicate <- function(R, dc) {
  reps <- lapply(1:5000, function(x)
  addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = R,
    dist = "poisson", cnames = "y", method = "copula")
  )

  drep <- data.table::rbindlist(reps, idcol = "rep")
  drep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1])
  round(cor(dmat), 1) 
}

replicate(R = RC, dc = dc)
##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.6 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.0 0.0 1.0 0.7 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.0 0.0 0.7 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.0 0.0 0.7 0.7 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.5 0.5 0.0 0.0 0.0
## 8  0.0 0.0 0.0 0.0 0.0 0.5 0.5 1.0 0.5 0.0 0.0 0.0
## 9  0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 1.0 0.0 0.0 0.0
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.4 0.4
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 1.0 0.4
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.4 1.0

It seems to have worked quite well - this empirical matrix matches the hypothetical matrix above. In the next post, I’ll describe how block matrices for different clusters over different time periods can also be flexibly generated for different groups.

Addendum

As a bonus feature, here is a code snippet that generates data for a large number of clusters, where the parameters (cluster means, variance, and correlation coefficients) themselves are randomly generated. By providing this flexibility, we induce a lot of variability in the data generation process.

d1 <- defData(varname = "n", formula = 20, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "mu", formula = 10, variance = 8, dist = "normal")
d1 <- defData(d1, varname = "s2", formula = 4, dist = "nonrandom")

ds <- genData(100, d1, id = "site")
dd <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")

n <- dd[, .N, keyby = site][, N]
nsites <- length(n)
rho <- rbeta(nsites, 25, 15)

RM <- genCorMat(nvars = n, rho = rho, corstr = "cs", nclusters = nsites)

dd <- addCorGen(dd, idvar = "site", param1 = "mu", param2 = "s2",
                corMatrix = RM, dist = "normal", cnames = "y", method = "copula")

dd
##       site  n        mu s2   id         y
##    1:    1 22  7.095561  4    1  4.102052
##    2:    1 22  7.095561  4    2  6.001640
##    3:    1 22  7.095561  4    3  4.942132
##    4:    1 22  7.095561  4    4  4.062929
##    5:    1 22  7.095561  4    5  6.112493
##   ---                                    
## 1989:  100 23 13.073472  4 1989 14.111518
## 1990:  100 23 13.073472  4 1990 13.773178
## 1991:  100 23 13.073472  4 1991 13.763948
## 1992:  100 23 13.073472  4 1992 13.562904
## 1993:  100 23 13.073472  4 1993 12.891312
To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)