# Generating clustered data with marginal correlations

**ouR data generation**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A student is working on a project to derive an analytic solution to the problem of sample size determination in the context of cluster randomized trials and repeated individual-level measurement (something I’ve thought a little bit about before). Though the goal is an analytic solution, we do want confirmation with simulation. So, I was a little disheartened to discover that the routines I’d developed in `simstudy`

for this were not quite up to the task. I’ve had to quickly fix that, and the updates are available in the development version of `simstudy`

, which can be downloaded using *devtools::install_github(“kgoldfeld/simstudy”)*. While some of the changes are under the hood, I have added a new function, `genBlockMat`

, which I’ll describe here.

### Correlation in cluster randomized trials

The fundamental issue with cluster randomized trials is that outcomes for a group of patients in a specific cluster are possibly correlated; the degree to which this is true impacts both how much we “learn” from each individual. The more highly correlated individuals are, the less information we actually have. (In the extreme case, if there is perfect correlation, we really only have a sample of one from each group.)

When generating data and modeling associations, the structure of the correlation needs to reflect the context of the study design. The specific structure can depend on whether outcomes generally vary over time (so that patient outcomes within a cluster closer temporally might be more highly correlated than outcomes collected from patients far apart in time) and whether measurements are collected for the same individuals over time (you might expect the measurements of the same individual to be more highly correlated than measurements of two different individuals).

There are at least two ways to go about simulating correlated data from a cluster randomized trial. The first is to use a random effect to *induce* correlation. For example, a simple data generating process for a binary outcome with a treatment indicator and one covariate would start with a formulation like this:

\[ P(Y_{ij} = 1) = \pi_{ij}, \ \ \ Y_{ij} \in \{0,1\}\] \[ log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2X_i + b_j \]

where \(Y_{ij}\) is the outcome for individual \(i\) in cluster \(j\). (\(A\) is a treatment indicator and \(X\) is a covariate.) The key here is \(b_j\), which is a cluster level effect that is typically assumed to have a normal distribution N(0, \(\sigma_b^2)\). In a simulation, we would use use specific values to generate a probability \(\pi_{ij}\) for each; each of the \(\pi_{ij}\)’s within a cluster would be correlated by the presence of the cluster effect \(b_j\). It would follow that the \(Y_{ij}\)’s would also be correlated within cluster \(j\). We can call this the *conditional* data generation process, and we could use a mixed-effects regression model to recover the parameters. But we won’t do this here.

Instead, we can dispose of \(b_j\), like this:

\[ log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2X_i \]

As before, we would generate the \(\pi_{ij}\)’s, but the probabilities are going to be uncorrelated now (except of course the correlation due to randomization assignment, but this would be across clusters). The within-cluster correlation is directly introduced into the \(Y_{ij}\)’s by using using multivariate data generation process. If we were in the realm of normally distributed outcomes, we would use a multivariate normal data generating process \(MVN(\mathbf{\mu}, \Sigma)\), where \(\Sigma\) is a covariance matrix. (This could be done in `simstudy`

using `genCorData`

or `addCorData`

.) In this case, with a binary outcome, we need an analogous approach, which is implemented in the `simstudy`

functions `genCorGen`

and `addCorGen`

. To recover the parameters used to generate these data, a generalized estimating equations (GEE) model would be used; and rather than being *conditional*, the parameter estimates from this model will be *marginal*, just as the data generation process was.

### Generating data – multiple time periods, single individual measurement

OK – that is a bit more background than I intended (though probably not enough). Now onto the new function and simulations.

In the first example here, the outcomes are measured at three different periods, but an individual in a cluster is measured only once. In other words, the time periods include different sets of individuals.

If we have 3 time periods and 3 individuals in each time period, the within-cluster correlation between two individuals in the same time period is \(\alpha_1\), the correlation between individuals in adjacent time periods (periods 1&2 and periods 2&3) is \(\alpha_2\), and the correlation between individuals in time periods 1 and 3 would be \(\alpha_3\). The correlation structure for the cluster would be represented like this with each period represented in \(3 \times 3\) sub-blocks:

\[ \mathbf{R} = \left( \begin{matrix} 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 \end{matrix} \right ) \]

The overall correlation matrix for the full data set (assuming 5 clusters) is represented by block matrix \(\textbf{B}\) with

\[ \mathbf{B} = \left( \begin{matrix} \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} \\ \end{matrix} \right ) \]

where \(\mathbf{0}\) is a \(9 \times 9\) matrix of \(0\)’s.

The new function `genBlockMat`

enables us to generate the \(R\) blocks (though currently it requires that the number of individuals per period per cluster are constant – I will relax that requirement in the future). Here are a couple of examples. In the first we are fixing \(\alpha_1 = 0.3\), \(\alpha_2 = 0.2\), and \(\alpha_3 = 0.1\):

library(simstudy) library(data.table) R <- genBlockMat(rho =c(0.3, 0.2, 0.1), nInds = 3 , nPeriods = 3) R ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## [1,] 1.0 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 ## [2,] 0.3 1.0 0.3 0.2 0.2 0.2 0.1 0.1 0.1 ## [3,] 0.3 0.3 1.0 0.2 0.2 0.2 0.1 0.1 0.1 ## [4,] 0.2 0.2 0.2 1.0 0.3 0.3 0.2 0.2 0.2 ## [5,] 0.2 0.2 0.2 0.3 1.0 0.3 0.2 0.2 0.2 ## [6,] 0.2 0.2 0.2 0.3 0.3 1.0 0.2 0.2 0.2 ## [7,] 0.1 0.1 0.1 0.2 0.2 0.2 1.0 0.3 0.3 ## [8,] 0.1 0.1 0.1 0.2 0.2 0.2 0.3 1.0 0.3 ## [9,] 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 1.0

In the second example, we specify the correlation using an auto-regressive structure with \(\alpha = 0.3\), so that \(\alpha_1 = \alpha =0.3\), \(\alpha_2 = \alpha^2 = 0.09\), and \(\alpha_3 = \alpha^3 = 0.027\):

genBlockMat(rho = 0.3, corstr = "ar1", nInds = 3 , nPeriods = 3) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## [1,] 1.000 0.300 0.300 0.09 0.09 0.09 0.027 0.027 0.027 ## [2,] 0.300 1.000 0.300 0.09 0.09 0.09 0.027 0.027 0.027 ## [3,] 0.300 0.300 1.000 0.09 0.09 0.09 0.027 0.027 0.027 ## [4,] 0.090 0.090 0.090 1.00 0.30 0.30 0.090 0.090 0.090 ## [5,] 0.090 0.090 0.090 0.30 1.00 0.30 0.090 0.090 0.090 ## [6,] 0.090 0.090 0.090 0.30 0.30 1.00 0.090 0.090 0.090 ## [7,] 0.027 0.027 0.027 0.09 0.09 0.09 1.000 0.300 0.300 ## [8,] 0.027 0.027 0.027 0.09 0.09 0.09 0.300 1.000 0.300 ## [9,] 0.027 0.027 0.027 0.09 0.09 0.09 0.300 0.300 1.000

Finally, we can specify using an exchangeable or compound symmetry structure with \(\alpha = 0.3\), so that \(\alpha_1 = \alpha_2 = \alpha_3 = \alpha = 0.3\) (i.e., there is constant between-individual correlation within each cluster over time):

genBlockMat(rho =0.3, corstr = "cs", nInds = 3 , nPeriods = 3) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## [1,] 1.0 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ## [2,] 0.3 1.0 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ## [3,] 0.3 0.3 1.0 0.3 0.3 0.3 0.3 0.3 0.3 ## [4,] 0.3 0.3 0.3 1.0 0.3 0.3 0.3 0.3 0.3 ## [5,] 0.3 0.3 0.3 0.3 1.0 0.3 0.3 0.3 0.3 ## [6,] 0.3 0.3 0.3 0.3 0.3 1.0 0.3 0.3 0.3 ## [7,] 0.3 0.3 0.3 0.3 0.3 0.3 1.0 0.3 0.3 ## [8,] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 1.0 0.3 ## [9,] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 1.0

Using the first target correlation matrix (\(\alpha_1 = 0.3\), \(\alpha_2 = 0.2\), and \(\alpha_3 = 0.1\)), we can go ahead and generate some data. The probability of outcome \(Y\) will be a function of cluster-level treatment \(A\), \(A \in \{0,1\}\) and individual-level covariate \(z\), a continuous measure centered closely around 0.

Here are the data definitions:

b0 <- -1.0; delta <- 1; number_inds = 3; defC <- defData(varname = "A", formula = "1;1", dist = "trtAssign") defI <- defDataAdd(varname = "z", formula = 0, variance = 0.10) defI <- defDataAdd(defI, varname = "p", formula = "..b0 + ..delta * A + .5*z", dist = "nonrandom", link = "logit")

The key to generating the data using the specified correlation matrix \(R\) is that the data need to be set up in the correct order; it will need to be sorted by cluster and period (and in this case the order of individuals does not matter). First, we generate the cluster level data, and period data for the clusters, and then the 3 individuals within each cluster and time period.

set.seed(1234) dc <- genData(n = 10, dtDefs = defC, id = "site") dc <- addPeriods(dtName = dc, nPeriods = 3, idvars = "site", perName = "period") dd <- genCluster(dtClust = dc, cLevelVar = "timeID", numIndsVar = number_inds, level1ID = "idnum") dd <- addColumns(defI, dd) setkey(dd, "site", "period", "idnum")

The correlated outcomes are generated using `addCorGen`

, where the correlation matrix \(R\) is applied. And, as I said before (I cannot reiterate this point enough), it is critical that the data are sorted in the correct order - by site and period:

dres <- addCorGen(dd, idvar = "site", corMatrix = R, dist = "binary", param1 = "p", cnames = "y", method = "ep") head(dres, n = 9) ## site period A timeID idnum z p y ## 1: 1 0 1 1 1 0.3109796 0.5387943 0 ## 2: 1 0 1 1 2 -0.1968381 0.4754151 1 ## 3: 1 0 1 1 3 -0.2313320 0.4711157 0 ## 4: 1 1 1 2 4 -0.1633853 0.4795882 1 ## 5: 1 1 1 2 5 -0.5536305 0.4312347 1 ## 6: 1 1 1 2 6 0.2783134 0.5347331 1 ## 7: 1 2 1 3 7 0.4332353 0.5539436 1 ## 8: 1 2 1 3 8 -0.5335796 0.4336954 0 ## 9: 1 2 1 3 9 -0.1984128 0.4752187 1

To convince myself that the data are indeed being generated intended, I am generating a large number of data sets using `addCorGen`

(with fixed individual level probabilities). The empirical within-cluster correlation can be estimated as the observed correlation of the repeated data sets (within each cluster). I am outputting the nine \(\pi\)’s for the individuals in a one of the clusters next to the mean observed outcome for that individual. And, more importantly, I am showing the within-cluster correlation calculated from the 2,500 observations. In both cases, the observed values are very much consistent with the true values used to generate the data. Though I am not showing you, this is the case for all 10 clusters.

reps <- lapply(1:2500, function(x) addCorGen(dd, idvar = "site", corMatrix = R, dist = "binary", param1 = "p", cnames = "y", method = "ep")) empir_corr <- function(cluster) { drep <- data.table::rbindlist(reps, idcol = "rep") drep <- drep[site == cluster, ] drep[, seq := 1:.N, keyby = rep] dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1]) mu <- cbind(true = round(dd[site == cluster, p], 2), observed = round(apply(dmat, 2, mean), 2)) R_hat <- round(cor(dmat), 1) return(list(mu = mu, R_hat = R_hat)) } empir_corr(cluster = 7) ## $mu ## true observed ## 1 0.32 0.33 ## 2 0.25 0.25 ## 3 0.27 0.28 ## 4 0.25 0.26 ## 5 0.28 0.27 ## 6 0.26 0.26 ## 7 0.27 0.26 ## 8 0.28 0.28 ## 9 0.31 0.31 ## ## $R_hat ## 1 2 3 4 5 6 7 8 9 ## 1 1.0 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 ## 2 0.3 1.0 0.3 0.2 0.2 0.2 0.1 0.1 0.1 ## 3 0.3 0.3 1.0 0.2 0.2 0.2 0.1 0.1 0.1 ## 4 0.2 0.2 0.2 1.0 0.3 0.3 0.2 0.2 0.2 ## 5 0.2 0.2 0.2 0.3 1.0 0.3 0.2 0.2 0.2 ## 6 0.2 0.2 0.2 0.3 0.3 1.0 0.2 0.2 0.2 ## 7 0.1 0.1 0.1 0.2 0.2 0.2 1.0 0.3 0.3 ## 8 0.1 0.1 0.1 0.2 0.2 0.2 0.3 1.0 0.3 ## 9 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 1.0

### Multiple time periods, repeated individual measurements

In this next example, I will simulate a case where there are two measurement periods but the same set of individuals is in each period (say a baseline and follow-up). In this case, the within-individual (between-period) correlation (\(\alpha^*\)) can be different from the between-individual, within-period correlation (\(\alpha_1\)) and the between-individual, between-period correlation (\(\alpha_2\)):

\[ \mathbf{R} = \left( \begin{matrix} 1 & \alpha_1 & \alpha_1 & \alpha^* & \alpha_2 & \alpha_2 \\ \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha^* & \alpha_2 \\ \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha^* \\ \alpha^* & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 \\ \alpha_2 & \alpha^* & \alpha_2 & \alpha_1 & 1 & \alpha_1 \\ \alpha_2 & \alpha_2 & \alpha^* & \alpha_1 & \alpha_1 & 1 \\ \end{matrix} \right ) \]

`genBlockMat`

implements this with an additional argument *iRho* for the within-individual correlations. Here we are using *iRho* to set \(\alpha^* = 0.5\):

R <- genBlockMat(rho =c(0.3, .1), nInds = number_inds, nPeriods = 2, iRho = 0.5) R ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 1.0 0.3 0.3 0.5 0.1 0.1 ## [2,] 0.3 1.0 0.3 0.1 0.5 0.1 ## [3,] 0.3 0.3 1.0 0.1 0.1 0.5 ## [4,] 0.5 0.1 0.1 1.0 0.3 0.3 ## [5,] 0.1 0.5 0.1 0.3 1.0 0.3 ## [6,] 0.1 0.1 0.5 0.3 0.3 1.0

The data generation process here is slightly modified to accommodate this structure of individuals repeated across time periods. The clusters are generated first, as before, but then the individuals are created within the clusters. Once the individuals are created, then the time periods are introduced before generating the outcome (which in this case depends on the time period). Before generating the correlated outcomes using `addCorGen`

, it is now important that the data be sorted by cluster, period, *and* individual, as this is in line with the correlation structure.

b0 <- -1.0; b1 <- .5; delta <- -0.3; number_inds = 3; defC <- defData(varname = "A", formula = "1;1", dist = "trtAssign") defI <- defDataAdd(varname = "z", formula = 0, variance = 0.10) defI <- defDataAdd(defI, varname = "p", formula = "..b0 + ..b1* measure + ..delta * A * measure + z", dist = "nonrandom", link = "logit") set.seed(1234) dc <- genData(n = 10, dtDefs = defC, id = "site") dc <- genCluster(dtClust = dc, cLevelVar = "site", numIndsVar = number_inds, level1ID = "idnum") dd <- addPeriods(dtName = dc, nPeriods = 2, idvars = "idnum", perName = "measure") dd <- addColumns(defI, dd) setkey(dd, "site", "measure", "idnum") dres <- addCorGen(dd, idvar = "site", corMatrix = R, dist = "binary", param1 = "p", cnames = "y", method = "ep") head(dres, n = 6) ## idnum measure site A timeID z p y ## 1: 1 0 1 1 1 0.3109796 0.3342510 0 ## 2: 2 0 1 1 3 -0.2313320 0.2259484 0 ## 3: 3 0 1 1 5 -0.5536305 0.1745625 1 ## 4: 1 1 1 1 2 -0.1968381 0.2695635 0 ## 5: 2 1 1 1 4 -0.1633853 0.2762009 0 ## 6: 3 1 1 1 6 0.2783134 0.3724579 0

Again, we can confirm that the data generation process is working as anticipated by looking at the empirical means and correlation matrix based on 2,500 sets of outcomes:

reps <- lapply(1:2500, function(x) addCorGen(dd, idvar = "site", corMatrix = R, dist = "binary", param1 = "p", cnames = "y", method = "ep")) empir_corr(cluster = 2) ## $mu ## true observed ## 1 0.36 0.36 ## 2 0.23 0.23 ## 3 0.31 0.30 ## 4 0.21 0.21 ## 5 0.31 0.31 ## 6 0.27 0.27 ## ## $R_hat ## 1 2 3 4 5 6 ## 1 1.0 0.3 0.3 0.5 0.1 0.1 ## 2 0.3 1.0 0.3 0.1 0.5 0.1 ## 3 0.3 0.3 1.0 0.1 0.1 0.5 ## 4 0.5 0.1 0.1 1.0 0.3 0.3 ## 5 0.1 0.5 0.1 0.3 1.0 0.3 ## 6 0.1 0.1 0.5 0.3 0.3 1.0

### Varying cluster sizes for single period designs

I’ve alluded to the fact that `genBlockMat`

and `addCorGen`

cannot accommodate varying cluster sizes (yet), but if we need to generate clustered data with correlated outcomes and varying cluster sizes in a *single* period using `addCorGen`

, that *is* now possible. Here is a simple example to demonstrate how this works.

I am generating three clusters with sizes of 3, 4, and 2 individuals, respectively. The outcome is binary, and the probability of success varies slightly by cluster. First, we generate the clusters:

d1 <- defData(varname = "n", formula = "c(3, 4, 2)", dist = "nonrandom") d1 <- defData(d1, varname = "p", formula = 0.4, variance = 40, dist = "beta") set.seed(1234) ds <- genData(3, d1, id = "site") ds ## site n p ## 1: 1 3 0.2942948 ## 2: 2 4 0.4245886 ## 3: 3 2 0.5027532

And then the individual level data with the correlated outcomes:

dd <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id") addCorGen(dd, idvar = "site", rho =0.4, corstr = "cs", param1 = "p", dist = "binary", cnames = "y", method = "ep") ## site n p id y ## 1: 1 3 0.2942948 1 0 ## 2: 1 3 0.2942948 2 0 ## 3: 1 3 0.2942948 3 0 ## 4: 2 4 0.4245886 4 1 ## 5: 2 4 0.4245886 5 0 ## 6: 2 4 0.4245886 6 1 ## 7: 2 4 0.4245886 7 1 ## 8: 3 2 0.5027532 8 1 ## 9: 3 2 0.5027532 9 0

Once again, to confirm that the correlation structure is what we expect, I’ve generated 2,500 sets of outcomes. This time, I am showing the full observed correlation matrix where it is clear that the between cluster outcomes are independent of each other:

reps <- lapply(1:2500, function(x) addCorGen(dd, idvar = "site", rho =0.4, corstr = "cs", param1 = "p", dist = "binary", cnames = "y", method="ep") ) drep <- data.table::rbindlist(reps, idcol = "rep") dmat <- as.matrix(dcast(drep, rep ~ id, value.var = "y")[, -1]) mu <- cbind(true = round(dd[, p], 2), observed = round(apply(dmat, 2, mean), 2)) R_hat <- round(cor(dmat), 1) mu ## true observed ## 1 0.29 0.29 ## 2 0.29 0.30 ## 3 0.29 0.29 ## 4 0.42 0.43 ## 5 0.42 0.41 ## 6 0.42 0.43 ## 7 0.42 0.42 ## 8 0.50 0.50 ## 9 0.50 0.50 R_hat ## 1 2 3 4 5 6 7 8 9 ## 1 1.0 0.4 0.4 0.0 0.0 0.0 0.0 0.0 0.0 ## 2 0.4 1.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0 ## 3 0.4 0.4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ## 4 0.0 0.0 0.0 1.0 0.4 0.4 0.4 0.0 0.0 ## 5 0.0 0.0 0.0 0.4 1.0 0.4 0.4 0.0 0.0 ## 6 0.0 0.0 0.0 0.4 0.4 1.0 0.4 0.0 0.0 ## 7 0.0 0.0 0.0 0.4 0.4 0.4 1.0 0.0 0.0 ## 8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.4 ## 9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 1.0

**leave a comment**for the author, please follow the link and comment on their blog:

**ouR data generation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.