**ouR data generation**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating `simstudy`

with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from *binomial*, *poisson*, *gamma*, and *uniform* distributions: `genCorGen`

and `addCorGen`

. Here’s a brief intro to these functions.

### Generate data based on values from existing data set

`addCorGen`

allows us to create correlated data from an existing data set, as one can already do using `addCorData`

, but with non-normal data. In the case of `addCorGen`

, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable `xbase`

, which varies by id.

First we define the data and generate a data set:

```
def <- defData(varname = "xbase", formula = 5, variance = 0.2,
dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase",
dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1,
dist="nonrandom")
dt <- genData(10000, def)
dt
```

```
## cid xbase lambda p gammaMu gammaDis
## 1: 1 12.1128232 5.536056 0.8366960 18.588900 1
## 2: 2 4.9148342 2.695230 0.3715554 4.405998 1
## 3: 3 11.5550282 5.235712 0.8125261 16.626630 1
## 4: 4 3.0802596 2.243475 0.2542785 3.052778 1
## 5: 5 0.9767811 1.817893 0.1535577 2.004423 1
## ---
## 9996: 9996 6.0564517 3.021173 0.4543613 5.536100 1
## 9997: 9997 3.1298866 2.254636 0.2571119 3.083229 1
## 9998: 9998 12.4642670 5.734076 0.8505956 19.942505 1
## 9999: 9999 4.6559318 2.626345 0.3536072 4.183660 1
## 10000: 10000 3.4314285 2.323658 0.2747666 3.274895 1
```

The Poisson distribution has a single parameter, lambda:

```
dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1,
corstr = "cs", dist = "poisson", param1 = "lambda",
cnames = "a, b, c")
dtX1[, .(cid, xbase, lambda, a, b, c)]
```

```
## cid xbase lambda a b c
## 1: 1 12.1128232 5.536056 4 6 7
## 2: 2 4.9148342 2.695230 2 4 1
## 3: 3 11.5550282 5.235712 5 6 4
## 4: 4 3.0802596 2.243475 1 3 1
## 5: 5 0.9767811 1.817893 2 1 0
## ---
## 9996: 9996 6.0564517 3.021173 1 3 3
## 9997: 9997 3.1298866 2.254636 2 3 1
## 9998: 9998 12.4642670 5.734076 4 6 8
## 9999: 9999 4.6559318 2.626345 2 3 5
## 10000: 10000 3.4314285 2.323658 0 0 3
```

The Bernoulli (binary) distribution has a single parameter, p:

```
dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4,
corstr = "ar1", dist = "binary", param1 = "p")
dtX2[, .(cid, xbase, p, V1, V2, V3, V4)]
```

```
## cid xbase p V1 V2 V3 V4
## 1: 1 12.1128232 0.8366960 1 1 1 1
## 2: 2 4.9148342 0.3715554 0 0 0 0
## 3: 3 11.5550282 0.8125261 1 1 1 1
## 4: 4 3.0802596 0.2542785 0 1 0 0
## 5: 5 0.9767811 0.1535577 0 0 0 1
## ---
## 9996: 9996 6.0564517 0.4543613 0 0 0 0
## 9997: 9997 3.1298866 0.2571119 1 0 0 0
## 9998: 9998 12.4642670 0.8505956 0 1 1 1
## 9999: 9999 4.6559318 0.3536072 1 1 0 0
## 10000: 10000 3.4314285 0.2747666 1 0 1 1
```

And here is the the Gamma distribution, with its two parameters (mean and dispersion):

```
dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4,
corstr = "cs", dist = "gamma",
param1 = "gammaMu", param2 = "gammaDis")
dtX3[, .(cid, xbase, gammaMu, gammaDis,
V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))]
```

```
## cid xbase gammaMu gammaDis V1 V2 V3
## 1: 1 12.1128232 18.588900 1 11.24 3.44 9.11
## 2: 2 4.9148342 4.405998 1 0.91 3.77 0.76
## 3: 3 11.5550282 16.626630 1 68.47 12.91 1.72
## 4: 4 3.0802596 3.052778 1 2.54 3.63 2.98
## 5: 5 0.9767811 2.004423 1 0.39 0.14 0.42
## ---
## 9996: 9996 6.0564517 5.536100 1 0.29 4.84 1.80
## 9997: 9997 3.1298866 3.083229 1 4.81 0.38 0.81
## 9998: 9998 12.4642670 19.942505 1 17.10 3.56 4.04
## 9999: 9999 4.6559318 4.183660 1 1.17 0.21 1.47
## 10000: 10000 3.4314285 3.274895 1 1.02 1.61 2.24
```

### Long form data

If we have data in *long* form (e.g. longitudinal data), the function will recognize the structure:

```
def <- defData(varname = "xbase", formula = 5, variance = .4,
dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3,
dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "lambda",
formula = "0.5 + 0.5 * period + 0.1 * xbase",
dist="nonrandom", link = "log")
dt <- genData(1000, def)
dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)
dtLong
```

```
## cid period xbase nperiods timeID lambda
## 1: 1 0 6.693980 1 1 3.220053
## 2: 1 1 6.693980 1 2 5.308971
## 3: 1 2 6.693980 1 3 8.753013
## 4: 2 0 10.008645 2 4 4.485565
## 5: 2 1 10.008645 2 5 7.395447
## ---
## 2996: 999 1 6.753605 2 2996 5.340720
## 2997: 999 2 6.753605 2 2997 8.805359
## 2998: 1000 0 2.006781 4 2998 2.015119
## 2999: 1000 1 2.006781 4 2999 3.322369
## 3000: 1000 2 2.006781 4 3000 5.477661
```

```
### Generate the data
dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3,
rho = .6, corstr = "cs", dist = "poisson",
param1 = "lambda", cnames = "NewPois")
dtX3
```

```
## cid period xbase nperiods timeID lambda NewPois
## 1: 1 0 6.693980 1 1 3.220053 3
## 2: 1 1 6.693980 1 2 5.308971 5
## 3: 1 2 6.693980 1 3 8.753013 9
## 4: 2 0 10.008645 2 4 4.485565 2
## 5: 2 1 10.008645 2 5 7.395447 4
## ---
## 2996: 999 1 6.753605 2 2996 5.340720 6
## 2997: 999 2 6.753605 2 2997 8.805359 11
## 2998: 1000 0 2.006781 4 2998 2.015119 2
## 2999: 1000 1 2.006781 4 2999 3.322369 4
## 3000: 1000 2 2.006781 4 3000 5.477661 7
```

We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:

```
geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid,
family = poisson, corstr = "exchangeable")
```

`## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27`

`## running glm to get initial regression estimate`

```
## (Intercept) period xbase
## 0.52045259 0.50354885 0.09746544
```

`round(summary(geefit)$working.correlation, 2)`

```
## [,1] [,2] [,3]
## [1,] 1.00 0.58 0.58
## [2,] 0.58 1.00 0.58
## [3,] 0.58 0.58 1.00
```

In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, let me know.

**leave a comment**for the author, please follow the link and comment on their blog:

**ouR data generation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.