# It can be easy to explore data generating mechanisms with the simstudy package

**ouR data generation**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I learned statistics and probability by simulating data. Sure, I did the occasional proof, but I never believed the results until I saw it in a simulation. I guess I have it backwards, but I that’s just the way I am. And now that I am a so-called professional, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure – I’ll use the occasional formula when one exists, but I always feel the need to check it with simulation. It’s just the way I am.

Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, which obviously means *simulating study data*. The purpose here in this blog entyr is to introduce the basic idea behind simstudy, and provide a relatively brief example that actually comes from a question a user posed about generating correlated longitudinal data.

## The basic idea

Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved, extending to allow simulators to generate observed or randomized *treatment assignment/exposures*, *survival* data, *longitudinal/panel* data, *multi-level/hierarchical* data, datasets with *correlated variables* based on a specified covariance structure, and to data sets with *missing* data based on a variety of missingness patterns.

The key to simulating data in simstudy is the creation of series of data defintion tables that look like this:

varname | formula | variance | dist | link |
---|---|---|---|---|

nr | 7 | 0 | nonrandom | identity |

x1 | 10;20 | 0 | uniform | identity |

y1 | nr + x1 * 2 | 8 | normal | identity |

y2 | nr – 0.2 * x1 | 0 | poisson | log |

xCat | 0.3;0.2;0.5 | 0 | categorical | identity |

g1 | 5+xCat | 1 | gamma | log |

a1 | -3 + xCat | 0 | binary | logit |

Here’s the code that is used to generate this definition, which is stored as a data.table :

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum") def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20") def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8) def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log") def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical") def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log") def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")

To create a simple data set based on these definitions, all one needs to do is execute a single `genData`

command. In this example, we generate 500 records that are based on the definition in the `def`

table:

dt <- genData(500, def) dt ## idnum nr x1 y1 y2 xCat g1 a1 ## 1: 1 7 11.19401 26.92518 100 1 921.48860 0 ## 2: 2 7 11.44471 29.41959 112 1 78.28623 0 ## 3: 3 7 16.69289 44.03212 41 3 2776.72932 1 ## 4: 4 7 11.15279 30.11492 103 2 3408.90636 1 ## 5: 5 7 18.06209 40.93295 19 3 8528.21763 0 ## --- ## 496: 496 7 11.84040 30.91384 101 3 2848.82490 0 ## 497: 497 7 17.82783 36.51813 37 1 1047.21862 1 ## 498: 498 7 16.66917 42.17807 28 2 1219.15208 0 ## 499: 499 7 18.53819 47.81427 25 1 387.60547 0 ## 500: 500 7 12.44773 34.05384 79 3 1692.40276 1

There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want give a little more introduction by way of an example that came in from around the world a couple of days ago. (I’d say the best thing about building a package is hearing from folks literally all over the world and getting to talk to them about statistics and R. It is really incredible to be able to do that.)

## Going a bit further: simulating a prosepctive cohort study with repeated measures

The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? Of course.

This was what I sent back to my correspondent:

# Define the outcome ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T", variance = 3) # Generate a 'blank' data.table with 24 observations and assign them to # groups set.seed(1234) indData <- genData(24) indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T") # Create a longitudinal data set of 3 records for each id longData <- addPeriods(indData, nPeriods = 3, idvars = "id") longData <- addColumns(dtDefs = ydef, longData) longData[, `:=`(T, factor(T, labels = c("No", "Yes")))] # Let's look at the data ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T, group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")

If we generate a data set based on 1,000 indviduals and estimate a linear regression model we see that the parameter estimates are quite good. However, my correspondent wrote back saying she wanted correlated data, which makes sense. We can see from the alpha estimate of approximately 0.02 (at the bottom of the output), we don’t have much correlation:

# Fit a GEE model to the data fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"), data = longData, id = id, corstr = "exchangeable") summary(fit) ## ## Call: ## geeglm(formula = Y ~ factor(T) + period + factor(T) * period, ## family = gaussian(link = "identity"), data = longData, id = id, ## corstr = "exchangeable") ## ## Coefficients: ## Estimate Std.err Wald Pr(>|W|) ## (Intercept) 4.98268 0.07227 4753.4 <2e-16 *** ## factor(T)Yes 1.48555 0.10059 218.1 <2e-16 *** ## period 2.53946 0.05257 2333.7 <2e-16 *** ## factor(T)Yes:period 3.51294 0.07673 2096.2 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Estimated Scale Parameters: ## Estimate Std.err ## (Intercept) 2.952 0.07325 ## ## Correlation: Structure = exchangeable Link = identity ## ## Estimated Correlation Parameters: ## Estimate Std.err ## alpha 0.01737 0.01862 ## Number of clusters: 1000 Maximum cluster size: 3

**leave a comment**for the author, please follow the link and comment on their blog:

**ouR data generation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.