# Flexible simulation in simstudy with customized distribution functions

**ouR data generation**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Really, the only problem with the `simstudy`

package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 – see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as you are willing to provide some extra code.

I am going to illustrate this with two examples, first by implementing a truncated normal distribution, and second by implementing the flexible non-linear data generating algorithm that I described last time.

Before we get going, here are the necessary libraries:

library(simstudy) library(data.table) library(msm) library(ggplot2) library(mgcv)

#### General concept

In the data definition step, it is possible to specify any valid `R`

function in the *formula* argument. If *dist* is specified as “nonrandom”, then `simstudy`

will generate data based on that function. (Yes, the specification as “nonrandom” is a bit awkward since we are defining a stochastic data generating process in this case; in future versions I plan to allow *dist* to be specified as “custom” to make this less dissonant.)

In this example, I want to be able to generate data from a truncated normal distribution. There is an existing function `rtnorm`

in the `msm`

package that I can take advantage here. What I have done is essentially create a wrapper function that makes a single draw from the truncated distribution with a specified mean, standard deviation, and pair of truncation bounds:

trunc_norm <- function(mean, sd, lower, upper) { rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper) }

Now that `trunc_norm`

has been created, I am free to use this is in a data definition statement. And even more important, the call to `trunc_norm`

can depend on other variables; in this case, I have created binary variable *x* that will determine the upper and lower bounds of the distribution. When \(x=0\), the \(N(0, 3.5^2)\) distribution is truncated at -5 and 5, and when \(x=1\), the distribution is truncated at -8 and 8.

defI <- defData(varname = "x", formula = 0.5, dist = "binary") defI <- defData(defI, varname = "y", formula = "trunc_norm(mean = 0, sd = 3.5, lower = -5 + -3*x, upper = 5 + 3*x)", dist = "nonrandom")

The generated data appear to have the properties that we would expect:

dd <- genData(1000, defI)

#### Application to non-linear data generation

Last time, I described an approach to generate a variable \(y\) that has a non-linear response with respect to an input variable \(x\). At the end of that post, I created two functions, one of which can be referred to in the `defData`

statement to generate the data. (I plan on implementing these functions in `simstudy`

, but I was eager to get the concept out there in case any one has some suggestions or could use this feature right away.)

In the first step, I need to generate a smooth function by specifying a few points. I do this by calling `getNLfunction`

. (If you want the code for this, let me know, but I actually provided most of it last week.) The variable `nlf`

is an object that contains the function:

dpoints <- data.table(x = c(20, 30, 53, 65, 80), y = c(15, 44, 60, 55, 35)) nlf <- getNLfunction(dpoints)

The function `genNL`

makes predictions based on the `nlf`

object and adds a little Gaussian noise. We use the same approach as we did above for the truncated normal to generate different responses \(y\) based on the level of \(x\):

def <- defData(varname = "x", formula = "20;80", dist = "uniform") def <- defData(def, varname = "y", formula = "genNL(nf = ..nlf, x, sd = 10)", dist = "nonrandom") dd <- genData(300, def)

And if we introduce much less noise, we get much closer to the original underlying function specified by our points:

def <- defData(varname = "x", formula = "20;80", dist = "uniform") def <- defData(def, varname = "y", formula = "genNL(nf = ..nlf, x, sd = 0.5)", dist = "nonrandom") dd <- genData(300, def)

**leave a comment**for the author, please follow the link and comment on their blog:

**ouR data generation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.