Site icon R-bloggers

Simulation studies in R with the ‘parSim’ package

[This article was first published on r – psychonetrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

(featuring missing data analysis)

Simulation studies are absolutely vital for methodological work to be validated and tested in multiple settings. One simulation study is good, but more simulation studies are always better. In the context of network estimation for example, simulation studies are currently the only way to go in assessing the sample size needed, and a simulation study varying new conditions pointed out crucial problems in network estimation practices. Writing a simulation study in R is not very hard, but can take some time to set up.

My approach used to be to put all my simulation conditions in expand.grid to create a big data frame with conditions, then loop over this data frame to perform my simulations, storing the results in a new data frame, and finally left_join the results to have a big overview of simulation results.

I wrote the same code over and over again, and figured it would be easier to write a function to do this. I ended up using the function over and over again, and figured it might as well be an R package. This ended up as the very small and simple parSim package, which only contains a single function. I have written some examples on this on the Github page.

This blog post will exemplify how to use parSim for your own research!

Simple example: regression analysis

Let’s start with a very simple example. Suppose I have two variables: x and y, and the true linear regression relationship between these takes the following form:

Y = 0.5 * X + error

Thus, in a regression analysis there is no intercept (beta0 = 0) and only one slope (beta1 = 0.5). We can simulate this in R:

sampleSize <- 500
trueBeta <- 0.5
set.seed(123)
X <- rnorm(sampleSize)
Y <- trueBeta * X + rnorm(sampleSize)

Next, we can run a regression analysis and obtain estimates for the intercept and the slope:

fit <- lm(Y ~ X)
coefs <- coef(fit)
coefs

##   (Intercept)             X 
## -0.0004677688  0.4460270974

Let’s suppose we are interested in the absolute bias between our estimates and the true values. We can compute these as follows:

# Intercept bias (beta0 = 0):
interceptBias <- abs(coefs[1])

# Slope bias:
slopeBias <- abs(coefs[2] - trueBeta)

Finally, we can store these in a nice list to keep the results contained:

Results <- list(
  interceptBias = interceptBias,
  slopeBias = slopeBias
)

Performing the simulations in parSim

Using the parSim package in general takes the following form:

library("parSim")

parSim(

  # << ENTER SIMULATION CONDITIONS HERE >>

  reps = # << NUMBER OF REPITITIONS >>,
  write = # << TRUE/FALSE >>,
  name = # << NAME OF THE SIMULATION STUDY >>,
  nCores = # << NUMBER OF CORES TO USE >>,
  expression = {

    # << ENTER SIMULATION CODE HERE>>

    # << OUTPUT MUST BE LIST OR DATA FRAME >>>  
  }
)

The write argument will control if we return results to an object (as data frame) or write them to a text file. Let’s set this to TRUE to store our results (the name argument will be used for the filename). The reps argument controls how many repetitions you want per condition. For testing purposes, set this to 2 or 10 or so, and when you are happy with your setup you can increase this to 100 or even higher. The nCores argument controls how many computer cores are used. If this is higher than 1, you need to load packages in the simulation code. I recommend leaving this to 1 for testing purposes, and only increasing the number if you are confident about your simulation setup.

The way parSim mainly works is that we assign simulation conditions in the function as argument, and then can use the same names in the R expression. That is, if I enter:

sampleSize = c(50, 100, 250, 500, 1000)

in the simulation conditions, I can use the object sampleSize in the simulation code! It will be varied between 50, 100, 250, 500, and 100 then. The output must be a list, or a data frame with a row for additional conditions (e.g., if I want to compare two estimation procedures). This means that we can easily use our code above in parSim:

library("parSim")

parSim(
  ### SIMULATION CONDITIONS
  sampleSize = c(50, 100, 250, 500, 1000),

  reps = 100, # 100 repititions
  write = TRUE, # Writing to a file
  name = "parSim_regression", # Name of the file
  nCores = 1, # Number of cores to use
  expression = {
     # True beta coefficient (random):
    trueBeta <- rnorm(1)

    # Generate data:
    X <- rnorm(sampleSize)
    Y <- trueBeta * X + rnorm(sampleSize)

    # Run analysis:
    fit <- lm(Y ~ X)

    # Store coefficients:
    coefs <- coef(fit)

    # Intercept bias (beta0 = 0):
    interceptBias <- abs(coefs[1])

    # Slope bias:
    slopeBias <- abs(coefs[2] - trueBeta)

    # Results list:
    Results <- list(
      interceptBias = interceptBias,
      slopeBias = slopeBias
    )

    # Return:
    Results
  }
)

The function wrote a file I can read:

table <- read.table("parSim_regression.txt", header = TRUE)

which contains a simulation condition per row:

head(table)

##   sampleSize rep id interceptBias   slopeBias error errorMessage
## 1       1000  45  1   0.004440148 0.038575289 FALSE           NA
## 2       1000  51  2   0.014662430 0.007679153 FALSE           NA
## 3        500  10  3   0.016685096 0.019276944 FALSE           NA
## 4        500  87  4   0.028294452 0.038963745 FALSE           NA
## 5        100   9  5   0.041550819 0.090290917 FALSE           NA
## 6        100  92  6   0.029564444 0.015805679 FALSE           NA

I can easily use this with dplyr, tidyr and ggplot2:

library("ggplot2")
library("dplyr")
library("tidyr")

# Gather results:
table_gather <- table %>% gather(measure,value,interceptBias:slopeBias)

# Plot:
ggplot(table_gather, aes(x=factor(sampleSize), y=value, fill = measure)) + 
  geom_boxplot(outlier.size = 0.5,lwd=0.5,fatten=0.5) + 
  xlab("Sample Size") + 
  ylab("") + 
  theme_bw() + 
  theme( panel.grid.major.x = element_blank(),panel.grid.minor.x = element_blank()) +
  geom_vline(xintercept=seq(1.5, length(unique(table_gather$sampleSize))-0.5, 1),
             lwd=0.5, colour="black", alpha = 0.25) 


This nicely shows that the bias in parameter estimation goes down as a function of sample size!

Detailed example: network estimation with missing data

This being the psychonetrics blog, it would be the more logical to investigate the network estimation performance using psychonetrics rather than the performance of the lm function. Let’s investigate something that is actually quite a new topic of research: missing data analysis. To start, we need to generate data. We can use the bootnet package to generate a “true” network to simulate data under:

set.seed(123)
library("bootnet")
trueNet <- genGGM(10, nei = 2, propPositive = 0.8, p = 0.25)

This network consists of 10 nodes, with mostly positive edges and not unlike we see in the literature some times:

library("qgraph")
graph <- qgraph(trueNet, layout = "spring", theme = "colorblind")


Next, we need to generate data. We can use the bootnet package again for this: the ggmGenerator function generates a data generating function (I know… it is a bit confusing):

sampleSize <- 500
generator <- ggmGenerator()
Data <- generator(sampleSize, trueNet)

Next, we can add some missing data (10\% missing completely at random):

missing <- 0.1
for (i in 1:ncol(Data)){
  Data[runif(sampleSize) < missing,i] <- NA
}

Now, we can estimate our Gaussian graphical model, using the full information maximum likelihood estimator (FIML) in psychonetrics. To do this, we first need to create a model:

library("psychonetrics")
library("dplyr")
mod <- ggm(Data, estimator = "FIML")

This is a fully connected network model. It is not yet evaluated, so we need to run the model first:

mod <- mod %>% runmodel

Next, we can prune edges that are not significant. For example, we could use a false discovery rate (FDR) adjustment together with an alpha of 0.05. After pruning, psychonetrics will re-estimate the model, optimizing parameters taking into account that edges have been removed:

mod <- mod %>% prune(adjust = "fdr", alpha = 0.05, recursive = FALSE)

There are other ways to estimate a network, for example by using stepup to perform step-up estimation rather than step-down estimation, or by combining prune and stepup. A nice practice is to test out the differences between these algorithms in a simulation study yourself.

Now that we have a network model, we need to see if it is good. Let’s first retrieve it and plot it:

estNet <- getmatrix(mod, "omega")
graph <- qgraph(estNet, layout = graph$layout, theme = "colorblind")


This looks quite similar to the original network, but it would be best to use some statistics for this. I often use a small function compareNetworks (available here) for this:

source("compareNetworks.R")
results <- compareNetworks(trueNet, estNet)
results[c("sensitivity", "specificity", "correlation")]

## $sensitivity
## [1] 0.85
## 
## $specificity
## [1] 1
## 
## $correlation
## [1] 0.9343398

We see that our estimation is highly conservative (high specificity), but also picks up many edges (moderate to high sensitivity) with a good correspondence between edge weights (moderate to high correlation). This is good! Let’s combine all these codes in a nicely contained code to run this simulation:

# Conditions:
sampleSize <- 500
missing <- 0.1

### Start simulation code ###
# Packages:
library("psychonetrics")
library("dplyr")
library("bootnet")

# Function needed:
source("compareNetworks.R")

# Generate true network:
trueNet <- genGGM(10, nei = 2, propPositive = 0.8, p = 0.25)

# Generate data:
generator <- ggmGenerator()
Data <- generator(sampleSize, trueNet)

# Add missings:
for (i in 1:ncol(Data)){
  Data[runif(sampleSize) < missing,i] <- NA
}

# Estimate model (FDR prune):
mod <- ggm(Data, estimator = "FIML") %>% 
  runmodel %>% 
  prune(adjust = "fdr", alpha = 0.05, recursive = FALSE)

# Extract network:
estNet <- getmatrix(mod, "omega")

# Compare to true network:
results <- compareNetworks(trueNet, estNet)

Pretty!

The resulting object results is already a list with values to return, so we don’t have to change much to this code to put it in parSim! The parSim code for performing this simulation study, varying sample size and missingness and repeating each condition 100 times (using 8 computer cores) is available here. It takes some time to run (about half an hour on a good computer for me). You can download results from this simulation study here if you do not want to wait for it to run.

Now, we can plot the results using the useful tidyverse packages again:

library("ggplot2")
library("dplyr")
library("tidyr")

# Read results:
table <- read.table("missingdata_sims_1.txt", header = TRUE)

# Label missings nicer:
table$missingFactor <- factor(table$missing, 
     levels = sort(unique(table$missing)),
     labels = paste0("Missingness: ",sort(unique(table$missing))))

# Gather results:
table_gather <- table %>% gather(measure,value,sensitivity:correlation)

# Plot:
ggplot(table_gather, aes(x=factor(sampleSize), y=value)) + 
   facet_grid(missingFactor ~ measure) + 
  geom_boxplot(outlier.size = 0.5,lwd=0.5,fatten=0.5) + 
  xlab("Sample Size") + 
  ylab("") + 
  theme_bw() + 
  theme( panel.grid.major.x = element_blank(),panel.grid.minor.x = element_blank()) +
  geom_vline(xintercept=seq(1.5, length(unique(table_gather$sampleSize))-0.5, 1),
             lwd=0.5, colour="black", alpha = 0.25) 


It seems that FIML estimation coupled with FDR pruning works very well for handling missing data!

We can extend our simulation study, and compare our results to, for example, mice imputation coupled with the ggmModSelect algorithm. The parSim code for such a simulation study is available here and results from a run on my computer are available here. Using almost the exact same code as above (I only set fill = method in aes()) gives the following results:

This clearly shows that ggmModSelect loses its specificity (detects more false edges) when we introduce missingness.

Tips for performing simulation studies

To conclude this blog post, here are some tips when setting up simulation studies:

That’s all for this blog post! Good luck with setting up your own simulation study!

To leave a comment for the author, please follow the link and comment on their blog: r – psychonetrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.