# A closer look at replicate() and purrr::map() for simulations

**Very statisticious on Very statisticious**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve done a couple of posts so far on simulations, here and here, where I demonstrate how to build a function for simulating data from a defined linear model and then explore long-run behavior of models fit to the simulated datasets. The focus of those posts was on the general simulation process, and I didn’t go into much detail on the specific R code. In this post I’ll focus in on the code I use for repeatedly simulating data and extracting output, specifically talking about the function `replicate()`

and the *map* family of functions from package **purrr**.

## Table of Contents

# R packages

I’ll use package **purrr** for looping, **dplyr** for any data manipulation, and **ggplot2** for plotting. I’ll also use (but not load), package **broom** for extracting output from models.

library(purrr) # v. 0.2.5 suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.5 library(ggplot2) # v 2.2.1

# The replicate() function

The `replicate()`

function is a member of the *apply* family of functions in base R.

Specifically, from the documentation:

`replicate`

is a wrapper for the common use of`sapply`

for repeated evaluation of an expression (which will usually involve random number generation).

Notice the documentation mentions *repeated evaluations* and that the use of `replicate()`

involves *random number generation*. Those are primary parts of the simulations I do. While I don’t actually know the *apply* family of functions very well, I use `replicate()`

a lot (although also see `purrr::rerun()`

). Using `replicate()`

is an alternative to building a `for()`

loop to repeatedly simulate new values.

The `replicate()`

function takes three arguments:

`n`

, which is the number of replications to perform. This is where I set the number of simulations I want to run.`expr`

, the expression that should be run repeatedly. I’ve only ever used a function here.`simplify`

, which controls the type of output the results of`expr`

are saved into. Use`simplify = FALSE`

to get vectors saved into a list instead of in an array.

## Simple example of replicate()

Let’s say I wanted to simulate some values from a normal distribution, which I can do using the `rnorm()`

function. Below I’ll simulate five values from a normal distribution with a mean of 0 and a standard deviation of 1 (which are the defaults for `mean`

and `sd`

arguments, respectively).

Since I’m going generate random numbers I’ll set the seed so anyone following along at home will see the same values.

set.seed(16) rnorm(5, mean = 0, sd = 1)

# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293

Using `rnorm()`

directly gives me a single set of simulated values. How do I simulate 5 values from this same distribution multiple times? This is where `replicate()`

comes in. It allows me to run the function I put in `expr`

exactly `n`

times.

Here I’ll ask for three runs of 5 values each. Notice I use `simplify = FALSE`

to get a list as output.

The output below is a list of three vectors. Each vector is from a unique run of the function, so contains five random numbers drawn from the normal distribution with a mean of 0 and standard deviation of 1.

set.seed(16) replicate(n = 3, rnorm(5, 0, 1), simplify = FALSE )

# [[1]] # [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293 # # [[2]] # [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202 # # [[3]] # [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206

Note if I don’t use `simplify = FALSE`

I will get a matrix of values instead of a list. Each column in the matrix is the output from one run of the function. In this case there will be three columns in the output, one for each run, and 5 rows. This can be a useful output type for simulations. I focus on list output throughout the rest of this post only because that’s what I have been using recently for simulations.

set.seed(16) replicate(n = 3, rnorm(5, 0, 1) )

# [,1] [,2] [,3] # [1,] 0.4764134 -0.46841204 1.8471821 # [2,] -0.1253800 -1.00595059 0.1119334 # [3,] 1.0962162 0.06356268 -0.7460373 # [4,] -1.4442290 1.02497260 1.6582137 # [5,] 1.1478293 0.57314202 0.7217206

## An equivalent for() loop example

A `for()`

loop can be used in place of `replicate()`

for simulations. With time and practice I’ve found `replicate()`

to be much more convenient in terms of writing the code. However, in my experience some folks find `for()`

loops intuitive when they are starting out in R. I think it’s because `for()`

loops are more explicit on the looping process: the user can see that `i`

is looped over and the output for each `i`

iteration is saved into the output object because the code is written out explicitly.

In my example I’ll save the output of each iteration of the loop into a list called `list1`

. I initialize this as an empty list prior to starting the loop. To match what I did with `replicate()`

I do three iterations of the loop (`i in 1:3`

), drawing 5 values via `rnorm()`

each time.

The result is identical to my `replicate()`

code above. It took a little more code to do it but the process is very clear since it is explicitly written out.

set.seed(16) list1 = list() # Make an empty list to save output in for (i in 1:3) { # Indicate number of iterations with "i" list1[[i]] = rnorm(5, 0, 1) # Save output in list for each iteration } list1

# [[1]] # [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293 # # [[2]] # [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202 # # [[3]] # [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206

## Using replicate() on a user-made function

When I do simulations to explore the behavior of linear models under different scenarios I will create a function to simulate the data and fit the model. For example, here’s a function I used in an earlier blog post to simulate data from and then fit a two group linear model.

twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) { ngroup = 2 group = rep( c("group1", "group2"), each = nrep) eps = rnorm(ngroup*nrep, 0, sigma) growth = b0 + b1*(group == "group2") + eps growthfit = lm(growth ~ group) growthfit }

The output is a model fit to data generated from the fixed and random (residual) effects.

twogroup_fun()

# # Call: # lm(formula = growth ~ group) # # Coefficients: # (Intercept) groupgroup2 # 4.686 -1.267

To explore the long-run behavior in my simulated scenario I will repeat the data generation and model fitting many times using `replicate()`

. The result is a list of fitted models. I’ll run the function 5 times and save the result as `sim_lm`

to use throughout the next section on `map()`

.

sim_lm = replicate(5, twogroup_fun(), simplify = FALSE ) length(sim_lm)

# [1] 5

# Using purrr::map() for looping through lists

So I have a list of fitted models from `replicate()`

; now what?

The `replicate()`

function was about repeatedly running a function. Once I have the repeated runs I can explore the long-run behavior of some statistic by extracting value(s) from the resulting models. This involves looping through the list of models.

Looping through the list can be done using a `for()`

loop, but I prefer to use functions that do the looping without all the typing. In particular, these days I use the *map* family of functions from the **purrr** package to loop through lists. Before **purrr** I primarily used `lapply()`

(the only other *apply* family function that I know 😆).

The `map()`

function takes a list as input and puts the output into a list of the same length. The first argument to `map()`

is the list to loop through and the second argument is the function to apply to each element of the list.

For example, I can pull out the coefficients of each model in my 5-run simulation by looping through `sim_lm`

and applying the `coef()`

function to each list element.

map(sim_lm, coef)

# [[1]] # (Intercept) groupgroup2 # 5.189474 -1.715602 # # [[2]] # (Intercept) groupgroup2 # 4.670188 -1.965463 # # [[3]] # (Intercept) groupgroup2 # 5.231922 -2.589953 # # [[4]] # (Intercept) groupgroup2 # 6.285158 -3.195090 # # [[5]] # (Intercept) groupgroup2 # 4.3296875 -0.9724314

## Other variants of map() for non-list outputs

There are many variants of `map()`

that are convenient for saving results into something other than a list. For example, if I am going to extract a single numeric value from each model, such as \(R^2\), I might want the output to be a numeric vector instead of a list. I can use `map_dbl()`

for this.

The unadjusted \(R^2\) from a model fit with `lm()`

can be pulled from the model `summary()`

output. The code looks like: `summary(model)$r.squared`

,

where “model” is a fitted model.

So getting \(R^2\) involves extracting a value after applying a function, which isn’t quite as straightforward as applying a single function to every model in the list like I did with `coef()`

. This gives me a chance to demonstrate the formula coding styling available in *map* functions. In formula coding a tilde (`~`

) goes in front of the function and `.x`

refers to the list element.

map_dbl(sim_lm, ~summary(.x)$r.squared)

# [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695

If you don’t like the formula style you can use an anonymous function inside *map* functions, where the function argument is used to refer to the list element.

map_dbl(sim_lm, function(x) summary(x)$r.squared)

# [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695

For data.frame output we can use `map_dfr()`

for row binding or *stacking* results together into a single data.frame.

Estimated coefficients, their standard errors, and their statistical tests from models fit with `lm()`

can be extracted into a tidy data.frame using `broom::tidy()`

. Looping through the results and doing this for each model via `map_dfr()`

will put the output in one data.frame instead of storing the individual data.frames for each model as one element of a list.

map_dfr(sim_lm, broom::tidy)

# # A tibble: 10 x 5 # term estimate std.error statistic p.value # <chr> <dbl> <dbl> <dbl> <dbl> # 1 (Intercept) 5.19 0.526 9.87 0.0000000109 # 2 groupgroup2 -1.72 0.744 -2.31 0.0331 # 3 (Intercept) 4.67 0.745 6.27 0.00000654 # 4 groupgroup2 -1.97 1.05 -1.87 0.0785 # 5 (Intercept) 5.23 0.733 7.13 0.00000120 # 6 groupgroup2 -2.59 1.04 -2.50 0.0224 # 7 (Intercept) 6.29 0.672 9.36 0.0000000246 # 8 groupgroup2 -3.20 0.950 -3.36 0.00346 # 9 (Intercept) 4.33 0.621 6.97 0.00000164 # 10 groupgroup2 -0.972 0.878 -1.11 0.283

The `map_dfr()`

function has an additional argument, `.id`

, which can be used to store the list names (if the original list had names) or add the list index to the output (if it didn’t have names). I’m using a list that has no names, so each unique model output will be assigned its index number if I use the `.id`

argument. The name of the new column is given as a string to `.id`

.

map_dfr(sim_lm, broom::tidy, .id = "model")

# # A tibble: 10 x 6 # model term estimate std.error statistic p.value # <chr> <chr> <dbl> <dbl> <dbl> <dbl> # 1 1 (Intercept) 5.19 0.526 9.87 0.0000000109 # 2 1 groupgroup2 -1.72 0.744 -2.31 0.0331 # 3 2 (Intercept) 4.67 0.745 6.27 0.00000654 # 4 2 groupgroup2 -1.97 1.05 -1.87 0.0785 # 5 3 (Intercept) 5.23 0.733 7.13 0.00000120 # 6 3 groupgroup2 -2.59 1.04 -2.50 0.0224 # 7 4 (Intercept) 6.29 0.672 9.36 0.0000000246 # 8 4 groupgroup2 -3.20 0.950 -3.36 0.00346 # 9 5 (Intercept) 4.33 0.621 6.97 0.00000164 # 10 5 groupgroup2 -0.972 0.878 -1.11 0.283

Further arguments to the function used within `map()`

can be passed as additional arguments. For example, I can add confidence intervals for estimated coefficients when using the `tidy.lm()`

function via `conf.int = TRUE`

. If I want to get confidence intervals for all models I add this as an additional argument in `map_dfr()`

.

map_dfr(sim_lm, broom::tidy, conf.int = TRUE)

# # A tibble: 10 x 7 # term estimate std.error statistic p.value conf.low conf.high # <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> # 1 (Intercept) 5.19 0.526 9.87 0.0000000109 4.08 6.29 # 2 groupgroup2 -1.72 0.744 -2.31 0.0331 -3.28 -0.153 # 3 (Intercept) 4.67 0.745 6.27 0.00000654 3.10 6.24 # 4 groupgroup2 -1.97 1.05 -1.87 0.0785 -4.18 0.248 # 5 (Intercept) 5.23 0.733 7.13 0.00000120 3.69 6.77 # 6 groupgroup2 -2.59 1.04 -2.50 0.0224 -4.77 -0.411 # 7 (Intercept) 6.29 0.672 9.36 0.0000000246 4.87 7.70 # 8 groupgroup2 -3.20 0.950 -3.36 0.00346 -5.19 -1.20 # 9 (Intercept) 4.33 0.621 6.97 0.00000164 3.02 5.63 # 10 groupgroup2 -0.972 0.878 -1.11 0.283 -2.82 0.873

The *map* family of functions can easily be used with pipes as one step in a chain of functions. I can, for example, take the estimates I get using `broom::tidy()`

, pull out the estimated intercepts, and then plot a histogram of those estimates. I’ll use functions from packages **dplyr** and **ggplot2** for this.

You can see all the steps in the pipe chain below. I loop through `sim_lm`

using `map_dfr()`

to extract the coefficients from each element of the list and output a data.frame of results. I use `dplyr::filter()`

to keep only the rows with estimated intercepts and then plot a histogram of these estimates for the whole simulation with `ggplot2::qplot()`

.

sim_lm %>% map_dfr(broom::tidy) %>% filter(term == "(Intercept)") %>% qplot(x = estimate, data = ., geom = "histogram")

# `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are more variants of `map()`

that you might find useful, both for simulations and in other iterative work. See the documentation for `map()`

(`?map`

) to see all of them along with additional examples.

# Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(purrr) # v. 0.2.5 suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.5 library(ggplot2) # v 2.2.1 set.seed(16) rnorm(5, mean = 0, sd = 1) set.seed(16) replicate(n = 3, rnorm(5, 0, 1), simplify = FALSE ) set.seed(16) replicate(n = 3, rnorm(5, 0, 1) ) set.seed(16) list1 = list() # Make an empty list to save output in for (i in 1:3) { # Indicate number of iterations with "i" list1[[i]] = rnorm(5, 0, 1) # Save output in list for each iteration } list1 twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) { ngroup = 2 group = rep( c("group1", "group2"), each = nrep) eps = rnorm(ngroup*nrep, 0, sigma) growth = b0 + b1*(group == "group2") + eps growthfit = lm(growth ~ group) growthfit } twogroup_fun() sim_lm = replicate(5, twogroup_fun(), simplify = FALSE ) length(sim_lm) map(sim_lm, coef) map_dbl(sim_lm, ~summary(.x)$r.squared) map_dbl(sim_lm, function(x) summary(x)$r.squared) map_dfr(sim_lm, broom::tidy) map_dfr(sim_lm, broom::tidy, .id = "model") map_dfr(sim_lm, broom::tidy, conf.int = TRUE) sim_lm %>% map_dfr(broom::tidy) %>% filter(term == "(Intercept)") %>% qplot(x = estimate, data = ., geom = "histogram")

**leave a comment**for the author, please follow the link and comment on their blog:

**Very statisticious on Very statisticious**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.