# Bootstrapping clustered data

[This article was first published on **R on Abhijit Dasgupta**, and kindly contributed to R-bloggers.]


When evaluating the sampling variability of different statistics, I’ll often use the bootstrap procedure to resample my data, compute the statistic on each sample, and look at the distribution of the statistic over several bootstrap samples.

In principle, the bootstrap is straightforward to do. However, if you have correlated data (such as repeated measures, longitudinal data, or circular data), the unit of sampling is no longer the individual data point but the second-level unit within which the data are correlated. A naive bootstrap of individual data points breaks the correlation structure of the data and distorts the resulting distributions. Resampling the second-level units instead is often called the *cluster bootstrap*.
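To make the difference concrete, here is a minimal base R sketch on simulated data. The data frame `toy`, the helper names `naive_mean` and `cluster_mean`, and all the simulation settings are hypothetical and only meant for illustration. With strong within-person correlation, the naive bootstrap typically understates the standard error of the mean compared to the cluster bootstrap.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): 20 people, 5 repeated measures each.
# A person-level random effect induces within-person correlation.
n_id  <- 20
n_rep <- 5
person_effect <- rnorm(n_id)
toy <- data.frame(
  ID = rep(seq_len(n_id), each = n_rep),
  y  = rep(person_effect, each = n_rep) + rnorm(n_id * n_rep, sd = 0.3)
)

# Naive bootstrap: resample individual rows, ignoring who they belong to.
naive_mean <- function(dat) {
  mean(dat$y[sample(nrow(dat), replace = TRUE)])
}

# Cluster bootstrap: resample whole people, keeping each person's rows together.
cluster_mean <- function(dat) {
  ids  <- sample(unique(dat$ID), replace = TRUE)
  rows <- unlist(lapply(ids, function(i) which(dat$ID == i)))
  mean(dat$y[rows])
}

# Bootstrap standard errors of the overall mean under each scheme
B <- 2000
se_naive   <- sd(replicate(B, naive_mean(toy)))
se_cluster <- sd(replicate(B, cluster_mean(toy)))
c(naive = se_naive, cluster = se_cluster)
```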

Let’s fix ideas using a data analysis I’m currently doing. We’re looking at a particular measurement taken around a spinal joint every 5 degrees. These measures are correlated within person, since the measurements share the common spine. So to bootstrap our dataset, we have to bootstrap the people and not the individual measurements. A few rows of the data are below:

ID | Angle | Measure |
---|---|---|
16 | -90 | 1 |
16 | -85 | 1 |
16 | -80 | 1 |
16 | -75 | 1 |
16 | -70 | 1 |
16 | -65 | 1 |

The Measure variable varies from 0 to 1. The Angle variable varies between -90 and 90 in 5-degree increments, giving 37 measurements per person.
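The real dataset isn't included with this post, so if you want to run the code that follows, one option is a simulated stand-in. The sketch below builds a hypothetical `d` with the same three columns. The number of people (30), the person-level shifts, and the shape of the curve are all assumptions made up for illustration, not features of the actual data.

```r
library(tibble)

set.seed(2018)

# Hypothetical stand-in for `d`: 30 people, each measured at 37 angles.
ids    <- 1:30
angles <- seq(-90, 90, by = 5)   # 37 values

# A person-specific shift on the logit scale makes measurements within a
# person correlated; plogis() keeps Measure between 0 and 1.
shift <- rnorm(length(ids), sd = 1)

d <- tibble(
  ID    = rep(ids, each = length(angles)),
  Angle = rep(angles, times = length(ids))
)
d$Measure <- plogis(shift[match(d$ID, ids)] - d$Angle / 45 +
                      rnorm(nrow(d), sd = 0.3))

head(d)
```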

Doing this computation is not difficult, but it becomes really straightforward using the `rsample` package developed by the RStudio crew, specifically Max Kuhn and Hadley Wickham. I was recently in a workshop Max taught in DC, where he introduced me to the `rsample` package, which, conveniently, has a `bootstraps` function. Now, this function has an option `strata` that can do stratified sampling. However, that is not the right tool, since that would sample from the individual measurements, just separately sampling by stratum. What we do need is to sample by the individuals.

Note that the `bootstraps` function samples rows from a data.frame or tibble object. In our situation, we need to sample *groups* of rows corresponding to each unique ID. However, we can utilize list-columns in tibbles to transform groups of rows into, effectively, single rows.

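```r
library(dplyr)  # %>% and the data verbs used below
library(tidyr)  # nest() and unnest()

# Collapse each person's rows into one row with a list-column named `data`
D <- d %>% nest(data = -ID)
head(D)
## # A tibble: 6 x 2
##      ID data
##   <int> <list>
## 1    16 <tibble [37 × 2]>
## 2    22 <tibble [37 × 2]>
## 3    38 <tibble [37 × 2]>
## 4    44 <tibble [37 × 2]>
## 5    30 <tibble [37 × 2]>
## 6    41 <tibble [37 × 2]>
```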

Now, we can use `bootstraps` on this new, compact tibble to sample by ID:

```r
library(rsample)
set.seed(154234)
bs <- bootstraps(D, times = 10)
bs
## # Bootstrap sampling
## # A tibble: 10 x 2
##    splits       id
##    <list>       <chr>
##  1 <S3: rsplit> Bootstrap01
##  2 <S3: rsplit> Bootstrap02
##  3 <S3: rsplit> Bootstrap03
##  4 <S3: rsplit> Bootstrap04
##  5 <S3: rsplit> Bootstrap05
##  6 <S3: rsplit> Bootstrap06
##  7 <S3: rsplit> Bootstrap07
##  8 <S3: rsplit> Bootstrap08
##  9 <S3: rsplit> Bootstrap09
## 10 <S3: rsplit> Bootstrap10
```
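As an aside, more recent releases of `rsample` (1.1.0 and later, as far as I know) include a `group_bootstraps` function that resamples whole groups directly, so the nesting step may not be needed there. A minimal sketch on simulated data, where `toy` and its contents are hypothetical:

```r
library(rsample)

set.seed(154234)

# Hypothetical long-format data: 10 people, 37 angles each.
toy <- data.frame(
  ID      = rep(1:10, each = 37),
  Angle   = rep(seq(-90, 90, by = 5), times = 10),
  Measure = runif(370)
)

# Resample whole IDs without nesting first (assumes a recent rsample).
gbs <- group_bootstraps(toy, group = ID, times = 10)

# Each analysis set should contain only complete people, so every
# sampled ID contributes a multiple of 37 rows.
first <- analysis(gbs$splits[[1]])
table(first$ID)
```

The nesting approach in this post works with older versions of the package too, and it makes the resampling unit explicit.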

You can read up about the `rsplit` object and how data is stored in this object in the `rsample` documentation. Let’s look at one of these bootstrap samples:

```r
# analysis() extracts the resampled rows stored in an rsplit
analysis(bs$splits[[1]]) %>% arrange(ID) %>% head()
## # A tibble: 6 x 2
##      ID data
##   <int> <list>
## 1     2 <tibble [37 × 2]>
## 2     7 <tibble [37 × 2]>
## 3     8 <tibble [37 × 2]>
## 4     9 <tibble [37 × 2]>
## 5     9 <tibble [37 × 2]>
## 6     9 <tibble [37 × 2]>
```

Notice that some IDs are sampled multiple times while others are not sampled at all, which is the nature of bootstrap sampling.
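If you want to see this directly, you can count how often each person appears in a sample. The sketch below uses simulated data, since the original data aren't available here; `toy`, `D_toy`, and the column name `times_drawn` are hypothetical. It relies on `analysis()` and `assessment()` from `rsample`, which return the sampled rows and the left-out rows of a split, respectively.

```r
library(rsample)
library(dplyr)
library(tidyr)

set.seed(154234)

# Hypothetical stand-in data: 30 people, 37 angles each.
toy <- tibble(
  ID      = rep(1:30, each = 37),
  Angle   = rep(seq(-90, 90, by = 5), times = 30),
  Measure = runif(30 * 37)
)
D_toy  <- toy %>% nest(data = -ID)
split1 <- bootstraps(D_toy, times = 10)$splits[[1]]

# How many times was each sampled person drawn?
draws <- analysis(split1) %>% count(ID, name = "times_drawn")
draws

# People who were never drawn end up in the assessment (out-of-bag) set.
left_out <- assessment(split1)$ID
length(left_out)
```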

If we want to assess the bootstrap distribution of the average Measure for each Angle, we can just `unnest` this tibble, and then assess the averages by Angle. This would give one bootstrap sample.

```r
# Expand the sampled people back to one row per measurement,
# then average Measure at each Angle
analysis(bs$splits[[1]]) %>%
  unnest(data) %>%
  group_by(Angle) %>%
  summarize(AvgMeasure = mean(Measure))
## # A tibble: 37 x 2
##    Angle AvgMeasure
##    <int>      <dbl>
##  1   -90      0.596
##  2   -85      0.557
##  3   -80      0.539
##  4   -75      0.532
##  5   -70      0.595
##  6   -65      0.530
##  7   -60      0.495
##  8   -55      0.480
##  9   -50      0.439
## 10   -45      0.383
## # ... with 27 more rows
```

We can now use `purrr` functions to get the bootstrap distribution over multiple bootstrap samples, and plot the sampled summaries against Angle.

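```r
library(purrr)
library(ggplot2)

bs <- bootstraps(D, times = 100)

# One tibble of per-Angle averages for each bootstrap sample,
# stacked with an identifier column `boots`
bs_AvgMeasure <- map(bs$splits, ~ analysis(.x) %>%
                       unnest(data) %>%
                       group_by(Angle) %>%
                       summarize(AvgMeasure = mean(Measure))) %>%
  bind_rows(.id = 'boots')

# One line per bootstrap sample
ggplot(bs_AvgMeasure, aes(Angle, AvgMeasure, group = boots)) +
  geom_line(alpha = 0.3) +
  theme_bw()
```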
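One way to summarize the bootstrap distribution numerically is a pointwise percentile interval at each Angle. This is a sketch of a possible follow-up rather than part of the analysis above: it uses simulated data, and the names `toy`, `D_toy`, `boot_means`, and `bands`, the 200 resamples, and the 95% level are all my own choices for illustration.

```r
library(rsample)
library(dplyr)
library(tidyr)
library(purrr)

set.seed(154234)

# Hypothetical stand-in data: 30 people, 37 angles each.
toy <- tibble(
  ID      = rep(1:30, each = 37),
  Angle   = rep(seq(-90, 90, by = 5), times = 30),
  Measure = runif(30 * 37)
)
D_toy <- toy %>% nest(data = -ID)

# Cluster bootstrap: resample people, then average Measure by Angle
bs_toy <- bootstraps(D_toy, times = 200)
boot_means <- map(bs_toy$splits, ~ analysis(.x) %>%
                    unnest(data) %>%
                    group_by(Angle) %>%
                    summarize(AvgMeasure = mean(Measure))) %>%
  bind_rows(.id = "boots")

# Pointwise 95% percentile interval of the bootstrap means at each Angle
bands <- boot_means %>%
  group_by(Angle) %>%
  summarize(lower = quantile(AvgMeasure, 0.025),
            upper = quantile(AvgMeasure, 0.975))
head(bands)
```

These intervals are pointwise, so they describe the variability at each Angle separately and not of the curve as a whole.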
