An example of base::split() for looping through groups

Very statisticious on Very statisticious

2 years ago

[This article was first published on Very statisticious, and kindly contributed to R-bloggers.]


I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. “Simplest” could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using base::split() as a possible option since it is one I find fairly approachable.

It turns out I don’t have a go-to example for how to get started with a split() approach. So here’s a quick blog post about it!

Load R packages

I’ll load purrr for looping through lists.

library(purrr) # 0.3.3

A dataset with groups

I made a small dataset to use with split(). The id variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, var1 and var2.

dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA, 
-30L))

head(dat)
#   id var1 var2
# 1  a  4.0  6.0
# 2  a  2.7 22.3
# 3  a  3.4 19.4
# 4  a  2.7 22.8
# 5  a  4.6 18.6
# 6  a  2.9 14.2

Create separate data.frames per group

If the goal is to apply a function to the data in each group, we need to pull out a dataset for each id. One approach is to make a subset for each group and then apply the function of interest to that subset. A classic way to do this would be to do the subsetting within a for() loop.
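For comparison, here is a minimal sketch of what that classic for() loop might look like. The toy data.frame and the choice of mean() as the function of interest are made up for illustration; they are not part of the dataset used in this post.

```r
# Hypothetical toy data (not the post's `dat`): two groups, three rows each
toy = data.frame(id = rep(c("a", "b"), each = 3),
                 var1 = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# Pre-allocate a named list to hold one result per group
groups = unique(toy$id)
means = vector("list", length(groups))
names(means) = groups

# Subset manually inside the loop, then apply the function of interest
for (g in groups) {
     sub = toy[toy$id == g, ]
     means[[g]] = mean(sub$var1)
}

means
```

The bookkeeping here (finding the groups, pre-allocating, subsetting, naming) is what split() plus a list loop takes care of.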

This is a situation where I find split() to be really convenient. It splits the data by a defined group variable so we don’t have to subset things manually.

The output from split() is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of split() handy for keeping track of groups in subsequent steps.

Here’s an example, where I split dat by the id variable.

dat_list = split(dat, dat$id)
dat_list
# $a
#    id var1 var2
# 1   a  4.0  6.0
# 2   a  2.7 22.3
# 3   a  3.4 19.4
# 4   a  2.7 22.8
# 5   a  4.6 18.6
# 6   a  2.9 14.2
# 7   a  2.2 10.9
# 8   a  4.5 22.7
# 9   a  4.6 22.4
# 10  a  2.4 11.7
# 
# $b
#    id var1 var2
# 11  b  3.0  6.0
# 12  b  3.8 13.3
# 13  b  2.5 12.5
# 14  b  4.0  6.3
# 15  b  3.6 13.6
# 16  b  2.7 20.5
# 17  b  4.5 23.6
# 18  b  4.1 10.9
# 19  b  4.2  8.9
# 20  b  2.2 20.9
# 
# $c
#    id var1 var2
# 21  c  4.9 23.7
# 22  c  4.4 15.9
# 23  c  3.6 22.1
# 24  c  3.3 11.6
# 25  c  2.7 22.0
# 26  c  3.9 17.7
# 27  c  4.9 21.0
# 28  c  4.9 20.8
# 29  c  4.3 16.7
# 30  c  3.4 21.4
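Because the list is named, individual groups are easy to inspect. The sketch below uses a small made-up data.frame (`toy` is a stand-in, not the `dat` object above) to show a few ways I might poke at the result of split().

```r
# Hypothetical toy data standing in for the post's `dat`
toy = data.frame(id = factor(rep(c("a", "b", "c"), each = 2)),
                 var1 = c(4, 2.7, 3, 3.8, 4.9, 4.4))

toy_list = split(toy, toy$id)

names(toy_list)         # group values become the element names
toy_list$b              # pull out one group's data.frame by name
toy_list[["b"]]         # same result; handy when the name is stored in a variable
sapply(toy_list, nrow)  # number of rows per group
```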

Looping through the list

Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.

For example, if I want to fit a linear model of var1 vs var2 for each group I might do the looping with purrr::map() or lapply().

Each element of the new list still has the grouping information attached via the list names.

map(dat_list, ~lm(var1 ~ var2, data = .x) )
# $a
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     2.64826      0.04396  
# 
# 
# $b
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.80822     -0.02551  
# 
# 
# $c
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.35241      0.03513
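The same loop can be written with base lapply() instead of purrr::map(). Here is a rough sketch on made-up data (the `toy` data.frame and the variable names x and y are assumptions for illustration), which also pulls a single number out of each fitted model.

```r
# Hypothetical toy data (not the post's `dat`)
toy = data.frame(id = rep(c("a", "b"), each = 4),
                 x = c(1, 2, 3, 4, 1, 2, 3, 4),
                 y = c(2, 4, 6, 8, 5, 4, 3, 2))
toy_list = split(toy, toy$id)

# lapply() with an anonymous function plays the role of map() with ~ and .x
fits = lapply(toy_list, function(d) lm(y ~ x, data = d))

# Pull the slope from each fitted model; vapply() returns a named numeric vector
slopes = vapply(fits, function(fit) coef(fit)[["x"]], numeric(1))
slopes
```

The names of `slopes` are still the group names, so the link back to each group carries through both steps.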

I could also create a function that fits a model and then returns model output. For example, maybe what I really want to do is fit a linear model and extract \(R^2\) for each group's model fit.

r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}

The output of my r2 function, which uses broom::glance(), is a one-row data.frame (a tibble).

r2(data = dat)
# # A tibble: 1 x 11
#   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
# 1    0.0292      -0.00550 0.867     0.841   0.367     2  -37.3  80.5  84.7
# # ... with 2 more variables: deviance <dbl>, df.residual <int>

Since the function output is a data.frame, I can use purrr::map_dfr() to combine the output per group into a single data.frame. The .id argument creates a new variable to store the list names in the output.

map_dfr(dat_list, r2, .id = "id")
# # A tibble: 3 x 12
#   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
# 1 a        0.0775       -0.0378 0.968     0.672   0.436     2  -12.7  31.5
# 2 b        0.0387       -0.0815 0.832     0.322   0.586     2  -11.2  28.5
# 3 c        0.0285       -0.0930 0.808     0.235   0.641     2  -10.9  27.9
# # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
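If purrr and broom aren't available, a similar result can be put together in base R. This is only a sketch on made-up data: `toy` and the helper `r2_base()` are hypothetical names, and it returns just \(R^2\) rather than everything glance() reports. Note also that in purrr 1.0.0 and later, map_dfr() is marked as superseded in favor of map() followed by list_rbind(), so newer code may look a little different from what is shown above.

```r
# Hypothetical toy data (not the post's `dat`)
toy = data.frame(id = rep(c("a", "b"), each = 4),
                 x = c(1, 2, 3, 4, 1, 2, 3, 4),
                 y = c(2, 3, 5, 4, 5, 1, 4, 2))
toy_list = split(toy, toy$id)

# Hypothetical helper: return a one-row data.frame with R^2 for one group
r2_base = function(data) {
     fit = lm(y ~ x, data = data)
     data.frame(r.squared = summary(fit)$r.squared)
}

# Apply the helper to each group, then stack the one-row results
res_list = lapply(toy_list, r2_base)
res = do.call(rbind, res_list)

# Store the list names in a column, much as .id = "id" does in map_dfr()
res = cbind(id = names(res_list), res)
rownames(res) = NULL
res
```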

Splitting by multiple groups

It is possible to split data by multiple grouping variables in the split() function. The grouping variables must be passed as a list.

Here’s an example, using the built-in mtcars dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a . (but see the sep argument to control this).

mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]
# $`4.0`
#                mpg cyl  disp hp drat    wt  qsec vs am gear carb
# Merc 240D     24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
# Merc 230      22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
# Toyota Corona 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
# 
# $`6.0`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
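As a quick illustration of the sep argument, here is the same split with an underscore as the separator. The object name `mtcars_cylam2` is just something I made up for this sketch.

```r
# Same grouping as above, but join the group values with "_" instead of "."
mtcars_cylam2 = split(mtcars, list(mtcars$cyl, mtcars$am), sep = "_")

# The list names now use the new separator
names(mtcars_cylam2)
```

A different separator can be handy if the group values themselves contain a period, since that makes the combined names ambiguous.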

If not all combinations of the groups are present in the data, setting drop = TRUE in split() drops the missing combinations. By default (drop = FALSE), combinations that aren’t present are kept as 0-row data.frames.
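Here is a small sketch of that behavior. It splits mtcars by cyl and gear rather than cyl and am, since as far as I can tell there are no 8-cylinder cars with 4 gears in mtcars. The object names `keep_empty` and `no_empty` are made up for this example.

```r
# Default behavior: every cyl-by-gear combination gets a list element,
# including any combination that has no rows
keep_empty = split(mtcars, list(mtcars$cyl, mtcars$gear))
sapply(keep_empty, nrow)

# drop = TRUE removes combinations that do not occur in the data
no_empty = split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)
sapply(no_empty, nrow)

# How many empty elements were dropped
length(keep_empty) - length(no_empty)
```

Dropping empty elements matters when the function in the loop can't handle a data.frame with no rows, which is the case for something like lm().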

Other thoughts on split()

I feel like split() was a gateway function for me to get started working with lists and associated convenience functions like lapply() and purrr::map() for looping through lists. I think learning to work with lists and “list loops” also made the learning curve for list-columns in data.frames and the nest()/unnest() approach of analysis-by-groups a little less steep for me.

Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(purrr) # 0.3.3

dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA, 
-30L))

head(dat)

dat_list = split(dat, dat$id)
dat_list

map(dat_list, ~lm(var1 ~ var2, data = .x) )

r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}
r2(data = dat)

map_dfr(dat_list, r2, .id = "id")

mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]
