# Expanding binomial counts to binary 0/1 with purrr::pmap()

**Very statisticious on Very statisticious**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data on successes and failures can be summarized and analyzed as counted proportions via the binomial distribution or as long format 0/1 binary data. I most often see summarized data when there are multiple trials done within a study unit; for example, when tallying up the number of dead trees out of the total number of trees in a plot.

If these within-plot trials are all independent, analyzing data in a binary format instead of summarized binomial counts doesn’t change the statistical results. If trials are not independent, though, neither approach works correctly and we would see overdispersion/underdispersion in a binomial model. The confusing piece in this is that binary data by definition can’t be overdispersed and so the lack of fit from non-independence can’t be diagnosed with standard overdispersion checks when working with binary data.

In a future post I’ll talk more about simulating data to explore binomial overdispersion and how lack of fit can be diagnosed in binomial vs binary datasets. Today, however, my goal is show how to take binomial count data and expand it into binary data.

In the past I’ve done the data expansion with `rowwise()`

and `do()`

from package **dplyr**, but these days I’m using `purrr::pmap_dfr()`

. I’ll demonstrate the `pmap_dfr()`

approach as well as a `nest()`

/`unnest()`

approach using functions from **tidyr**.

## Table of Contents

# Load R packages

I’m using **purrr** for looping through rows with `pmap_dfr()`

. I also load **dplyr** and **tidyr** for a `nest()`

/`unnest()`

approach.

library(purrr) # 0.3.2 library(tidyr) # 1.0.0 library(dplyr) # 0.8.3

# The dataset

I created a dataset with a total of 8 plots, 4 plots in each of two groups. The data has been summarized up to the plot level. The number of trials (`total`

) per plot varied. The number of successes observed is in `num_dead`

.

dat = structure(list(plot = structure(1:8, .Label = c("plot1", "plot2", "plot3", "plot4", "plot5", "plot6", "plot7", "plot8"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g1", "g2"), class = "factor"), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = "data.frame", row.names = c(NA, -8L)) dat # plot group num_dead total # 1 plot1 g1 4 5 # 2 plot2 g1 6 7 # 3 plot3 g1 6 9 # 4 plot4 g1 5 7 # 5 plot5 g2 1 8 # 6 plot6 g2 4 10 # 7 plot7 g2 3 10 # 8 plot8 g2 2 7

# Expanding binomial to binary with pmap_dfr()

To make the binomial data into binary data, I need to make a vector with a \(1\) for every “success” listed in `num_dead`

and a \(0\) for every “failure” (the total number of trials minus the number of successes). Since I want to do a *rowwise* operation I’ll use one of the `pmap`

functions. I want my output to be a data.frame so I use `pmap_dfr()`

.

I use an anonymous function within `pmap_dfr()`

for creating the output I want from each row. I purposely make the names of the function arguments match the column names. You can either match on position or on names in `pmap`

functions, and I tend to go for name matching. You can use formula coding with the tilde in `pmap`

variants, but I find the code more difficult to understand when I have more than three or so columns.

Within the function I make a column for the response variable, repeating \(1\) `num_dead`

times and \(0\) `total - num_dead`

times for each row of the original data. I’m taking advantage of recycling in `data.frame()`

to keep the `plot`

and `group`

columns in the output, as well.

binary_dat = pmap_dfr(dat, function(group, plot, num_dead, total) { data.frame(plot = plot, group = group, dead = c( rep(1, num_dead), rep(0, total - num_dead) ) ) } )

Here are the first 6 rows of this new dataset. You can see for the first plot, `plot1`

, there are five rows of \(1\) and one row of \(0\).

head(binary_dat) # plot group dead # 1 plot1 g1 1 # 2 plot1 g1 1 # 3 plot1 g1 1 # 4 plot1 g1 1 # 5 plot1 g1 0 # 6 plot2 g1 1

This matches the information in the first row of the original dataset.

dat[1, ] # plot group num_dead total # 1 plot1 g1 4 5

# Aside: pmap functions with more columns

My anonymous function in `pmap_dfr()`

works fine in its current form as long as every column is included as a function argument. If I had extra columns that I didn’t want to remove and wasn’t using in the function, however, I would get an error.

To bypass this problem you can add dots, `...`

, to the anonymous function to refer to all other columns not being used.

function(group, plot, num_dead, total, ...)

# Comparing analysis results

While I definitely learned that binomial data can be analyzed in binary format and returns identical results in a GLM class, for some reason I often have to re-convince myself this is true. ???? This is clear when I do an analysis with each dataset and compare results.

## Binomial model

Here’s results from comparing the two groups for the binomial model.

fit = glm( cbind(num_dead, total - num_dead) ~ group, data = dat, family = binomial) summary(fit)$coefficients # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.098612 0.4364358 2.517237 0.0118279240 # groupg2 -2.014903 0.5748706 -3.504968 0.0004566621

## Binary model

The binary model gives identical results for estimates and statistical tests.

fit_binary = glm( dead ~ group, data = binary_dat, family = binomial) summary(fit_binary)$coefficients # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.098612 0.4364354 2.517239 0.0118278514 # groupg2 -2.014903 0.5748701 -3.504971 0.0004566575

# Expanding binomial to binary via nesting

Doing the expansion with nesting plus `purrr::map()`

inside `mutate()`

is another option, although this seems less straightforward to me for this particular case. It does keep the other variables in the dataset, though, without having to manually include them in the output data.frame like I did above.

binary_dat2 = dat %>% nest(data = c(num_dead, total) ) %>% mutate(dead = map(data, ~c( rep(1, .x$num_dead), rep(0, .x$total - .x$num_dead) ) ) ) %>% select(-data) %>% unnest(dead) head(binary_dat2) # # A tibble: 6 x 3 # plot group dead ## 1 plot1 g1 1 # 2 plot1 g1 1 # 3 plot1 g1 1 # 4 plot1 g1 1 # 5 plot1 g1 0 # 6 plot2 g1 1

# Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(purrr) # 0.3.2 library(tidyr) # 1.0.0 library(dplyr) # 0.8.3 dat = structure(list(plot = structure(1:8, .Label = c("plot1", "plot2", "plot3", "plot4", "plot5", "plot6", "plot7", "plot8"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g1", "g2"), class = "factor"), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = "data.frame", row.names = c(NA, -8L)) dat binary_dat = pmap_dfr(dat, function(group, plot, num_dead, total) { data.frame(plot = plot, group = group, dead = c( rep(1, num_dead), rep(0, total - num_dead) ) ) } ) head(binary_dat) dat[1, ] function(group, plot, num_dead, total, ...) fit = glm( cbind(num_dead, total - num_dead) ~ group, data = dat, family = binomial) summary(fit)$coefficients fit_binary = glm( dead ~ group, data = binary_dat, family = binomial) summary(fit_binary)$coefficients binary_dat2 = dat %>% nest(data = c(num_dead, total) ) %>% mutate(dead = map(data, ~c( rep(1, .x$num_dead), rep(0, .x$total - .x$num_dead) ) ) ) %>% select(-data) %>% unnest(dead) head(binary_dat2)

**leave a comment**for the author, please follow the link and comment on their blog:

**Very statisticious on Very statisticious**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.