Workaround for tidyr::spread with duplicate row identifiers

Posted on May 14, 2018 by Digital Age Economist on Digital Age Economist in R bloggers | 0 Comments

[This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The problem

The tidyverse api is one the easiest APIs in R to follow – thanks the concept of grammer based code. But every now and again you come across an error that boggles your mind, my personal one is the error spread delivers when I have no indexing for my data: Error: Duplicate identifiers for rows (2, 3), (1, 4, 5).... I am sure we have all seen this some or other time.

Well, lets reproduce this error:

library(dplyr)
library(tidyr)
df

## # A tibble: 5 x 2
##   age   gender
##   <chr> <fct> 
## 1 21    Male  
## 2 17    Female
## 3 32    Female
## 4 29    Male  
## 5 15    Male

# Spread
df %>% 
  spread(gender, age)

## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

To anyone doing the analysis, the output that he/she expects is quite trivial, the output should be:

## # A tibble: 3 x 2
##   Female Male 
## * <chr>  <chr>
## 1 17     21   
## 2 32     29   
## 3 <NA>   15

But given that there is no index for spread to use and nothing stopping the 2nd and 3rd value for Male to collapse to a single row, spread decides its better to throw an error. There are multiple questions on stackoverflow of people asking how to spread the df without getting the warning. Here are some links:

And the list goes on…

The solution

Luckily I stumbled accross a really nice issue on the tidyr github repo page. The user, markdly, openend the issue, but in the end added a nice workaround for these kind of problems. The idea is to add an index using group_by and row_number and just removing the column after the widening of the data.

df <- df %>% 
  group_by(gender) %>% 
  mutate(grouped_id = row_number())
df

## # A tibble: 5 x 3
## # Groups:   gender [2]
##   age   gender grouped_id
##   <chr> <fct>       <int>
## 1 21    Male            1
## 2 17    Female          1
## 3 32    Female          2
## 4 29    Male            2
## 5 15    Male            3

# Now Spread
df %>% 
  spread(gender, age) %>% 
  select(-grouped_id)

## # A tibble: 3 x 2
##   Female Male 
## * <chr>  <chr>
## 1 17     21   
## 2 32     29   
## 3 <NA>   15

Using this method of giving spread an index ensures no duplicated row names are introduced and a rectangular dataset can remain.

I hope that this little gem of an insight into working with spread has saved you some time!

To leave a comment for the author, please follow the link and comment on their blog: Digital Age Economist on Digital Age Economist.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Workaround for tidyr::spread with duplicate row identifiers

The problem

The solution

Related

The problem

The solution

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)