Workaround for tidyr::spread with duplicate row identifiers

(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

The problem

The tidyverse api is one the easiest APIs in R to follow – thanks the concept of grammer based code. But every now and again you come across an error that boggles your mind, my personal one is the error spread delivers when I have no indexing for my data: Error: Duplicate identifiers for rows (2, 3), (1, 4, 5).... I am sure we have all seen this some or other time.

Well, lets reproduce this error:

library(dplyr)
library(tidyr)
df
## # A tibble: 5 x 2
##   age   gender
##     
## 1 21    Male  
## 2 17    Female
## 3 32    Female
## 4 29    Male  
## 5 15    Male
# Spread
df %>% 
  spread(gender, age)
## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

To anyone doing the analysis, the output that he/she expects is quite trivial, the output should be:

## # A tibble: 3 x 2
##   Female Male 
## *   
## 1 17     21   
## 2 32     29   
## 3    15

But given that there is no index for spread to use and nothing stopping the 2nd and 3rd value for Male to collapse to a single row, spread decides its better to throw an error. There are multiple questions on stackoverflow of people asking how to spread the df without getting the warning. Here are some links:

And the list goes on…

The solution

Luckily I stumbled accross a really nice issue on the tidyr github repo page. The user, markdly, openend the issue, but in the end added a nice workaround for these kind of problems. The idea is to add an index using group_by and row_number and just removing the column after the widening of the data.

df <- df %>% 
  group_by(gender) %>% 
  mutate(grouped_id = row_number())
df
## # A tibble: 5 x 3
## # Groups:   gender [2]
##   age   gender grouped_id
##           
## 1 21    Male            1
## 2 17    Female          1
## 3 32    Female          2
## 4 29    Male            2
## 5 15    Male            3
# Now Spread
df %>% 
  spread(gender, age) %>% 
  select(-grouped_id)
## # A tibble: 3 x 2
##   Female Male 
## *   
## 1 17     21   
## 2 32     29   
## 3    15

Using this method of giving spread an index ensures no duplicated row names are introduced and a rectangular dataset can remain.

I hope that this little gem of an insight into working with spread has saved you some time!

To leave a comment for the author, please follow the link and comment on their blog: Digital Age Economist on Digital Age Economist.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)