Calculating ratios with Tidyverse

Posted on May 12, 2020 by R | TypeThePipe in R bloggers | 0 Comments

[This article was first published on R | TypeThePipe, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Calculating percentages is a fairly common operation, right? However, doing it without leaving the pipeflow always force me to do some bizarre piping such as double grouping and summarise.

I am using again the nuclear accidents dataset, and trying to calculate the percentage of accidents that happened in Europe each year.

nuclear_accidents <- read_csv("https://query.data.world/s/47s7katrhxxd674ulus425k42l5u4s") 

nuclear_accidents <- nuclear_accidents %>% 
 select(-Description) %>% 
 mutate(Year = Date %>% mdy() %>% year(),
 In_Europe = if_else(Region %in% c("EE", "WE"), T, F) %>% as.factor()) %>% 
 filter(Year %>% between(1989, 2016))

nuclear_accidents %>% head(4) %>% kable()

Date	Location	Cost (millions 2013US$)	INES	Smyth Magnitude	Region	Fatalities	Year	In_Europe
3/11/2011	Fukushima Prefecture, Japan	166089.0	7	7.5	A	573	2011	FALSE
12/8/1995	Tsuruga, Japan	15500.0	NA	NA	A	0	1995	FALSE
12/19/1989	Vandellòs, Spain	930.9	3	NA	WE	0	1989	TRUE
2/1/2010	Vernon, Vermont, United States	808.9	NA	NA	NA	0	2010	FALSE

This can be achieved by several ways. One common path would be:

nuclear_accidents %>% 
 group_by(Year, In_Europe) %>% 
 summarise(N = n()) %>% 
 group_by(Year) %>% 
 mutate(Total_per_year = sum(N), 
 Ratio = round(N/Total_per_year, 2)) %>% 
 head(4) %>% 
 kable()

Year	In_Europe	N	Total_per_year	Ratio
1989	FALSE	4	6	0.67
1989	TRUE	2	6	0.33
1990	FALSE	1	1	1.00
1991	FALSE	3	3	1.00

Another one more bizarre would be totalizing first, then grouping including that amount (to avoid being dropped) and then summarise.

nuclear_accidents %>% 
 group_by(Year) %>% 
 mutate(Total_per_year = n()) %>% 
 group_by(Year, In_Europe, Total_per_year) %>% 
 summarise(N = n()) %>% 
 mutate(Ratio = round(N/Total_per_year, 2)) %>% 
 head(4) %>% 
 kable()

Year	In_Europe	Total_per_year	N	Ratio
1989	FALSE	6	4	0.67
1989	TRUE	6	2	0.33
1990	FALSE	1	1	1.00
1991	FALSE	3	3	1.00

Kind of weird. However, there is a much simpler way:

nuclear_accidents %>% 
 group_by(Year, In_Europe) %>%
 summarize(N = n()) %>%
 mutate(Ratio = round(N / sum(N), 2)) %>% 
 head(4) %>% 
 kable()

Year	In_Europe	N	Ratio
1989	FALSE	4	0.67
1989	TRUE	2	0.33
1990	FALSE	1	1.00
1991	FALSE	3	1.00

The first time I saw this result I didn’t understand it because if you have your dataframe grouped by Year and In_Europe then sum(N) should be equal to N. What is going on? This behaviour has to do with a tricky funcionality of summarise.

Take a closer look of the grouping variables at the console output. Before the summarise function the dataframe seems grouped normally and the operation will be performed within each group:

nuclear_accidents %>% 
 group_by(Year, In_Europe) %>% 
 head(4)
# A tibble: 4 x 9
# Groups: Year, In_Europe [4]
# Date Location `Cost (millions ~ INES `Smyth Magnitud~ Region Fatalities Year In_Europe
# <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <fct> 
# 1 3/11/~ Fukushima ~ 166089 7 7.5 A 573 2011 FALSE 
# 2 12/8/~ Tsuruga, J~ 15500 NA NA A 0 1995 FALSE 
# 3 12/19~ Vandellòs,~ 931. 3 NA WE 0 1989 TRUE 
# 4 2/1/2~ Vernon, Ve~ 809. NA NA NA 0 2010 FALSE

However, once the dataframe is summarized, the resulting dataframe is no longer grouped by the same original variables:

nuclear_accidents %>% 
 group_by(Year, In_Europe) %>%
 summarize(N = n()) %>% 
 head(4)
# A tibble: 4 x 3
# Groups: Year [3]
# Year In_Europe N
# <dbl> <fct> <int>
# 1 1989 FALSE 4
# 2 1989 TRUE 2
# 3 1990 FALSE 1
# 4 1991 FALSE 3

Actually, the default behaviour of summarise is to drop the last group. The reason behind that is that, once the operation is performed you should have only one obervation per group, and it has no sense to grouping by it anymore. That’s why the last example I show above works. Now you can take advantage of it too!

Furthermore, you can learn more about the dplyr 1.0.0 last minute additions which include an explicit message to highlight the behaviour we have talked about here.

To leave a comment for the author, please follow the link and comment on their blog: R | TypeThePipe.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Calculating ratios with Tidyverse

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)