Tidylog. Logging your pipelines

Posted on January 20, 2020 by R | TypeThePipe in R bloggers | 0 Comments

[This article was first published on R | TypeThePipe, and kindly contributed to R-bloggers].

Some time ago I made one of my best discoveries in the Tidyverse: a package called tidylog. It is built on top of dplyr and tidyr and gives us feedback on the result of each operation. Stata users will recognise the idea, since Stata already offers this kind of feedback.

When performing one operation at a time, it is easy to track the changes made to a table. However, things get increasingly obscure when chaining multiple functions or dealing with big data frames.

We all love piping operations. I often ‘play’ at performing the whole transformation without leaving the pipe. The downside is that you never see the intermediate states: you can make a big mistake and stay unaware of it until it’s too late, and then you may have to undo some work or rethink your analysis.

In this context, some additional info is always welcome. I think this feature is especially convenient for beginners, but not only for them! I have wasted several hours myself debugging long pipelines and trying to understand where the problems came from.

Let’s see a tiny bit of its behaviour with a simple example:

library(nycflights13)
library(tidyverse)
library(tidylog)  # load after the tidyverse so tidylog's wrappers mask the dplyr/tidyr verbs

flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier) %>%
  # zero-pad single-digit months and days
  mutate(month = if_else(nchar(month) == 1, paste0("0", month), as.character(month)),
         day = if_else(nchar(day) == 1, paste0("0", day), as.character(day))) %>%
  unite("date", year:day, sep = "/", remove = TRUE) %>%
  mutate(date = lubridate::ymd(date)) %>%
  filter(hour >= 8) %>%
  # keep only flights whose tail number has no entry in `planes`
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

# select: dropped 11 variables (dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, …)
# mutate: converted 'month' from integer to character (0 new NA)
#         converted 'day' from integer to character (0 new NA)
# mutate: converted 'date' from character to double (0 new NA)
# filter: removed 50,726 rows (15%), 286,050 rows remaining
# anti_join: added no columns
#            > rows only in x    45,008
#            > rows only in y  (     39)
#            > matched rows    (241,042)
#            >                 =========
#            > rows total        45,008
# count: now 716 rows and 2 columns, ungrouped

Pretty neat! It is especially useful with joins, which are a common source of duplicated or missing rows: tidylog reports how many rows matched and how many were found only in each table.
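
To make the point about joins concrete, here is a minimal sketch with two made-up tables (orders and customers are hypothetical, not part of nycflights13). The lookup table contains an accidental duplicate key, so the left join silently returns more rows than it received. With tidylog loaded, the join prints a summary of matched and unmatched rows that should make this kind of growth easy to spot.

```r
library(dplyr)
library(tidylog)  # load after dplyr so the logging wrappers are the ones called

# Hypothetical example data: three orders, one per customer
orders <- tibble(
  order_id = 1:3,
  customer = c("a", "b", "c")
)

# Hypothetical lookup table: customer "a" appears twice by mistake,
# and customer "c" is missing altogether
customers <- tibble(
  customer = c("a", "a", "b"),
  city     = c("Madrid", "Bilbao", "Sevilla")
)

# tidylog prints its join summary as a message when this line runs
joined <- left_join(orders, customers, by = "customer")

nrow(orders)  # 3 rows went in
nrow(joined)  # 4 rows came out: order 1 was duplicated, order 3 has no city
```

Without the log you would only notice the extra row by checking nrow() yourself, which is exactly the step that tends to get skipped in the middle of a long pipe.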

I decided to write this little post now to celebrate that tidylog v1.0.0 has recently been released! Check the official repo out to see more examples or show some love to @elbersb on Twitter!

All in all, I think this package was a missing piece in the Tidyverse ecosystem: it is incredibly useful, and taking advantage of it is as simple as writing library(tidylog). Integrating this package into our daily R work is a no-brainer!
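
If you want to keep the messages rather than just read them in the console, the tidylog README describes a tidylog.display option that takes a list of functions to send each message to. The sketch below follows that pattern; log_to_file is a helper name of my own choosing, and the temporary file and the mtcars pipeline are only there for illustration. Treat it as a starting point rather than the one right way to do it.

```r
library(dplyr)
library(tidylog)

# Illustrative location for the log; any writable path would do
log_file <- tempfile(fileext = ".log")

# Helper (name chosen for this example) that appends each message to the file
log_to_file <- function(text) {
  cat(text, file = log_file, sep = "\n", append = TRUE)
}

# Send every tidylog message both to the console and to the file
options("tidylog.display" = list(message, log_to_file))

result <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl)

# The file now holds the same lines that were printed to the console
readLines(log_file)

# Setting the option back to NULL restores the default behaviour
options("tidylog.display" = NULL)
```

To silence a single step instead, you can call the original verb directly, for example dplyr::filter(), which bypasses the tidylog wrapper.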
