
There seems to be some revolution going on in the R sphere… people seem to be jumping at what is commonly known as the tidyverse, a collection of packages developed and maintained by the Chief Scientist of RStudio, Hadley Wickham.

In this post, I explain what the tidyverse is and why I resist using it, so read on!

Ok, so this post is going to be controversial, I am fully aware of that. The easiest way to deal with it if you are a fan of the tidyverse is to put it into the category “this guy is a dinosaur and hasn’t yet got the point of it all”… Fine, this might very well be the case and I cannot guarantee that I will not change my mind in the future, so bear with me as I share some of my musings on the topic as I feel about it today… and do not hesitate to comment below!

According to his own website, the tidyverse is an opinionated collection of R packages designed for data science [highlighting my own]. “Opinionated”… when you google that word it says:

characterized by conceited assertiveness and dogmatism.
“an arrogant and opinionated man”

If you ask me it is no coincidence that this is the first statement on the webpage!

Before continuing, I want to make clear that I believe that Hadley Wickham does what he does out of a strong commitment to the R community and that his motivations are well-meaning. He obviously is also a person who is almost eerily productive (and to add the obvious: RStudio is a fantastic integrated development environment (IDE) which has yet to find its equal in the Python world!). Having said that, I think the tidyverse is creating some conflict within the community which in the end could have detrimental ramifications:

The tidyverse is creating some meta-layer on top of Base R, which changes the character of the language considerably. Just take the highly praised pipe operator %>%:

# Base R
temp <- mean(c(123, 987, 756))
temp
## [1] 622

# tidyverse
library(dplyr)
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
temp <- c(123, 987, 756) %>% mean
temp
## [1] 622


The problem I have with this is that the direction of the data flow is totally inconsistent: it starts in the middle with the numeric vector, goes to the right into the mean function (by the pipe operator %>%) and after that back to the left into the variable (by the assignment operator <-). It is not only longer but also less clear in my opinion.

I know fans of the tidyverse will hasten to add that it can make code clearer when you have many nested functions but I would counter that there are also other ways to make your code clearer in this regard, e.g. by separating the functions into different lines of code, each with an assignment operator… which used to be the standard way!
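To make that alternative concrete, here is a small sketch of my own (the vector x and the intermediate names avg and root are made up purely for illustration): the same computation, once nested and once split into separate lines, each with an assignment operator.

```r
# Hypothetical example data, chosen only for illustration
x <- c(123, 987, 756)

# Nested: has to be read from the inside out
nested <- round(sqrt(mean(x)), 2)

# Stepwise: one function per line, each with an assignment
avg <- mean(x)
root <- sqrt(avg)
stepwise <- round(root, 2)

# Both routes give the same result
identical(nested, stepwise)
## [1] TRUE
```

The stepwise version costs a few extra names, but every intermediate result can be inspected on its own, and no additional package is needed.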

But I guess my main point is that R is becoming a different beast this way: we all know that R – as any programming language – has its quirks and idiosyncrasies. The same holds true for the tidyverse (remember: any!). My philosophy has always been to keep any programming language as pure as possible, which doesn’t mean that you have to program everything from scratch… it means that you should only e.g. add packages for functional requirements and only very cautiously for structural ones.

This is, by the way, one of my criticisms of Python: you have the basic language but in order to do serious data science you need all kinds of additional packages, which change the structure of the language (to read more on that see here: Why R for Data Science – and not Python?)!

In the end you will in most cases have some kind of strange mixture of the different data and programming approaches, which makes the whole thing even messier. As a professor, I also see the difficulties in teaching that stuff without totally confusing my students. This is often the problem with Python + NumPy + SciPy + pandas + scikit-learn + Matplotlib and I see the same kind of problems with R + ggplot2 + dplyr + tidyr + readr + purrr + tibble + stringr + forcats!

On top of that, the ever-growing complexity is a problem because of all the dependencies. I am always skeptical of code where dozens of packages have to be installed and loaded first. Even in the simple code above, just by loading the dplyr package (which is only one out of the eight tidyverse packages), several functions from base R and its stats package are masked: filter, lag, intersect, setdiff, setequal and union.
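If you have to live with such masking, one way to stay on the safe side is to call the original function with its namespace prefix. The following is only a minimal sketch with made-up data (the vector x and the name ma are my own), and it deliberately runs without dplyr being installed; the commented-out line assumes R 3.6.0 or later and an installed dplyr.

```r
# Hypothetical example data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)

# stats::filter() is the linear filter for time series; with the explicit
# namespace prefix it is found no matter which packages get attached later
ma <- stats::filter(x, rep(1 / 3, 3))  # centred moving average of width 3
as.numeric(ma)
## [1] NA  2  3  4 NA

# Assumption: R >= 3.6.0 and dplyr installed. This would attach dplyr
# without letting it mask filter() and lag():
# library(dplyr, exclude = c("filter", "lag"))
```

Calling conflicts(detail = TRUE) after loading your packages lists which names are currently masked and by which package.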

In a way, the tidyverse feels (at least to me) like some kind of land grab, some kind of takeover. It is almost like a religion… and that I do not like! This is different with other popular packages, like Rcpp: with Rcpp you do the same stuff but faster… with the tidyverse you do the same stuff but only differently (I know, in some cases it is faster as well but that is often not the reason for using it… contrary to the excellent data.table package)!

One final thought: Hadley Wickham was asked the following question in 2016 (source: Quora):

Do you expect the tidyverse to be the part of core R packages someday?

It’s extremely unlikely because the core packages are extremely conservative so that base R code is stable, and backward compatible. I prefer to have a more utopian approach where I can be quite aggressive about making backward-incompatible changes while trying to figure out a better API.

Wow, that is something! To be honest with you: when it comes to software I like conservative! I like stable! I like backward compatible, especially in a production environment!

Everybody can (and should) do their own experiments “to figure out a better [whatever]” but please never touch a running system (if it ain’t broke don’t fix it!), especially not when millions of critical business and science applications depend on it!

Ok, so this was my little rant… now it is your turn to shoot!