Treating your data: The old school vs tidyverse modern tools

[This article was first published on R – insightR, and kindly contributed to R-bloggers.]

By Gabriel Vasconcelos

When I first started using R there was no such thing as the tidyverse. Although some of the tidyverse packages were available independently, I learned to treat my data mostly by brute force, combining pieces of information I had from several sources. It is very interesting to compare this old school style of programming with the tidyverse style, which chains operations with the pipe from the magrittr package. Even if you want to stay old school, the tidyverse is here to stay, and it is the first tool taught in many data science courses based on R.

My objective is to show a very simple example comparing the two ways of writing. There are several ways to do what I am going to propose here, but I think this example is enough to capture the main differences between old school code and magrittr plus tidyverse. Magrittr is not new, but it seems to me that it is more popular now because of the tidyverse.

To the example

I am going to generate a very simple data set in which two variables are indexed by letters. My objective is to sum each of the two variables over only the rows whose letter is a vowel.

set.seed(123)
M = 1000
db1 = data.frame(id = sample(letters, M, replace = TRUE), v1 = rnorm(M), v2 = rnorm(M))
vowels = c("a", "e", "i", "o", "u")
head(db1)

##   id          v1         v2
## 1  h -0.60189285 -0.8209867
## 2  u -0.99369859 -0.3072572
## 3  k  1.02678506 -0.9020980
## 4  w  0.75106130  0.6270687
## 5  y -1.50916654  1.1203550
## 6  b -0.09514745  2.1272136

The first strategy (old school) to solve this problem is to use aggregate and then some manipulation. First I aggregate the variables to get the sum for each letter, then I select the vowels and use colSums to get the final result.

ag1 = aggregate( . ~ id, data = db1, FUN = sum)
ag1 = ag1[ag1$id %in% vowels, ]
ag1 = colSums(ag1[, -1])
ag1

##        v1        v2
## 26.656837  6.644839
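
As noted earlier, there are several ways to get this result in base R. The sketch below is one alternative, not the method used above: it keeps the vowel rows first and sums the columns directly, then checks that both routes agree. The name ag1_direct is introduced only for this illustration, and the data are rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained.
set.seed(123)
M <- 1000
db1 <- data.frame(id = sample(letters, M, replace = TRUE),
                  v1 = rnorm(M), v2 = rnorm(M))
vowels <- c("a", "e", "i", "o", "u")

# Route used in the text: aggregate by letter, keep the vowels, sum the columns.
ag1 <- aggregate(. ~ id, data = db1, FUN = sum)
ag1 <- colSums(ag1[ag1$id %in% vowels, -1])

# Alternative (hypothetical name): keep the vowel rows first, then sum the columns.
ag1_direct <- colSums(db1[db1$id %in% vowels, c("v1", "v2")])

# Both routes should give the same named vector, up to floating point error.
all.equal(ag1, ag1_direct)
```

Filtering first skips the per-letter sums entirely, which is fine here because only the vowel total is needed.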

The second strategy (tidyverse) uses functions from the dplyr package and the forward-pipe operator (%>%) from the magrittr package. The forward-pipe passes the result of each step on to the next one, so we can chain many operations and get the final result in a single expression. We do not need to create auxiliary objects like I did in the previous example. The first two lines do precisely the same as the aggregate: group_by defines the variable used to create the groups, and summarise tells R which function to apply within each group. In the third line I select only the rows corresponding to vowels, and the last summarize sums each variable. As you can see, the results are the same. This approach generated an object type called tibble, which is a special type of data frame from the tidyverse with some different features, like not converting strings to factors.

library(tidyverse)

ag2 = group_by(db1, id) %>%
  summarise(v1 = sum(v1), v2 = sum(v2)) %>%
  filter(id %in% vowels) %>%
  summarize(v1 = sum(v1), v2 = sum(v2))

ag2

## # A tibble: 1 x 2
##         v1       v2
##      <dbl>    <dbl>
## 1 26.65684 6.644839
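
The pipeline above can also be written more compactly. The following is a sketch under two assumptions that are not part of the original post: dplyr 1.0.0 or later is installed (for across()), and the data frame is converted with as_tibble() so the result is still a tibble. It filters the vowels first and sums both columns without naming each one.

```r
library(dplyr)

# Rebuild the example data so this block is self-contained.
set.seed(123)
M <- 1000
db1 <- data.frame(id = sample(letters, M, replace = TRUE),
                  v1 = rnorm(M), v2 = rnorm(M))
vowels <- c("a", "e", "i", "o", "u")

# Assumption: dplyr >= 1.0.0, which provides across().
ag2 <- as_tibble(db1) %>%
  filter(id %in% vowels) %>%            # keep only the vowel rows
  summarise(across(c(v1, v2), sum))     # apply sum to both columns

ag2
class(ag2)  # includes "tbl_df", the tibble class
```

Without the as_tibble() call, filter and summarise on a plain data frame would return a plain data frame; in the original pipeline it is group_by that produces the tibble.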

The same thing using merge

Suppose that we want to do the same thing as the previous example but now we are dealing with two data frames: the one from the previous example and a second data frame of characteristics that will tell us which letters are vowels.

aux = rep("consonant", length(letters))
aux[letters %in% vowels] = "vowel"
db2 = data.frame(id = letters, type = aux)
head(db2)

##   id      type
## 1  a     vowel
## 2  b consonant
## 3  c consonant
## 4  d consonant
## 5  e     vowel
## 6  f consonant

The first approach uses merge to combine the two data frames and then sums the observations that have type == "vowel".

merge1 = merge(db1, db2, by = "id")
head(merge1)

##   id          v1         v2  type
## 1  a -0.73657823  1.1903106 vowel
## 2  a  0.07987382 -1.1058145 vowel
## 3  a -1.20086933  0.4859824 vowel
## 4  a  0.32040231 -0.6196151 vowel
## 5  a -0.69493683 -1.0387278 vowel
## 6  a  0.15735335  1.6165776 vowel

merge1 = colSums(merge1[merge1$type == "vowel", c("v1", "v2")])
merge1

##        v1        v2
## 26.656837  6.644839

The second approach uses the function inner_join from the dplyr package, then filters the observations whose type is vowel and uses summarise to sum them.

merge2 = inner_join(db1, db2, by = "id") %>%
  filter(type == "vowel") %>%
  summarise(v1 = sum(v1), v2 = sum(v2))
merge2

##         v1       v2
## 1 26.65684 6.644839
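
One practical difference between the two joins is row order: merge sorts the result by the key column by default, while inner_join keeps the rows in the order of the first data frame. The sketch below illustrates this and checks that the vowel sums still agree. The object names m_base, m_dplyr, sum_base and sum_dplyr are made up for this illustration, and db2 is built with ifelse as a shortcut rather than as in the post.

```r
library(dplyr)

# Rebuild both data frames so this block is self-contained.
set.seed(123)
M <- 1000
db1 <- data.frame(id = sample(letters, M, replace = TRUE),
                  v1 = rnorm(M), v2 = rnorm(M))
vowels <- c("a", "e", "i", "o", "u")
db2 <- data.frame(id = letters,
                  type = ifelse(letters %in% vowels, "vowel", "consonant"))

m_base  <- merge(db1, db2, by = "id")       # rows come back sorted by id
m_dplyr <- inner_join(db1, db2, by = "id")  # rows stay in db1's original order

head(m_base$id)
head(m_dplyr$id)

# The row order differs, but the sums over the vowel rows do not.
sum_base  <- colSums(m_base[m_base$type == "vowel", c("v1", "v2")])
sum_dplyr <- m_dplyr %>%
  filter(type == "vowel") %>%
  summarise(v1 = sum(v1), v2 = sum(v2))

all.equal(sum_base, unlist(sum_dplyr))
```

The order does not matter for a sum, but it can matter if later steps rely on row positions.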

As you can see, the two ways of writing are very different. Naturally, there is some cost in moving from old school code to tidyverse code. However, the second style makes your code easier to read: it is part of the tidyverse philosophy to write code that can be read by humans. For example, something like this:

x = 1:10
sum(log(sqrt(x)))

## [1] 7.552206

becomes something like this if you use the forward-pipe:

x %>% sqrt() %>% log() %>% sum()

## [1] 7.552206
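
The same left-to-right style is also available without loading any package, through the native pipe |> that was added to base R in version 4.1.0. This is a sketch that assumes R 4.1.0 or later; it is not part of the original example, and the names nested and piped are only for illustration.

```r
x <- 1:10

# Nested call, read from the inside out.
nested <- sum(log(sqrt(x)))

# Native pipe (assumption: R >= 4.1.0), read from left to right.
piped <- x |> sqrt() |> log() |> sum()

all.equal(nested, piped)
```

For simple chains like this one, |> and %>% behave the same way; they differ in some details, such as how a placeholder for the piped value is written.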

For more information check out the tidyverse website and the R for Data Science book, which is available for free online.

