Tidyverse Series – dplyr – Episode 1

[This article was first published on R Blogs, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In this series, we are going to talk about data manipulation in R. The data we obtain or collect from various sources is usually unstructured or comprises information that is futile for our purpose. Therefore, spending time on the entire dataset is needless and inessential.

So, Data Manipulation is the key. We transform our data according to our needs which is significant for robust analysis. We manipulate data for further easy processing.

R has many built-in libraries and functions for our desired task.

In this episode, we start with the widely used tidyverse which is an opinionated collection of R packages designed for data science.

tidyverse

The tidyverse is an umbrella term comprising many packages for data interaction. Let’s learn its installation process which similar to any other installation.

Also load it into your R session.

#install.packages("tidyverse")

library(tidyverse)

There are four main packages for Data wrangling and Transformation:
* dplyr
* tidyr
* stringr
* forcats

We are going to learn about the dplyr package in depth in this episode.

dplyr

dplyr is a grammer of data manipulation providing a consistent set of verbs that solve the most common data manipulation challenges.

#install.packages("dplyr")

library(dplyr)

Single Table Verbs
This tutorial guides you through dplyr’s basic set of tools.

Choosing dataset

Using data() to reveal in-built datasets.
In this tutorial, we are going to use the Titanic dataset.

#data()

#Loading the dataset

data("Titanic")

Viewing the description of the dataset

str(Titanic)
##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

Creating a local dataset for easier working

df_titanic <- as.data.frame(Titanic)

Observations x Features

dim(df_titanic)
## [1] 32  5

The pipe Function - a better alternative

All the functions take the dataset as the first argument. Using the %>% operator that is the pipe operator makes our working easier. For instance, an argument or the dataset x in some function(x,y) becomes x %>% function(y) which means that results from previous steps can be piped into another or the next steps.

Single Table Verbs

  • Functions manipulating the dataset with rows
1. filter(): subsetting our observations/rows based on filtered colums/features

How many children survived in the tragedy?

df_titanic %>% filter(Age == "Child")
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Child      Yes    5
## 10   2nd   Male Child      Yes   11
## 11   3rd   Male Child      Yes   13
## 12  Crew   Male Child      Yes    0
## 13   1st Female Child      Yes    1
## 14   2nd Female Child      Yes   13
## 15   3rd Female Child      Yes   14
## 16  Crew Female Child      Yes    0

How many adult females from the 3rd class survived the tragedy?

df_titanic %>% filter(Sex=="Female",Age == "Adult",Class=="3rd")
##   Class    Sex   Age Survived Freq
## 1   3rd Female Adult       No   89
## 2   3rd Female Adult      Yes   76
2. arrange() : subsetting our observations/rows based on arranging colums/features in unique or different order

Arranging rows according to the increasing order of number of people who survived:

df_titanic %>% arrange(Freq)
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3   Crew   Male Child       No    0
## 4    1st Female Child       No    0
## 5    2nd Female Child       No    0
## 6   Crew Female Child       No    0
## 7   Crew   Male Child      Yes    0
## 8   Crew Female Child      Yes    0
## 9    1st Female Child      Yes    1
## 10  Crew Female Adult       No    3
## 11   1st Female Adult       No    4
## 12   1st   Male Child      Yes    5
## 13   2nd   Male Child      Yes   11
## 14   2nd Female Adult       No   13
## 15   3rd   Male Child      Yes   13
## 16   2nd Female Child      Yes   13
## 17   3rd Female Child      Yes   14
## 18   2nd   Male Adult      Yes   14
## 19   3rd Female Child       No   17
## 20  Crew Female Adult      Yes   20
## 21   3rd   Male Child       No   35
## 22   1st   Male Adult      Yes   57
## 23   3rd   Male Adult      Yes   75
## 24   3rd Female Adult      Yes   76
## 25   2nd Female Adult      Yes   80
## 26   3rd Female Adult       No   89
## 27   1st   Male Adult       No  118
## 28   1st Female Adult      Yes  140
## 29   2nd   Male Adult       No  154
## 30  Crew   Male Adult      Yes  192
## 31   3rd   Male Adult       No  387
## 32  Crew   Male Adult       No  670
#in the descending order
df_titanic %>% arrange(desc(Freq))
##    Class    Sex   Age Survived Freq
## 1   Crew   Male Adult       No  670
## 2    3rd   Male Adult       No  387
## 3   Crew   Male Adult      Yes  192
## 4    2nd   Male Adult       No  154
## 5    1st Female Adult      Yes  140
## 6    1st   Male Adult       No  118
## 7    3rd Female Adult       No   89
## 8    2nd Female Adult      Yes   80
## 9    3rd Female Adult      Yes   76
## 10   3rd   Male Adult      Yes   75
## 11   1st   Male Adult      Yes   57
## 12   3rd   Male Child       No   35
## 13  Crew Female Adult      Yes   20
## 14   3rd Female Child       No   17
## 15   3rd Female Child      Yes   14
## 16   2nd   Male Adult      Yes   14
## 17   2nd Female Adult       No   13
## 18   3rd   Male Child      Yes   13
## 19   2nd Female Child      Yes   13
## 20   2nd   Male Child      Yes   11
## 21   1st   Male Child      Yes    5
## 22   1st Female Adult       No    4
## 23  Crew Female Adult       No    3
## 24   1st Female Child      Yes    1
## 25   1st   Male Child       No    0
## 26   2nd   Male Child       No    0
## 27  Crew   Male Child       No    0
## 28   1st Female Child       No    0
## 29   2nd Female Child       No    0
## 30  Crew Female Child       No    0
## 31  Crew   Male Child      Yes    0
## 32  Crew Female Child      Yes    0
3. slice() : indexing rows/observations using their integer locations
#rows between 6 to 27
df_titanic %>% slice(6:27)
##    Class    Sex   Age Survived Freq
## 1    2nd Female Child       No    0
## 2    3rd Female Child       No   17
## 3   Crew Female Child       No    0
## 4    1st   Male Adult       No  118
## 5    2nd   Male Adult       No  154
## 6    3rd   Male Adult       No  387
## 7   Crew   Male Adult       No  670
## 8    1st Female Adult       No    4
## 9    2nd Female Adult       No   13
## 10   3rd Female Adult       No   89
## 11  Crew Female Adult       No    3
## 12   1st   Male Child      Yes    5
## 13   2nd   Male Child      Yes   11
## 14   3rd   Male Child      Yes   13
## 15  Crew   Male Child      Yes    0
## 16   1st Female Child      Yes    1
## 17   2nd Female Child      Yes   13
## 18   3rd Female Child      Yes   14
## 19  Crew Female Child      Yes    0
## 20   1st   Male Adult      Yes   57
## 21   2nd   Male Adult      Yes   14
## 22   3rd   Male Adult      Yes   75
#first 3 rows
df_titanic %>% slice_head(n=3)
##   Class  Sex   Age Survived Freq
## 1   1st Male Child       No    0
## 2   2nd Male Child       No    0
## 3   3rd Male Child       No   35
#for selecting observation at random
df_titanic %>% slice_sample(n=6)
##   Class    Sex   Age Survived Freq
## 1   2nd   Male Child       No    0
## 2   3rd Female Adult      Yes   76
## 3   2nd Female Adult       No   13
## 4   2nd Female Adult      Yes   80
## 5   1st   Male Child       No    0
## 6   1st   Male Adult      Yes   57
#rows with top five values of Freq
df_titanic %>% slice_max(Freq,n=5)
##   Class    Sex   Age Survived Freq
## 1  Crew   Male Adult       No  670
## 2   3rd   Male Adult       No  387
## 3  Crew   Male Adult      Yes  192
## 4   2nd   Male Adult       No  154
## 5   1st Female Adult      Yes  140
  • Functions manipulating the dataset with columns
1. select() : selecting or choosing a few columns while discarding others that are of no use thereby reducing our valuable time

How many males and females survived the tragedy?

#shows only two columns
df_titanic %>% select(Sex,Survived)
##       Sex Survived
## 1    Male       No
## 2    Male       No
## 3    Male       No
## 4    Male       No
## 5  Female       No
## 6  Female       No
## 7  Female       No
## 8  Female       No
## 9    Male       No
## 10   Male       No
## 11   Male       No
## 12   Male       No
## 13 Female       No
## 14 Female       No
## 15 Female       No
## 16 Female       No
## 17   Male      Yes
## 18   Male      Yes
## 19   Male      Yes
## 20   Male      Yes
## 21 Female      Yes
## 22 Female      Yes
## 23 Female      Yes
## 24 Female      Yes
## 25   Male      Yes
## 26   Male      Yes
## 27   Male      Yes
## 28   Male      Yes
## 29 Female      Yes
## 30 Female      Yes
## 31 Female      Yes
## 32 Female      Yes

But,

#shows all the columns from 'sex' through to 'survived'(inclusive)
df_titanic %>% select(Sex:Survived)
##       Sex   Age Survived
## 1    Male Child       No
## 2    Male Child       No
## 3    Male Child       No
## 4    Male Child       No
## 5  Female Child       No
## 6  Female Child       No
## 7  Female Child       No
## 8  Female Child       No
## 9    Male Adult       No
## 10   Male Adult       No
## 11   Male Adult       No
## 12   Male Adult       No
## 13 Female Adult       No
## 14 Female Adult       No
## 15 Female Adult       No
## 16 Female Adult       No
## 17   Male Child      Yes
## 18   Male Child      Yes
## 19   Male Child      Yes
## 20   Male Child      Yes
## 21 Female Child      Yes
## 22 Female Child      Yes
## 23 Female Child      Yes
## 24 Female Child      Yes
## 25   Male Adult      Yes
## 26   Male Adult      Yes
## 27   Male Adult      Yes
## 28   Male Adult      Yes
## 29 Female Adult      Yes
## 30 Female Adult      Yes
## 31 Female Adult      Yes
## 32 Female Adult      Yes

And this code below deselects the mentioned columns as arguments:

df_titanic %>% select(-c(Sex,Age))
##    Class Survived Freq
## 1    1st       No    0
## 2    2nd       No    0
## 3    3rd       No   35
## 4   Crew       No    0
## 5    1st       No    0
## 6    2nd       No    0
## 7    3rd       No   17
## 8   Crew       No    0
## 9    1st       No  118
## 10   2nd       No  154
## 11   3rd       No  387
## 12  Crew       No  670
## 13   1st       No    4
## 14   2nd       No   13
## 15   3rd       No   89
## 16  Crew       No    3
## 17   1st      Yes    5
## 18   2nd      Yes   11
## 19   3rd      Yes   13
## 20  Crew      Yes    0
## 21   1st      Yes    1
## 22   2nd      Yes   13
## 23   3rd      Yes   14
## 24  Crew      Yes    0
## 25   1st      Yes   57
## 26   2nd      Yes   14
## 27   3rd      Yes   75
## 28  Crew      Yes  192
## 29   1st      Yes  140
## 30   2nd      Yes   80
## 31   3rd      Yes   76
## 32  Crew      Yes   20

Let us learn some of the helper functions that are used alomg with the select() function:

  • starts_with()
  • ends_with()
  • matches()
  • contains()

For instance:

df_titanic %>% select(starts_with("S"))
##       Sex Survived
## 1    Male       No
## 2    Male       No
## 3    Male       No
## 4    Male       No
## 5  Female       No
## 6  Female       No
## 7  Female       No
## 8  Female       No
## 9    Male       No
## 10   Male       No
## 11   Male       No
## 12   Male       No
## 13 Female       No
## 14 Female       No
## 15 Female       No
## 16 Female       No
## 17   Male      Yes
## 18   Male      Yes
## 19   Male      Yes
## 20   Male      Yes
## 21 Female      Yes
## 22 Female      Yes
## 23 Female      Yes
## 24 Female      Yes
## 25   Male      Yes
## 26   Male      Yes
## 27   Male      Yes
## 28   Male      Yes
## 29 Female      Yes
## 30 Female      Yes
## 31 Female      Yes
## 32 Female      Yes
2. mutate() : It is basically used to add new columns which are created using the existing columns as the new columns that are to be added i.e mutated are functions of the existng columns

Suppose there were about a thousand people on board, then what percenatge of people survived?

#Assume total = 1000 thus 100/1000 for percentage i.e 0.1
df_titanic %>% mutate(Freq_per =  Freq * 0.1)
##    Class    Sex   Age Survived Freq Freq_per
## 1    1st   Male Child       No    0      0.0
## 2    2nd   Male Child       No    0      0.0
## 3    3rd   Male Child       No   35      3.5
## 4   Crew   Male Child       No    0      0.0
## 5    1st Female Child       No    0      0.0
## 6    2nd Female Child       No    0      0.0
## 7    3rd Female Child       No   17      1.7
## 8   Crew Female Child       No    0      0.0
## 9    1st   Male Adult       No  118     11.8
## 10   2nd   Male Adult       No  154     15.4
## 11   3rd   Male Adult       No  387     38.7
## 12  Crew   Male Adult       No  670     67.0
## 13   1st Female Adult       No    4      0.4
## 14   2nd Female Adult       No   13      1.3
## 15   3rd Female Adult       No   89      8.9
## 16  Crew Female Adult       No    3      0.3
## 17   1st   Male Child      Yes    5      0.5
## 18   2nd   Male Child      Yes   11      1.1
## 19   3rd   Male Child      Yes   13      1.3
## 20  Crew   Male Child      Yes    0      0.0
## 21   1st Female Child      Yes    1      0.1
## 22   2nd Female Child      Yes   13      1.3
## 23   3rd Female Child      Yes   14      1.4
## 24  Crew Female Child      Yes    0      0.0
## 25   1st   Male Adult      Yes   57      5.7
## 26   2nd   Male Adult      Yes   14      1.4
## 27   3rd   Male Adult      Yes   75      7.5
## 28  Crew   Male Adult      Yes  192     19.2
## 29   1st Female Adult      Yes  140     14.0
## 30   2nd Female Adult      Yes   80      8.0
## 31   3rd Female Adult      Yes   76      7.6
## 32  Crew Female Adult      Yes   20      2.0
3. relocate() : basically relocating or moving columns

In the code below, the columns between ‘sex’ and ‘Age’ (both inclusive) are placed before the column ‘Class’ using the .before attribute.

df_titanic %>% relocate(Sex:Age, .before = Class)
##       Sex   Age Class Survived Freq
## 1    Male Child   1st       No    0
## 2    Male Child   2nd       No    0
## 3    Male Child   3rd       No   35
## 4    Male Child  Crew       No    0
## 5  Female Child   1st       No    0
## 6  Female Child   2nd       No    0
## 7  Female Child   3rd       No   17
## 8  Female Child  Crew       No    0
## 9    Male Adult   1st       No  118
## 10   Male Adult   2nd       No  154
## 11   Male Adult   3rd       No  387
## 12   Male Adult  Crew       No  670
## 13 Female Adult   1st       No    4
## 14 Female Adult   2nd       No   13
## 15 Female Adult   3rd       No   89
## 16 Female Adult  Crew       No    3
## 17   Male Child   1st      Yes    5
## 18   Male Child   2nd      Yes   11
## 19   Male Child   3rd      Yes   13
## 20   Male Child  Crew      Yes    0
## 21 Female Child   1st      Yes    1
## 22 Female Child   2nd      Yes   13
## 23 Female Child   3rd      Yes   14
## 24 Female Child  Crew      Yes    0
## 25   Male Adult   1st      Yes   57
## 26   Male Adult   2nd      Yes   14
## 27   Male Adult   3rd      Yes   75
## 28   Male Adult  Crew      Yes  192
## 29 Female Adult   1st      Yes  140
## 30 Female Adult   2nd      Yes   80
## 31 Female Adult   3rd      Yes   76
## 32 Female Adult  Crew      Yes   20
  • Functions manipulating the dataset with group of data
1. summarise() : collapses a data frame to a single row

Using mean of the Freq column and removing any NA values that exist

df_titanic %>% summarise(Freq = mean(Freq, na.rm = T))
##       Freq
## 1 68.78125

Note: The similarities in all of the above functions is their syntax like the dataset as their first argument which can be piped as well and the next set of arguments as to what action has to be performed on the dataset, etc.

So, this is it. We chose a simple, small dataset to start with and learnt how to manipulate data using a few simple functions.

You can chose any other dataset from the in-built ones or from many websites offering datasets for free. Here are some of the sites that I recommend:
Gapminder
Data.gov
Data.world

Stay tuned for more tutorials!
Thank You!

To leave a comment for the author, please follow the link and comment on their blog: R Blogs.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)