Data Wrangling with dplyr – Part 3

[This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In the previous post, we learnt to combine tables using dplyr. In this post, we will explore a set of helper functions in order to:

  • extract unique rows
  • rename columns
  • sample data
  • extract columns
  • slice rows
  • arrange rows
  • compare tables
  • extract/mutate data using predicate functions
  • count observations for different levels of a variable

Libraries, Code & Data

We will use the following packages:

The data sets can be downloaded from here and the codes from here.

library(dplyr)
library(readr)

Case Study

Let us look at a case study (e-commerce data) and see how we can use dplyr helper functions to answer questions we have about and to modify/transform the underlying data set.

Data

ecom <- 
  read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv',
    col_types = cols_only(device = col_factor(levels = c("laptop", "tablet", "mobile")),
      referrer = col_factor(levels = c("bing", "direct", "social", "yahoo", "google")),
      purchase = col_logical(), bouncers = col_logical(), duration = col_double(),
      n_visit = col_double(), n_pages = col_double()
    )
  )

ecom
## # A tibble: 1,000 x 7
##    referrer device bouncers n_visit n_pages duration purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
##  1 google   laptop TRUE          10       1      693 FALSE   
##  2 yahoo    tablet TRUE           9       1      459 FALSE   
##  3 direct   laptop TRUE           0       1      996 FALSE   
##  4 bing     tablet FALSE          3      18      468 TRUE    
##  5 yahoo    mobile TRUE           9       1      955 FALSE   
##  6 yahoo    laptop FALSE          5       5      135 FALSE   
##  7 yahoo    mobile TRUE          10       1       75 FALSE   
##  8 direct   mobile TRUE          10       1      908 FALSE   
##  9 bing     mobile FALSE          3      19      209 FALSE   
## 10 google   mobile TRUE           6       1      208 FALSE   
## # ... with 990 more rows

Data Dictionary

  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether a visit bounced (exited from landing page)
  • duration: time spent on the website (in seconds)
  • purchase: whether visitor purchased
  • n_visit: number of visits
  • n_pages: number of pages visited/browsed

Data Sanitization

Let us ensure that the data is sanitized by checking the sources of traffic and devices used to visit the site. We will use distinct to examine the values in the referrer column



distinct(ecom, referrer)
## # A tibble: 5 x 1
##   referrer
##   <fct>   
## 1 google  
## 2 yahoo   
## 3 direct  
## 4 bing    
## 5 social

and the device column as well.

distinct(ecom, device)
## # A tibble: 3 x 1
##   device
##   <fct> 
## 1 laptop
## 2 tablet
## 3 mobile

Rename Columns

Columns can be renamed using rename().



rename(ecom, time_on_site = duration)
## # A tibble: 1,000 x 7
##    referrer device bouncers n_visit n_pages time_on_site purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>        <dbl> <lgl>   
##  1 google   laptop TRUE          10       1          693 FALSE   
##  2 yahoo    tablet TRUE           9       1          459 FALSE   
##  3 direct   laptop TRUE           0       1          996 FALSE   
##  4 bing     tablet FALSE          3      18          468 TRUE    
##  5 yahoo    mobile TRUE           9       1          955 FALSE   
##  6 yahoo    laptop FALSE          5       5          135 FALSE   
##  7 yahoo    mobile TRUE          10       1           75 FALSE   
##  8 direct   mobile TRUE          10       1          908 FALSE   
##  9 bing     mobile FALSE          3      19          209 FALSE   
## 10 google   mobile TRUE           6       1          208 FALSE   
## # ... with 990 more rows

Data Tabulation

Let us now look at the proportion or share of visits driven by different sources of traffic.



ecom %>%
  group_by(referrer) %>%
  tally()
## # A tibble: 5 x 2
##   referrer     n
##   <fct>    <int>
## 1 bing       194
## 2 direct     191
## 3 social     200
## 4 yahoo      207
## 5 google     208

We would also like to know the number of bouncers driven by the different sources of traffic.

ecom %>%
  group_by(referrer, bouncers) %>%
  tally()
## # A tibble: 10 x 3
## # Groups:   referrer [?]
##    referrer bouncers     n
##    <fct>    <lgl>    <int>
##  1 bing     FALSE      104
##  2 bing     TRUE        90
##  3 direct   FALSE       98
##  4 direct   TRUE        93
##  5 social   FALSE       93
##  6 social   TRUE       107
##  7 yahoo    FALSE      110
##  8 yahoo    TRUE        97
##  9 google   FALSE      101
## 10 google   TRUE       107

Let us look at how many conversions happen across different devices.

ecom %>%
  group_by(device, purchase) %>%
  tally() %>%
  filter(purchase)
## # A tibble: 3 x 3
## # Groups:   device [3]
##   device purchase     n
##   <fct>  <lgl>    <int>
## 1 laptop TRUE        31
## 2 tablet TRUE        36
## 3 mobile TRUE        36

Another way to extract the above information is by using count

ecom %>%
  count(referrer, purchase) %>%
  filter(purchase)
## # A tibble: 5 x 3
##   referrer purchase     n
##   <fct>    <lgl>    <int>
## 1 bing     TRUE        17
## 2 direct   TRUE        25
## 3 social   TRUE        20
## 4 yahoo    TRUE        22
## 5 google   TRUE        19

Sampling Data

dplyr offers sampling functions which allow us to specify either the number or percentage of observations. sample_n() allows sampling a specific number of observations.



sample_n(ecom, 700)
## # A tibble: 700 x 7
##    referrer device bouncers n_visit n_pages duration purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
##  1 social   laptop TRUE           2       1      550 FALSE   
##  2 social   tablet TRUE           5       1      878 FALSE   
##  3 yahoo    mobile TRUE           2       1       90 FALSE   
##  4 direct   tablet TRUE           4       1      141 FALSE   
##  5 yahoo    laptop TRUE           5       1      611 FALSE   
##  6 bing     mobile FALSE         10       8      176 FALSE   
##  7 bing     tablet FALSE          0      14      280 FALSE   
##  8 direct   mobile TRUE           9       1      242 FALSE   
##  9 yahoo    laptop TRUE           2       1      506 FALSE   
## 10 yahoo    tablet FALSE          7       7      168 FALSE   
## # ... with 690 more rows

We can combine the sampling functions with other dplyr functions as shown below where we sample observation after grouping them according to the source of traffic.

ecom %>%
  group_by(referrer) %>%
  sample_n(100)
## # A tibble: 500 x 7
## # Groups:   referrer [5]
##    referrer device bouncers n_visit n_pages duration purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
##  1 bing     laptop TRUE           1       1      938 FALSE   
##  2 bing     tablet FALSE          0       3       42 FALSE   
##  3 bing     laptop TRUE           1       1      148 FALSE   
##  4 bing     mobile FALSE         10       8      152 FALSE   
##  5 bing     laptop TRUE           4       1      857 FALSE   
##  6 bing     tablet FALSE          7       2       34 FALSE   
##  7 bing     tablet TRUE           7       1      718 FALSE   
##  8 bing     mobile FALSE          1      12      168 FALSE   
##  9 bing     mobile TRUE           5       1      475 FALSE   
## 10 bing     laptop FALSE          0      18      180 FALSE   
## # ... with 490 more rows

sample_frac() allows a specific percentage of observations.

sample_frac(ecom, size = 0.7)
## # A tibble: 700 x 7
##    referrer device bouncers n_visit n_pages duration purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
##  1 bing     tablet TRUE           5       1      831 FALSE   
##  2 direct   laptop FALSE          8      20      440 FALSE   
##  3 yahoo    laptop TRUE           3       1      541 FALSE   
##  4 yahoo    mobile TRUE           4       1       69 FALSE   
##  5 social   mobile TRUE          10       1      198 FALSE   
##  6 google   mobile FALSE          7      19      456 FALSE   
##  7 direct   tablet TRUE          10       1      682 FALSE   
##  8 yahoo    laptop TRUE           3       1      373 FALSE   
##  9 social   mobile FALSE          3      13      208 FALSE   
## 10 bing     laptop TRUE           4       1      857 FALSE   
## # ... with 690 more rows

Data Extraction

In the first post, we had observed that dplyr verbs always returned a tibble. What if you want to extract a specific column or a bunch of rows but not as a tibble?

Use pull to extract columns either by name or position. It will return a vector. In the below example, we extract the device column as a vector. I am using head in addition to limit the output printed.

Sample Data

ecom_mini <- sample_n(ecom, size = 10)



pull(ecom_mini, device)
##  [1] mobile mobile tablet mobile mobile mobile tablet mobile laptop tablet
## Levels: laptop tablet mobile

Let us extract the first column from ecom using column position instead of name.

pull(ecom_mini, 1) 
##  [1] google google yahoo  direct yahoo  google google social google social
## Levels: bing direct social yahoo google

You can use - before the column position to indicate the position in reverse. The below example extracts data from the last column.

pull(ecom_mini, -1) 
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Let us now look at extracting rows using slice(). In the below example, we extract data starting from the 5th row and upto the 15th row.



slice(ecom, 5:15)
## # A tibble: 11 x 7
##    referrer device bouncers n_visit n_pages duration purchase
##    <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
##  1 yahoo    mobile TRUE           9       1      955 FALSE   
##  2 yahoo    laptop FALSE          5       5      135 FALSE   
##  3 yahoo    mobile TRUE          10       1       75 FALSE   
##  4 direct   mobile TRUE          10       1      908 FALSE   
##  5 bing     mobile FALSE          3      19      209 FALSE   
##  6 google   mobile TRUE           6       1      208 FALSE   
##  7 direct   laptop TRUE           9       1      738 FALSE   
##  8 direct   tablet FALSE          6      12      132 FALSE   
##  9 direct   mobile FALSE          9      14      406 TRUE    
## 10 yahoo    tablet FALSE          5       8       80 FALSE   
## 11 yahoo    mobile FALSE          7       1       19 FALSE

Use n() inside slice() to extract the last row.

slice(ecom, n())
## # A tibble: 1 x 7
##   referrer device bouncers n_visit n_pages duration purchase
##   <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <lgl>   
## 1 google   mobile TRUE           9       1      269 FALSE

Between

between() allows us to test if the values in a column lie between two specific values. In the below example, we check how many visits browsed pages between 5 and 15.

ecom_sample <- sample_n(ecom, 30)
  
ecom_sample %>%
  pull(n_pages) %>%
  between(5, 15) 
##  [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [12]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Case When

case_when() is an alternative to if else. It allows us to lay down the conditions clearly and makes the code more readable. In the below example, we create a new column repeat_visit from n_visit (the number of previous visits).

ecom %>%
  mutate(
    repeat_visit = case_when(
      n_visit > 0 ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  select(n_visit, repeat_visit) 
## # A tibble: 1,000 x 2
##    n_visit repeat_visit
##      <dbl> <lgl>       
##  1      10 TRUE        
##  2       9 TRUE        
##  3       0 FALSE       
##  4       3 TRUE        
##  5       9 TRUE        
##  6       5 TRUE        
##  7      10 TRUE        
##  8      10 TRUE        
##  9       3 TRUE        
## 10       6 TRUE        
## # ... with 990 more rows

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)