How to Modify Variables the Right Way in R

Posted on April 3, 2023 by R | TypeThePipe in R bloggers | 0 Comments

[This article was first published on R | TypeThePipe, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In data analysis and data science, it’s common to work with large datasets that require some form of manipulation to be useful. In this small article, we’ll explore how to create and modify columns in a dataframe using modern R tools from the tidyverse package. We can do that on several ways, so we are going from basic to advanced level.

Let’s use the starwars dataset for that purpose:

data("starwars")
head(starwars, 4)
# # A tibble: 4 × 8
# name height mass hair_color skin_color eye_color birth_year sex 
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> 
# 1 Luke Skywalker 172 77 blond fair blue 19 male 
# 2 C-3PO 167 75 NA gold yellow 112 none 
# 3 R2-D2 96 32 NA white, blue red 33 none 
# 4 Darth Vader 202 136 none white yellow 41.9 male

The most basic example is using mutate to create and modify variables.

starwars %>% 
 mutate(
 height = height * 2,
 new_numeric_column = row_number(),
 new_char_column = "This variable is new"
 ) %>% 
 select(name, height, new_numeric_column, new_char_column) %>% 
 head(4)
# # A tibble: 4 × 4
# name height new_numeric_column new_char_column 
# <chr> <dbl> <dbl> <chr> 
# 1 Luke Skywalker 344 1 This variable is new
# 2 C-3PO 334 2 This variable is new
# 3 R2-D2 192 3 This variable is new
# 4 Darth Vader 404 4 This variable is new

Note how the second variable we created is recycled to match the length of the dataset. But you already knew that, right?

Common tricks

A common trick is making use of if_else to conditionally modify some variables. I use this structure on a daily basis.

starwars %>% 
 mutate(height = if_else(hair_color == "blond", height * 10, height)) %>% 
 select(name, height, mass, hair_color) %>% 
 head(4)
# # A tibble: 4 × 8
# name height mass hair_color 
# <chr> <int> <dbl> <chr> 
# 1 Luke Skywalker 344 77 blond 
# 2 C-3PO 167 75 NA 
# 3 R2-D2 96 32 NA 
# 4 Darth Vader 202 136 none

Another common use case is to rely on the case_when function to modify the variable based on several conditions:

starwars %>% 
 mutate(mod_height = case_when(eye_color == "blue" ~ height/2,
 eye_color == "red" ~ height*2,
 eye_color == "yellow" ~ height*10,
 TRUE ~ height)) %>% 
 select(name, height, eye_color, mod_height) %>% 
 head(4)
# name height eye_color mod_height
# <chr> <int> <chr> <dbl>
# 1 Luke Skywalker 172 blue 86
# 2 C-3PO 167 yellow 1670
# 3 R2-D2 96 red 192
# 4 Darth Vader 202 yellow 2020

Note that we should end the case_when with an option that always yields TRUE because the conditions are evaluated in order. If our data doesn’t meet any condition we are leaving the column as is.

All these are fairly basic examples. Let’s go with the dplyr advanced way of creating and modifying variables.

The Advanced Way: Using across()

In modern R, we can simultaneously modify several columns at once using the verb across. We need to pass the transformation we will be performing on those variables as well. For that, we are using a lambda function which basically means that we are creating the function on-the-fly but we are not storing it.

starwars %>% 
 mutate(across(c(height, mass, birth_year), ~ .x*10)) %>% 
 select(name, height, mass, birth_year) %>% 
 head(4)
# # A tibble: 4 × 4
# name height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 Luke Skywalker 1720 770 190
# 2 C-3PO 1670 750 1120
# 3 R2-D2 960 320 330
# 4 Darth Vader 2020 1360 419

That’s quite nice but sometimes you don’t want to modify the existing columns but creating new ones. This is an important use case: batch-creating several columns at once based on the existing ones. I already discussed how to do it in How to create multiple lags like a Pro. We can use the .names argument to dynamically specify the new column names, like this:

starwars %>% 
 mutate(across(c(height, mass, birth_year), ~ .x * 100, .names = "{.col}_x100")) %>% 
 select(name, height, mass, birth_year, height_x100, mass_x100, birth_year_x100) %>% 
 head(4)
# # A tibble: 4 × 7
# name height mass birth_year height_x100 mass_x100 birth_year_x100
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Luke Skywalker 172 77 19 17200 7700 1900
# 2 C-3PO 167 75 112 16700 7500 11200
# 3 R2-D2 96 32 33 9600 3200 3300
# 4 Darth Vader 202 136 41.9 20200 13600 4190

Awesome, right? However, I still had to type them all manually. There is a better way.

The Pro Way: Using across() + tidyselectors

What if we want to modify a lot of columns? There must be a better way to avoid having to type them all… Sure there is! tidyselectors to the rescue! Those are a family of functions that allow us to dynamically select several columns based on a condition. Let’s see that with an example.

Let’s say we want to modify only the numerical variables. We can do that easily with the help of where function and the neat part is this family of functions works with several verbs of the Tidyverse. For instance, they work with select:

starwars %>% 
 select(name, where(is.numeric)) %>% 
 head(4)
# # A tibble: 4 × 4
# name height mass birth_year
# <chr> <int> <dbl> <dbl>
# 1 Luke Skywalker 172 77 19 
# 2 C-3PO 167 75 112 
# 3 R2-D2 96 32 33 
# 4 Darth Vader 202 136 41.9

But also with mutate! So combining across with where we can apply the function only over the desired columns (without having to type them!)

starwars %>% 
 mutate(across(where(is.numeric), ~ .x * 100)) %>% 
 select(name, where(is.numeric)) %>% 
 head(4)
# # A tibble: 4 × 4
# name height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 Luke Skywalker 17200 7700 1900
# 2 C-3PO 16700 7500 11200
# 3 R2-D2 9600 3200 3300
# 4 Darth Vader 20200 13600 4190

Note how the name feature hasn’t been modified, as it is not a numeric variable. This is a really handy trick specially when you are working with big datasets and need to perform an operation on many columns at once.

Also, it is worth noting that we can pass any function to across to modify the selected columns. We don’t necessarily have to define the operation with a lambda function, but any existing function can be used.

Here is another powerful example working with character columns. We can apply an existing function to make all of them uppercase:

starwars %>% 
 mutate(across(where(is.character), toupper)) %>% 
 select(where(is.character)) %>% 
 head(4)
# # A tibble: 4 × 8
# name hair_color skin_color eye_color sex gender homeworld species
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
# 1 LUKE SKYWALKER BLOND FAIR BLUE MALE MASCULINE TATOOINE HUMAN 
# 2 C-3PO NA GOLD YELLOW NONE MASCULINE TATOOINE DROID 
# 3 R2-D2 NA WHITE, BLUE RED NONE MASCULINE NABOO DROID 
# 4 DARTH VADER NONE WHITE YELLOW MALE MASCULINE TATOOINE HUMAN

Also, you don’t have to rely only on the where tidyselector, you can use many others like contains, matches, starts_with or ends_with.

Here’s another example example using ends_with:

starwars %>% 
 mutate(across(ends_with("_color"), ~ paste0("the color is ", .x))) %>% 
 select(name, ends_with("_color")) %>% 
 head(4)
# # A tibble: 4 × 4
# name hair_color skin_color eye_color 
# <chr> <chr> <chr> <chr> 
# 1 Luke Skywalker the color is blond the color is fair the color is blue 
# 2 C-3PO the color is NA the color is gold the color is yellow 
# 3 R2-D2 the color is NA the color is white, blue the color is red 
# 4 Darth Vader the color is none the color is white the color is yellow

Handy stuff, right? There is so much more possibilities to discover. You can read more about it on the across reference.

To leave a comment for the author, please follow the link and comment on their blog: R | TypeThePipe.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

How to Modify Variables the Right Way in R

Introduction

Common tricks

The Advanced Way: Using across()

The Pro Way: Using across() + tidyselectors

Related

Introduction

Common tricks

The Advanced Way: Using across()

The Pro Way: Using across() + tidyselectors

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)