How to create multiple variables with a single line of code in R

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Category

Tags

When I have a dataset with many variables and want to create a new variable for each of them, then the first thing comes into my mind is to write a new line of code for each transformation (e.g., new variable). It is time-consuming when I have more than 10 variables. Therefore, as “an advanced R user,” I will use mutate_all to create a new variable for each variable included in the dataset.

However, in most “real life” circumstances, I don't want to create a new variable for all variables in the dataset, but only for a few of them. For example, I don't want to create a new variable for ID or categorical variables. In this situation, the mutate_at became a useful function in the tidyverse package.

Let say that I want to create a tertiles for a selected list of variables in esoph dataset.

Library and data

library(tidyverse)
dat = as.data.frame(esoph)
dat
##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
## 11 25-34    80-119      30+      0         2
## 12 25-34      120+ 0-9g/day      0         1
## 13 25-34      120+    10-19      1         1
## 14 25-34      120+    20-29      0         1
## 15 25-34      120+      30+      0         2
...
...

The problem

I want to create tertiles for each variable in the dataset and want to exclude the ncases and ncontrols from the computation.

The solution

Here is the code (see below the explanation):

dat %>% 
  mutate_at(list(tertile = ~ntile(., 3)), .vars = vars(ends_with("gp"), -starts_with("nc")))
##    agegp     alcgp    tobgp ncases ncontrols agegp_tertile alcgp_tertile
## 1  25-34 0-39g/day 0-9g/day      0        40             1             1
## 2  25-34 0-39g/day    10-19      0        10             1             1
## 3  25-34 0-39g/day    20-29      0         6             1             1
## 4  25-34 0-39g/day      30+      0         5             1             1
## 5  25-34     40-79 0-9g/day      0        27             1             1
## 6  25-34     40-79    10-19      0         7             1             1
## 7  25-34     40-79    20-29      0         4             1             1
## 8  25-34     40-79      30+      0         7             1             1
## 9  25-34    80-119 0-9g/day      0         2             1             2
## 10 25-34    80-119    10-19      0         1             1             2
## 11 25-34    80-119      30+      0         2             1             2
## 12 25-34      120+ 0-9g/day      0         1             1             3
...
...

The ntile function is used to create tertiles of the variables. The ends_with will select variables of interests, and given that all ends with “gp” I used that function. The function -starts_with will exlclude the variables ncases and ncontrols.

Another example

If you want to do other computations such as standardize the variables youu can use the code below. In this example I am using the ncases and ncontrols because are continuous.

dat %>% 
  mutate_at(list(sd = ~./sd(.)), .vars = vars(-ends_with("gp"), starts_with("nc")))
##    agegp     alcgp    tobgp ncases ncontrols ncases_sd ncontrols_sd
## 1  25-34 0-39g/day 0-9g/day      0        40 0.0000000   3.14398607
## 2  25-34 0-39g/day    10-19      0        10 0.0000000   0.78599652
## 3  25-34 0-39g/day    20-29      0         6 0.0000000   0.47159791
## 4  25-34 0-39g/day      30+      0         5 0.0000000   0.39299826
## 5  25-34     40-79 0-9g/day      0        27 0.0000000   2.12219060
## 6  25-34     40-79    10-19      0         7 0.0000000   0.55019756
## 7  25-34     40-79    20-29      0         4 0.0000000   0.31439861
## 8  25-34     40-79      30+      0         7 0.0000000   0.55019756
## 9  25-34    80-119 0-9g/day      0         2 0.0000000   0.15719930
## 10 25-34    80-119    10-19      0         1 0.0000000   0.07859965
## 11 25-34    80-119      30+      0         2 0.0000000   0.15719930
## 12 25-34      120+ 0-9g/day      0         1 0.0000000   0.07859965
## 13 25-34      120+    10-19      1         1 0.3632179   0.07859965
## 14 25-34      120+    20-29      0         1 0.0000000   0.07859965
## 15 25-34      120+      30+      0         2 0.0000000   0.15719930
...
...

Thats all.

I hope you find these tips and trics useful for your data analysis.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)