Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When I have a dataset with many variables and want to create a new variable from each of them, the first thing that comes to mind is to write a new line of code for each transformation (i.e., each new variable). That is time-consuming when I have more than 10 variables. Therefore, as “an advanced R user,” I use mutate_all to create a new variable for each variable in the dataset.
However, in most “real life” circumstances, I don't want to create a new variable for every variable in the dataset, only for a few of them. For example, I don't want to create a new variable for ID or categorical variables. In this situation, mutate_at, from the dplyr package (part of the tidyverse), becomes a useful function.
Let's say that I want to create tertiles for a selected list of variables in the esoph dataset.
Library and data
library(tidyverse)
dat = as.data.frame(esoph)
dat
##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
## 11 25-34    80-119      30+      0         2
## 12 25-34      120+ 0-9g/day      0         1
## 13 25-34      120+    10-19      1         1
## 14 25-34      120+    20-29      0         1
## 15 25-34      120+      30+      0         2
...
...
The problem
I want to create tertiles for each variable in the dataset, excluding ncases and ncontrols from the computation.
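To make "tertile" concrete, here is a minimal sketch of what ntile does on its own. The vector x is a made-up toy example, not part of esoph: ntile ranks the values and splits them into the requested number of roughly equal-sized buckets.

```r
library(dplyr)

# Hypothetical toy vector, used only to illustrate ntile()
x <- c(10, 20, 30, 40, 50, 60)

# Rank the values and split them into 3 buckets of equal size
tertile <- ntile(x, 3)
tertile
# 1 1 2 2 3 3
```

The two smallest values land in bucket 1, the middle two in bucket 2, and the two largest in bucket 3.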
The solution
Here is the code (see below the explanation):
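# For every column ending in "gp", add a <column>_tertile column holding its tertile (1, 2 or 3)
dat %>%
  mutate_at(.vars = vars(ends_with("gp"), -starts_with("nc")), .funs = list(tertile = ~ ntile(., 3)))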
##    agegp     alcgp    tobgp ncases ncontrols agegp_tertile alcgp_tertile
## 1  25-34 0-39g/day 0-9g/day      0        40             1             1
## 2  25-34 0-39g/day    10-19      0        10             1             1
## 3  25-34 0-39g/day    20-29      0         6             1             1
## 4  25-34 0-39g/day      30+      0         5             1             1
## 5  25-34     40-79 0-9g/day      0        27             1             1
## 6  25-34     40-79    10-19      0         7             1             1
## 7  25-34     40-79    20-29      0         4             1             1
## 8  25-34     40-79      30+      0         7             1             1
## 9  25-34    80-119 0-9g/day      0         2             1             2
## 10 25-34    80-119    10-19      0         1             1             2
## 11 25-34    80-119      30+      0         2             1             2
## 12 25-34      120+ 0-9g/day      0         1             1             3
...
...
The ntile function creates the tertiles of the variables. ends_with selects the variables of interest; since they all end with “gp”, I used that helper. -starts_with excludes the variables ncases and ncontrols. Strictly speaking, the exclusion is redundant here because neither name ends with “gp”, but it makes the intent explicit.
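Since dplyr 1.0.0, mutate_at has been superseded by across(), although it still works. The following is a sketch of what I believe is the equivalent call, assuming dplyr 1.0.0 or later is installed. The res name is only for illustration.

```r
library(dplyr)

dat <- as.data.frame(esoph)

# across() takes the column selection first, then the function to apply;
# .names controls how the new columns are named ({.col} is the source column)
res <- dat %>%
  mutate(across(ends_with("gp"), ~ ntile(.x, 3), .names = "{.col}_tertile"))

names(res)
```

Because the selection is written positively (only columns ending in "gp"), no explicit exclusion of ncases and ncontrols is needed.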
Another example
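If you want to do other computations, such as standardizing the variables, you can use the code below. In this example I use ncases and ncontrols because they are numeric. Note that dividing by the standard deviation rescales each variable but does not center it.
# Divide ncases and ncontrols by their standard deviation; results go to <column>_sd
dat %>%
  mutate_at(.vars = vars(-ends_with("gp"), starts_with("nc")), .funs = list(sd = ~ . / sd(.)))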
##    agegp     alcgp    tobgp ncases ncontrols ncases_sd ncontrols_sd
## 1  25-34 0-39g/day 0-9g/day      0        40 0.0000000   3.14398607
## 2  25-34 0-39g/day    10-19      0        10 0.0000000   0.78599652
## 3  25-34 0-39g/day    20-29      0         6 0.0000000   0.47159791
## 4  25-34 0-39g/day      30+      0         5 0.0000000   0.39299826
## 5  25-34     40-79 0-9g/day      0        27 0.0000000   2.12219060
## 6  25-34     40-79    10-19      0         7 0.0000000   0.55019756
## 7  25-34     40-79    20-29      0         4 0.0000000   0.31439861
## 8  25-34     40-79      30+      0         7 0.0000000   0.55019756
## 9  25-34    80-119 0-9g/day      0         2 0.0000000   0.15719930
## 10 25-34    80-119    10-19      0         1 0.0000000   0.07859965
## 11 25-34    80-119      30+      0         2 0.0000000   0.15719930
## 12 25-34      120+ 0-9g/day      0         1 0.0000000   0.07859965
## 13 25-34      120+    10-19      1         1 0.3632179   0.07859965
## 14 25-34      120+    20-29      0         1 0.0000000   0.07859965
## 15 25-34      120+      30+      0         2 0.0000000   0.15719930
...
...
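If full standardization is wanted (centering on the mean as well as dividing by the standard deviation), a sketch along the following lines should work. It assumes dplyr 1.0.0 or later for across(); z_score is a hypothetical helper defined here, not a dplyr function.

```r
library(dplyr)

dat <- as.data.frame(esoph)

# Hypothetical helper: subtract the mean, then divide by the standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

# Apply the helper to ncases and ncontrols, writing to <column>_z
res <- dat %>%
  mutate(across(starts_with("nc"), z_score, .names = "{.col}_z"))

# Each z-scored column should have mean close to 0 and standard deviation 1
sapply(res[c("ncases_z", "ncontrols_z")], mean)
sapply(res[c("ncases_z", "ncontrols_z")], sd)
```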
That's all.
I hope you find these tips and tricks useful for your data analysis.
