Create New Variables in R with dplyr

Posted on December 20, 2023 by Zubair Goraya in R bloggers | 0 Comments

[This article was first published on RStudioDataLab, and kindly contributed to R-bloggers.]

Key Takeaways

The mutate function from the dplyr package allows you to create new variables or modify existing variables in a data frame or a tibble in R.
The scoped variants of mutate (mutate_all, mutate_at, and mutate_if) and the across function used inside mutate allow you to apply a function to all variables, to selected variables, or to variables that meet a condition.
The case_when function allows you to create new variables based on multiple conditions, using a logical expression and a value for each case.
The glimpse and kable functions and the ggplot2 package allow you to display your data as tables and graphs, with some formatting options.
The mutate function is useful for data analysis in R because it allows you to manipulate your data flexibly, consistently, and efficiently and works well with other dplyr functions and packages.


Hi, I’m Zubair Goraya, a data analyst with 5 years of experience. I love writing about data analysis in R. I will explain how to create new variables in R with dplyr in this article. dplyr is a package that provides functions for manipulating data frames and tibbles in R.

Tibbles are a modern reimagining of data frames that are more consistent and convenient. dplyr functions are designed to be easy to use, fast, and consistent, and they follow the principle of “tidy data”, which means that each variable is a column, each observation is a row, and each value is a cell.

mutate is one of the main functions of dplyr, and it allows you to create new variables or modify existing variables in a data frame or a tibble. You can use mutate to perform various operations on your data, such as calculations, transformations, conditions, combinations, etc. mutate also works well with other dplyr functions, such as group_by, summarise, filter, arrange, etc.

Basics of mutate

How to install and load dplyr

To use dplyr, you need to install it first. You can do that by running the following code in R:

install.packages("dplyr")

Then, you need to load it into your R session. You can do that by running the following code in R:

library(dplyr)

How to create a data frame or a tibble

To use mutate, you must have a data frame or a tibble. You can create one from scratch or use an existing one from R or an external source. For this tutorial, I will create a data frame from scratch using the following code in R:

# Create a data frame with 10 rows and 4 columns
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Harry", "Ivy", "Jack"),
  age = c(25, 32, 28, 24, 27, 29, 31, 26, 30, 33),
  gender = c("F", "M", "M", "M", "F", "M", "F", "M", "F", "M"),
  score = c(85, 76, 92, 81, 88, 79, 94, 83, 90, 86)
)

This creates a data frame called df, with 10 rows and 4 columns. The columns are name, age, gender, and score, and they contain made-up values typed in by hand.

How to use mutate to create a new variable

To create a new variable with mutate, you need to use the following syntax:

mutate(data, new_variable = expression)

Where

data is the name of the data frame or the tibble,
new_variable is the new variable's name,
expression is the formula or function that defines the new variable.

For example, if you want to create a new variable called grade based on the score variable, using the following criteria:

If the score is greater than or equal to 90, then the grade is A
If the score is between 80 and 89, then the grade is B
If the score is between 70 and 79, then the grade is C
If the score is less than 70, then the grade is D

You can use the following code in R:

# Create a new variable called grade with mutate
df %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 & score < 90 ~ "B",
    score >= 70 & score < 80 ~ "C",
    score < 70 ~ "D"
  ))

This creates a new variable, grade, and prints the result. The pipe operator (%>%) comes from the magrittr package and is made available when you load dplyr. It lets you chain multiple functions together without repeating the data frame name. Because the result is not assigned, df itself is unchanged.

The case_when function allows you to create a new variable based on multiple conditions, using the tilde (~) to separate the condition and the value.
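As a minimal sketch of how the conditions are evaluated, the example below uses a small made-up data frame called scores (not the df from this article) with one missing value. It assumes you want a catch-all case, written here as TRUE ~ "D", and a separate label for missing scores. case_when checks the conditions in order and uses the first one that is TRUE, which is why the later conditions do not need an upper bound.

```r
library(dplyr)

# Small illustrative data frame (hypothetical, not the article's df)
scores <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "Dana"),
  score = c(95, 82, 65, NA)
)

graded <- scores %>%
  mutate(grade = case_when(
    is.na(score) ~ "Missing", # handle NA first so the fallback does not catch it
    score >= 90  ~ "A",
    score >= 80  ~ "B",       # only reached when score < 90
    score >= 70  ~ "C",       # only reached when score < 80
    TRUE         ~ "D"        # fallback for everything that did not match above
  ))

graded$grade
```

Without a fallback or an explicit NA case, rows that match no condition get NA in the new variable.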

How to use mutate to modify an existing variable

To modify an existing variable with mutate, you need to use the same syntax as creating a new variable but use the name of the existing variable instead of a new variable. For example, if you want to modify the age variable by adding 1 to each value, you can use the following code in R:

# Modify the age variable with mutate
df %>%
  mutate(age = age + 1)

This adds 1 to every value of age and prints the result. To keep the change, assign the result back to df.
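One detail worth checking yourself: mutate returns a new data frame rather than changing the original in place, so the result has to be assigned with <- if you want to keep it. The sketch below shows this on a tiny made-up data frame called people, which is only an illustration.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical)
people <- data.frame(name = c("Alice", "Bob"), age = c(25, 32))

# Without assignment the result is only printed; people is unchanged
people %>% mutate(age = age + 1)
unchanged_age <- people$age

# Assign the result to keep the modified variable
people <- people %>% mutate(age = age + 1)
people$age
```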

How to use mutate to create multiple new variables

To create multiple new variables with mutate, you need to use the same syntax as creating a single new variable but separate each new variable with a comma. For example, if you want to create two new variables called height and weight, based on some random values, you can use the following code in R:

# Create two new variables called height and weight with mutate
df %>%
  mutate(
    height = runif(n = n(), min = 150, max = 200), # Generate a random number between 150 and 200 for each row
    weight = runif(n = n(), min = 50, max = 100) # Generate a random number between 50 and 100 for each row
  )

This creates two new variables, height and weight, and prints the result; assign the result to df if you want to keep them. The runif function generates random numbers between a minimum and a maximum value, and the n function returns the number of rows in the current data frame (or group), so each row gets its own value.
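Because runif draws new random numbers on every call, the height and weight values change each time the code runs. If you need the same values every time, one option is to call set.seed first. The sketch below assumes a made-up data frame called people and a hypothetical helper, add_height, that exists only for this illustration.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical)
people <- data.frame(name = c("Alice", "Bob", "Charlie"))

# Hypothetical helper: fixes the seed, then adds a random height per row
add_height <- function(data) {
  set.seed(42) # fix the random number generator so runif() repeats its draws
  data %>% mutate(height = runif(n = n(), min = 150, max = 200))
}

first_run  <- add_height(people)
second_run <- add_height(people)

# TRUE: both runs produce the same random heights
identical(first_run, second_run)
```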

How to use mutate to delete a variable

To delete a variable with mutate, you need to use the same syntax as creating a new variable but use NULL as the expression. For example, if you want to delete the score variable, you can use the following code in R:

# Delete the score variable with mutate
df %>%
  mutate(score = NULL)

This returns a copy of df without the score variable. To make the deletion permanent, assign the result back to df.
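Setting a variable to NULL is one way to drop it. Another common option in dplyr is select with a minus sign. The sketch below, on a made-up data frame called people, suggests that both approaches leave the same columns behind.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical)
people <- data.frame(
  name  = c("Alice", "Bob"),
  age   = c(25, 32),
  score = c(85, 76)
)

via_mutate <- people %>% mutate(score = NULL) # drop score by setting it to NULL
via_select <- people %>% select(-score)       # drop score by deselecting it

names(via_mutate)
names(via_select)
```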

Advanced features of mutate

How to use mutate_all to apply a function to all variables

To apply a function to all variables with mutate, you need to use the mutate_all function, a mutate variant. You need to use the following syntax:

mutate_all(data, function)

Where

data is the name of the data frame or the tibble,
function is the function name you want to apply to all variables.

For example, if you want to round all the numeric variables to the nearest integer, you can use the following code in R:

# Round all the numeric variables with mutate_all
df %>% select_if(is.numeric) %>% 
  mutate_all(round)

This keeps only the numeric variables (age and score), rounds them, and prints the result; df itself is unchanged unless you assign the result. The round function rounds a number to the nearest integer or to a specified number of decimal places.
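In dplyr 1.0.0 and later, mutate_all, mutate_at, and mutate_if are superseded by across, which is used inside a regular mutate call. The sketch below assumes a made-up data frame called measures and rounds only the numeric variables while leaving the other columns in place, so no select_if step is needed.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical)
measures <- data.frame(
  name   = c("Alice", "Bob"),
  height = c(170.46, 182.51),
  weight = c(60.23, 81.78)
)

# where(is.numeric) picks the numeric columns; round is applied to each of them
rounded <- measures %>%
  mutate(across(where(is.numeric), round))

rounded
```

The name column is kept unchanged because it does not satisfy is.numeric.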

How to use mutate_at to apply a function to selected variables

To apply a function to selected variables with mutate, you need to use the mutate_at function, another mutate variant. You need to use the following syntax:

mutate_at(data, vars, function)

Where

data is the name of the data frame or the tibble,
vars is a vector of variable names or positions you want to select,
function is the name of the function you wish to apply to the selected variables.

For example, if you want to convert the gender variable to uppercase, you can use the following code in R:

# Convert the gender variable to uppercase with mutate_at
df %>%
  mutate_at(vars(gender), toupper)

This converts the gender variable to uppercase and prints the result. The toupper function converts a character string to uppercase. Inside vars, you can select variables by name or with helper functions, such as starts_with, ends_with, contains, and matches.
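To illustrate the selection helpers, here is a short sketch on a made-up data frame called survey, whose column names are invented for this example. It selects every column whose name starts with "code_", first with mutate_at and then with the across form available in dplyr 1.0.0 and later.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical column names)
survey <- data.frame(
  id          = 1:2,
  code_region = c("eu", "us"),
  code_unit   = c("kg", "lb"),
  note        = c("ok", "check")
)

# Scoped variant: select with vars() and a helper
upper_at <- survey %>%
  mutate_at(vars(starts_with("code_")), toupper)

# Equivalent selection with across() inside mutate()
upper_across <- survey %>%
  mutate(across(starts_with("code_"), toupper))

upper_at
```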

How to use mutate_if to apply a function to variables that meet a condition

To apply a function to variables that meet a condition with mutate, you need to use the mutate_if function, another mutate variant. You need to use the following syntax:

mutate_if(data, predicate, function)

Where

data is the name of the data frame or the tibble,
predicate is a function, such as is.character, that returns TRUE or FALSE for each variable and so defines the condition,
function is the function name you want to apply to the variables that meet the condition.

For example, if you want to convert all the character variables to lowercase, you can use the following code in R:

# Convert all the character variables to lowercase with mutate_if
df %>%
  mutate_if(is.character, tolower)

This converts all the character variables (name and gender) to lowercase and prints the result. The is.character function checks whether a variable is of character type, and the tolower function converts a character string to lowercase.

You can also use other functions to define the predicate, such as is.numeric, is.factor, is.logical, etc.
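As a sketch of a different predicate, the example below uses is.factor to convert factor variables to character while leaving everything else alone. The people data frame is made up for this illustration, with gender stored as a factor.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical); gender is stored as a factor
people <- data.frame(
  name   = c("Alice", "Bob"),
  gender = factor(c("F", "M")),
  score  = c(85, 76)
)

# Only columns for which is.factor() returns TRUE are converted
converted <- people %>%
  mutate_if(is.factor, as.character)

sapply(converted, class)
```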

How to use mutate with group_by functions

To perform more complex data manipulation tasks, you can combine mutate with other dplyr functions, such as group_by, summarise, filter, arrange, etc. For example, if you want to create a new variable called rank, which shows the rank of each person based on their score within each gender group, you can use the following code in R:

# Create a new variable called rank with mutate and other dplyr functions
df %>%
  group_by(gender) %>% # Group the data by gender
  mutate(rank = rank(-score)) %>% # Create a new variable called rank, which is the rank of each person based on their score, within each gender group
  ungroup() # Ungroup the data

This creates a new variable called rank and prints the result. The group_by function groups the data by one or more variables so that mutate works within each group, and the ungroup function removes the grouping afterwards.

The rank function will allow you to rank the values of a variable, and the minus sign (-) allows you to rank them in descending order.
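When two people have the same score, base R's rank gives them the average of the tied positions, which can produce values such as 2.5. If you prefer whole-number ranks, dplyr's min_rank combined with desc is one alternative. The sketch below compares the two on a made-up data frame called results that contains a tie; the column names rank_avg and rank_min are chosen only for this example.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical) with a tie at 88
results <- data.frame(
  name   = c("Alice", "Eve", "Grace", "Ivy"),
  gender = "F",
  score  = c(85, 88, 88, 90)
)

ranked <- results %>%
  group_by(gender) %>%
  mutate(
    rank_avg = rank(-score),          # ties share the average position
    rank_min = min_rank(desc(score))  # ties share the lowest position
  ) %>%
  ungroup()

ranked
```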

Examples and code snippets

How to create a new variable based on a condition

You can use the case_when function to create a new variable based on a condition, as shown in the previous example of creating the grade variable. Here is another example of creating a new variable called pass, which shows whether the person passed or failed the test based on the score variable, using the following criteria:

If the score is greater than or equal to 80, then pass is “Yes”
If the score is less than 80, then pass is “No”

# Create a new variable called pass with case_when
df %>%
  mutate(pass = case_when(
         score >= 80 ~ "Yes",
         score < 80 ~ "No"))

It will create a new pass variable and print the result.
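For a simple two-way condition like this one, dplyr's if_else is a shorter alternative to case_when. The sketch below uses a made-up data frame called scores with one missing value, to show that a missing score gives a missing pass value rather than "No".

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical); one score is missing
scores <- data.frame(
  name  = c("Alice", "Bob", "Charlie"),
  score = c(85, 76, NA)
)

# if_else(condition, value if TRUE, value if FALSE); NA conditions stay NA
result <- scores %>%
  mutate(pass = if_else(score >= 80, "Yes", "No"))

result
```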

How to create a new variable based on a calculation

You can use any arithmetic or mathematical operators or functions to create a new variable based on a calculation, as shown in the previous example of modifying the age variable. Here is another example of creating a new variable called bmi, which shows the body mass index of each person based on the height and weight variables, using the following formula: bmi = weight / (height / 100)^2, where weight is in kilograms and height is in centimeters (dividing by 100 converts it to meters).

# Create a new variable called bmi with a calculation
# Set seed for reproducibility
set.seed(123)
# Generate dummy data
num_rows <- 100
weights <- runif(num_rows, min = 50, max = 100)  # Generating random weight values
heights <- runif(num_rows, min = 150, max = 190)  # Generating random height values
# Create a data frame with the generated data
df1 <- tibble(weight = weights, height = heights)
# Use the provided code to calculate BMI and create a new column bmi
df1 %>% mutate(bmi = weight / (height / 100) ^ 2)

It will create a new variable called bmi in the df1 tibble and print the result; assign the result back to df1 if you want to keep the new column. The ^ operator raises a number to a power.
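To check the calculation by hand, here is a small worked sketch with a single made-up person who weighs 70 kg and is 175 cm tall. The height in meters is 1.75, so the expected value is 70 / 1.75^2 = 70 / 3.0625, which is about 22.86.

```r
library(dplyr)

# One illustrative person (hypothetical values): 70 kg, 175 cm
person <- tibble(weight = 70, height = 175)

person_bmi <- person %>%
  mutate(bmi = weight / (height / 100)^2) # height / 100 converts cm to m

round(person_bmi$bmi, 2) # 22.86
```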

How to create a new variable based on a transformation

You can use any transformation functions, such as log, exp, sqrt, etc., to create a new variable based on a transformation, as shown in the previous example of rounding all the numeric variables. Here is another example of creating a new variable called log_score, which shows the natural logarithm of the score variable using the following code in R:

# Create a new variable called log_score with a transformation
df %>%
  mutate(log_score = log(score))

It will create a new variable called log_score and print the result. The log function calculates the natural logarithm of a number.

How to create a new variable based on a combination of other variables

You can use any operators or functions that combine or concatenate variables to create a new variable from several existing ones, as shown in the previous example of creating the rank variable.

Here is another example of creating a new variable called id, which gives an identifier for each person based on the name and age variables, using the following code in R:

# Create a new variable called id with a combination of other variables
df %>%
  mutate(id = paste0(name, "_", age))

This creates a new variable, id, and prints the result. The paste0 function concatenates character strings without inserting a separator of its own, so the underscore (_) is supplied explicitly as the separator.
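An id built this way is only unique if no two rows share the same name and age. As a rough sketch, you could check that with n_distinct. The people data frame below is made up and deliberately contains two people named Alice with different ages.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical); two rows share a name
people <- data.frame(
  name = c("Alice", "Bob", "Alice"),
  age  = c(25, 32, 30)
)

with_id <- people %>%
  mutate(id = paste0(name, "_", age))

with_id$id

# TRUE when every row has a different id
ids_are_unique <- n_distinct(with_id$id) == nrow(with_id)
ids_are_unique
```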

Visuals and tables

How to use the glimpse function to see the structure of a data frame

You can use the glimpse function to see the structure of a data frame or a tibble: the number of rows and columns, plus the name, type, and first few values of each variable. You need to use the following syntax:

glimpse(data)

Where data is the name of the data frame or the tibble. For example, if you want to see the structure of the df data frame, you can use the following code in R:

# See the structure of the df data frame with glimpse
glimpse(df)

It will show the number of rows and columns, followed by each variable's name, type, and first few values.

How to use the kable function to display a data frame as a table

You can use the kable function from the knitr package to display a data frame or a tibble as a table, with some formatting options. You need to use the following syntax:

kable(data, format, caption, align, col.names, row.names, digits, etc.)

Where

data is the name of the data frame or the tibble,
format is the output format, such as “markdown”, “html”, “latex”, etc.,
caption is the title of the table,
align is the alignment of the columns, such as “l” for left, “r” for right, “c” for center, etc.,
col.names is the vector of column names,
row.names is a logical value that controls whether row names are included,
digits is the number of decimal places to show, etc.

For example, if you want to display the df data frame as a markdown table, with a caption, a center alignment, and two decimal places, you can use the following code in R:

# Display the df data frame as a markdown table with kable
library(knitr) # Load the knitr package
kable(df, format = "markdown", caption = "A data frame with 10 rows and 4 columns", align = "c", digits = 2)

The df data frame will be displayed as a markdown table, with the specified options.
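As a further sketch of the formatting options, the example below sets custom column headers, per-column alignment, and rounding on a made-up data frame called people. It assumes a knitr version that supports the "pipe" format (1.29 or later, as far as I know), which produces a pipe-style markdown table.

```r
library(knitr)

# Tiny illustrative data frame (hypothetical)
people <- data.frame(
  name  = c("Alice", "Bob"),
  score = c(85.456, 76.123)
)

tbl <- kable(
  people,
  format    = "pipe",             # pipe-style markdown table
  col.names = c("Name", "Score"), # replace the header text
  align     = c("l", "r"),        # left-align names, right-align scores
  digits    = 1                   # round numeric columns to one decimal place
)

tbl
```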

How to use the ggplot2 package to create graphs from a data frame

With some customization options, you can use the ggplot2 package to create graphs from a data frame or a tibble. You need to use the following syntax:

ggplot(data, aes(x, y, color, fill, etc.)) + geom_point, geom_line, geom_bar, etc. + labs(title, x, y, etc.) + theme, scale, etc.

Where

data is the name of the data frame or the tibble,
aes is the aesthetic mapping that defines the variables to plot, such as x, y, color, fill, etc.,
geom_point, geom_line, geom_bar, etc. are the geometric objects that define the type of plot, such as point, line, bar, etc.,
labs sets the labels for the title, x-axis, y-axis, etc.,
theme, scale, etc. are the options for the appearance, such as theme, scale, etc.

For example, if you want to create a scatter plot of the height and weight variables, with the color and shape based on the gender variable and the size based on the score variable, you can use the following code in R:

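# Attach a height and weight to each of the 10 people, using the first 10 rows of df1
df <- cbind(df, head(df1, nrow(df)))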
# Create a scatter plot of the height and weight variables with ggplot2
library(ggplot2) # Load the ggplot2 package
ggplot(df, aes(x = height, y = weight, color = gender, shape = gender, size = score)) + # Define the data and the aesthetic mapping
  geom_point() + # Define the geometric object as point
  labs(title = "Height vs Weight Scatter Plot", x = "Height (cm)", y = "Weight (kg)") + # Define the labels for the title and the axes
  theme_bw() # Define the theme as black and white

It will create a scatter plot of the height and weight variables with the specified options.

Conclusion

In this article, I have introduced the mutate function from the dplyr package. You have learned how to use mutate to create new variables or modify existing ones in a data frame or tibble. You have also learned how to use the variants of mutate, such as mutate_all, mutate_at, and mutate_if. Finally, you have learned how to use visuals and tables to show the input and output of mutate.

To use mutate, follow the steps and syntax in this article. You can also refer to the examples and code snippets. Explore the documentation and vignettes of the dplyr package to learn more about mutate and its variants. Practice using mutate with your own data sets.

Thank you for reading this article. I hope you have enjoyed learning to use mutate to create new variables in R with dplyr.

If you have any questions, comments, or feedback, please leave them below.

Frequently Asked Questions (FAQs)

How do I create a new variable based on a condition in R?

You can use the case_when function inside mutate to create a new variable based on a condition in R. You need to use the following syntax:

mutate(data, name = case_when(condition1 ~ value1, condition2 ~ value2, etc.))

where data is the name of the data frame or the tibble, name is the name of the new variable, condition1, condition2, etc. are the logical expressions that define the conditions, and value1, value2, etc. are the values for each case.

For example

mutate(df, grade = case_when(score >= 90 ~ "A", score >= 80 ~ "B", score >= 70 ~ "C", TRUE ~ "D"))

will create a new variable called grade in the df data frame based on the value of the score variable.

What command will create new variables with functions of existing variables using dplyr?

You can use any arithmetic or mathematical operators or functions inside mutate to create new variables with functions of existing variables using dplyr. You need to use the existing variables as arguments for the functions.

For example

mutate(df, log_score = log(score))

will create a new variable called log_score in the df data frame based on the natural logarithm of the score variable.

What does %>% mean in R?

The %>% operator, or pipe operator, means “then” in R.

It allows you to chain multiple functions together without nesting them or creating intermediate objects. It passes the output of the left-hand side as the first argument of the right-hand side.

For example

df %>% mutate(age = age + 1) %>% filter(age > 30)

means “take the df data frame, then mutate the age variable by adding 1, then filter the rows where age is greater than 30”.
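To make the "then" reading concrete, here is a small sketch comparing the piped version with the equivalent nested calls on a made-up data frame called people. Both forms should give the same result; the pipe simply lets you read the steps from left to right.

```r
library(dplyr)

# Tiny illustrative data frame (hypothetical)
people <- data.frame(
  name = c("Alice", "Bob", "Jack"),
  age  = c(25, 32, 33)
)

# Piped: read left to right
piped <- people %>%
  mutate(age = age + 1) %>%
  filter(age > 30)

# Nested: the same steps, read from the inside out
nested <- filter(mutate(people, age = age + 1), age > 30)

identical(piped, nested)
```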

What is the use of dplyr in R?

The dplyr package is a powerful and user-friendly tool for data manipulation in R. It provides a consistent and intuitive set of functions to perform common data manipulation tasks, such as selecting, filtering, grouping, summarizing, arranging, joining, and mutating data.

It also works well with other packages, such as tidyr, ggplot2, and knitr, to enable a tidy and reproducible data analysis workflow in R.

How do you create a variable?

You can create a variable by assigning a value to a name, using the assignment operator (<- or =) in R. For example

x <- 10

will create a variable called x, with the value 10. As explained above, you can also create a variable by using the mutate function from the dplyr package.

How to add two variables in R?

You can add two variables in R using the addition operator (+).

For example

y <- x + 5

will create a variable called y, with the value of x plus 5. As explained above, you can also add two variables inside the mutate function from the dplyr package.

How do you create a new variable in the data step?

The data step is a part of the SAS programming language, which differs from R. However, you can create a new variable in the data step by using the assignment statement, similar to R.

For example, data new; set old; z = x + y; run; will create a new data set called new, based on the old data set, and create a new variable called z, with the value of x plus y.

What function helps create new variables in R?

As explained above, the mutate function from the dplyr package helps create new variables in R. You can also use other functions, such as case_when, paste, log, round, etc., inside mutate to create new variables based on conditions, combinations, transformations, etc.

What is the command used to create a new variable?

The command to create a new variable depends on the programming language and the package you are using. In R, you can use the assignment operator (<- or =) or the mutate function from the dplyr package to create a new variable, as explained above.

How do I rename a variable in R using dplyr?

You can rename a variable in R using dplyr by using the rename function. You need to use the following syntax:

rename(data, new_name = old_name)

where data is the name of the data frame or the tibble, new_name is the new name of the variable, and old_name is the old name of the variable.

For example

rename(df, score_new = score)

will rename the score variable to score_new in the df data frame.

How do I create a new variable in R with mutate?

You can use the mutate function from the dplyr package to create a new variable in R. You need to use the following syntax:

mutate(data, name = value)

Where data is the name of the data frame or the tibble, name is the name of the new variable, and value is the expression that defines the value of the new variable. For example

mutate(df, random_score = 100 * runif(n()))

will create a new variable called random_score in the df data frame, with random values between 0 and 100.

Which dplyr operation is used to add new variables to a data set?

The mutate function is the dplyr operation that adds new variables to a data set. You can also use the scoped variants of mutate (mutate_all, mutate_at, and mutate_if), or the across function inside mutate, to apply a function to all variables, to selected variables, or to variables that meet a condition.
