In data analysis, tidying and reshaping data is often an essential step. R’s tidyr package provides tools for transforming and reshaping data, and one of its core functions is pivot_longer(). In this blog post, we’ll explore how pivot_longer() works and demonstrate its usage through several examples. By the end, you’ll have a solid understanding of how to use this function to make your data easier to work with.
The function lives in the tidyr package, so we need to load it first:
library(tidyr)
The pivot_longer() function reshapes data from a wider format to a longer format. It takes a set of columns and stacks them into key-value pairs: one new column holds the original column names and another holds the corresponding values. This makes the data easier to analyze and visualize.
Syntax: The basic syntax of
pivot_longer() is as follows:
pivot_longer(data, cols, names_to, values_to)
data: The data frame or tibble to be reshaped.
cols: The columns to pivot into longer format, given as a column selection such as -name or starts_with("product").
names_to: The name of the new column that will hold the variable names.
values_to: The name of the new column that will hold the corresponding values.
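As a minimal sketch of how these four arguments fit together, the snippet below reshapes a small hypothetical data frame. The names wide, long, id, x, y, variable, and value are made up for illustration and do not appear elsewhere in this post.

```r
library(tidyr)

# Hypothetical two-row data frame, used only to illustrate the arguments
wide <- data.frame(id = c(1, 2), x = c(10, 20), y = c(30, 40))

long <- pivot_longer(
  data = wide,            # the data frame to reshape
  cols = c(x, y),         # the columns to pivot
  names_to = "variable",  # new column holding the old column names
  values_to = "value"     # new column holding the cell values
)

long
```

Each input row contributes one output row per pivoted column, so the two rows and two pivoted columns here should yield four rows, with id repeated alongside each variable/value pair.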
Example 1: Reshaping Wide Data to Long Data
Let’s start with a simple example to demonstrate the usage of
pivot_longer(). Suppose we have a data frame called
students with columns representing subjects and their respective scores:
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math = c(90, 85, 92),
  science = c(95, 88, 91),
  history = c(87, 92, 78)
)
To reshape this data from a wider format to a longer format, we can use
pivot_longer() as follows:
students_long <- pivot_longer(
  students,
  cols = -name,
  names_to = "subject",
  values_to = "score"
)
students_long
# A tibble: 9 × 3
  name    subject score
  <chr>   <chr>   <dbl>
1 Alice   math       90
2 Alice   science    95
3 Alice   history    87
4 Bob     math       85
5 Bob     science    88
6 Bob     history    92
7 Charlie math       92
8 Charlie science    91
9 Charlie history    78
The students_long data frame has three columns: name, subject, and score, where each row represents a student’s score in a specific subject.
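One reason the long format is easier to analyze is that grouped summaries become a single call. The sketch below rebuilds the same students data and computes the mean score per subject with base R’s tapply(). The name avg_by_subject is an illustrative choice, and other approaches (for example dplyr) would work equally well.

```r
library(tidyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math = c(90, 85, 92),
  science = c(95, 88, 91),
  history = c(87, 92, 78)
)

students_long <- pivot_longer(
  students,
  cols = -name,
  names_to = "subject",
  values_to = "score"
)

# Mean score per subject: group the score column by the subject column
avg_by_subject <- tapply(students_long$score, students_long$subject, mean)
avg_by_subject
```

In the wide layout, the same summary would need one calculation per subject column. In the long layout, the subject is just another grouping variable.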
Example 2: Handling Multiple Variables
In many cases, data frames contain multiple variables that need to be pivoted simultaneously. Consider a data frame called sales with columns representing sales figures for different products in different regions:
sales <- data.frame(
  region = c("North", "South", "East"),
  product_A = c(100, 120, 150),
  product_B = c(80, 90, 110),
  product_C = c(60, 70, 80)
)
To reshape this data, we can select all of the columns to pivot at once using the starts_with() selection helper:
sales_long <- pivot_longer(
  sales,
  cols = starts_with("product"),
  names_to = "product",
  values_to = "sales"
)
sales_long
# A tibble: 9 × 3
  region product   sales
  <chr>  <chr>     <dbl>
1 North  product_A   100
2 North  product_B    80
3 North  product_C    60
4 South  product_A   120
5 South  product_B    90
6 South  product_C    70
7 East   product_A   150
8 East   product_B   110
9 East   product_C    80
The sales_long data frame has three columns: region, product, and sales, where each row represents the sales figure of a specific product in a particular region.
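The product column above still carries the product_ prefix from the original column names. As an optional refinement that goes slightly beyond the example, pivot_longer() also accepts a names_prefix argument that strips a leading pattern from the names. The sketch below rebuilds the same sales data, and the name sales_long_clean is chosen here only for illustration.

```r
library(tidyr)

sales <- data.frame(
  region = c("North", "South", "East"),
  product_A = c(100, 120, 150),
  product_B = c(80, 90, 110),
  product_C = c(60, 70, 80)
)

sales_long_clean <- pivot_longer(
  sales,
  cols = starts_with("product"),
  names_to = "product",
  names_prefix = "product_",  # remove this prefix from the new name column
  values_to = "sales"
)

sales_long_clean
```

With the prefix removed, the product column should contain just A, B, and C, which tends to read better in plot legends and summary tables.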
Example 3: Handling Irregular Data
Sometimes, data frames contain irregular structures, such as missing values or uneven numbers of columns.
pivot_longer() can handle such scenarios gracefully. Consider a data frame called
measurements with columns representing different measurement types and their respective values:
measurements <- data.frame(
  timestamp = c("2022-01-01", "2022-01-02", "2022-01-03"),
  temperature = c(25.3, 27.1, 24.8),
  humidity = c(65.2, NA, 68.5),
  pressure = c(1013, 1012, NA)
)
To reshape this data, we can use
pivot_longer() and handle the missing values:
measurements_long <- pivot_longer(
  measurements,
  cols = -timestamp,
  names_to = "measurement",
  values_to = "value",
  values_drop_na = TRUE
)
measurements_long
# A tibble: 7 × 3
  timestamp  measurement  value
  <chr>      <chr>        <dbl>
1 2022-01-01 temperature   25.3
2 2022-01-01 humidity      65.2
3 2022-01-01 pressure    1013
4 2022-01-02 temperature   27.1
5 2022-01-02 pressure    1012
6 2022-01-03 temperature   24.8
7 2022-01-03 humidity      68.5
The measurements_long data frame has three columns: timestamp, measurement, and value, where each row represents a specific measurement at a particular timestamp. Setting values_drop_na = TRUE drops every row whose value is NA, which is why the two missing readings are absent and the result has 7 rows rather than 9.
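To see the effect of values_drop_na directly, the sketch below pivots the same measurements data twice, once with the default (values_drop_na = FALSE) and once with TRUE, and compares the row counts. The names with_na and without_na are illustrative only.

```r
library(tidyr)

measurements <- data.frame(
  timestamp = c("2022-01-01", "2022-01-02", "2022-01-03"),
  temperature = c(25.3, 27.1, 24.8),
  humidity = c(65.2, NA, 68.5),
  pressure = c(1013, 1012, NA)
)

# Default behaviour: missing values are kept as explicit NA rows
with_na <- pivot_longer(
  measurements,
  cols = -timestamp,
  names_to = "measurement",
  values_to = "value"
)

# values_drop_na = TRUE: rows whose value is NA are removed
without_na <- pivot_longer(
  measurements,
  cols = -timestamp,
  names_to = "measurement",
  values_to = "value",
  values_drop_na = TRUE
)

nrow(with_na)              # 3 timestamps x 3 measurements
sum(is.na(with_na$value))  # number of missing readings kept
nrow(without_na)           # rows left after dropping them
```

Whether to drop the NA rows depends on the analysis: keeping them preserves the fact that a reading was expected but missing, while dropping them gives a more compact table.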
In this blog post, we explored the pivot_longer() function from the tidyr package, which reshapes data from a wider format to a longer format. We covered the syntax and worked through several examples to illustrate its usage. With pivot_longer() in your toolkit, you’ll be equipped to tidy your data for analysis and visualization.