Efficiently merging multiple datasets
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee that shall be used to continue my Open Source efforts. If you need an R package or Shiny dashboard for your team, you can email me or inquiry on Fiverr. The full explanation is here: A Personal Message from an Open Source Contributor
You can send me questions for the blog using this form and subscribe to receive an email when there is a new post.
I tend to observe that my students repeat a lot of code when we have partitioned data across different tables. For example, when we have to read multiple CSV files that result in a structure similar to the following tables created by intentionally subsetting the original Palmer’s Penguins dataset:
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets': penguins, penguins_raw
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats': filter, lag
The following objects are masked from 'package:base': intersect, setdiff, setequal, union
library(purrr) species <- penguins %>% distinct(species) %>% pull() map( species, function(s) { assign( paste0("penguins_", tolower(s)), filter(penguins, species == s), envir = .GlobalEnv ) } )
[[1]] # A tibble: 152 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 142 more rows # ℹ 2 more variables: sex <fct>, year <int> [[2]] # A tibble: 124 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Gentoo Biscoe 46.1 13.2 211 4500 2 Gentoo Biscoe 50 16.3 230 5700 3 Gentoo Biscoe 48.7 14.1 210 4450 4 Gentoo Biscoe 50 15.2 218 5700 5 Gentoo Biscoe 47.6 14.5 215 5400 6 Gentoo Biscoe 46.5 13.5 210 4550 7 Gentoo Biscoe 45.4 14.6 211 4800 8 Gentoo Biscoe 46.7 15.3 219 5200 9 Gentoo Biscoe 43.3 13.4 209 4400 10 Gentoo Biscoe 46.8 15.4 215 5150 # ℹ 114 more rows # ℹ 2 more variables: sex <fct>, year <int> [[3]] # A tibble: 68 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Chinstrap Dream 46.5 17.9 192 3500 2 Chinstrap Dream 50 19.5 196 3900 3 Chinstrap Dream 51.3 19.2 193 3650 4 Chinstrap Dream 45.4 18.7 188 3525 5 Chinstrap Dream 52.7 19.8 197 3725 6 Chinstrap Dream 45.2 17.8 198 3950 7 Chinstrap Dream 46.1 18.2 178 3250 8 Chinstrap Dream 51.3 18.2 197 3750 9 Chinstrap Dream 46 18.9 195 4150 10 Chinstrap Dream 51.3 19.9 198 3700 # ℹ 58 more rows # ℹ 2 more variables: sex <fct>, year <int>
penguins_adelie
# A tibble: 152 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 142 more rows # ℹ 2 more variables: sex <fct>, year <int>
penguins_chinstrap
# A tibble: 68 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Chinstrap Dream 46.5 17.9 192 3500 2 Chinstrap Dream 50 19.5 196 3900 3 Chinstrap Dream 51.3 19.2 193 3650 4 Chinstrap Dream 45.4 18.7 188 3525 5 Chinstrap Dream 52.7 19.8 197 3725 6 Chinstrap Dream 45.2 17.8 198 3950 7 Chinstrap Dream 46.1 18.2 178 3250 8 Chinstrap Dream 51.3 18.2 197 3750 9 Chinstrap Dream 46 18.9 195 4150 10 Chinstrap Dream 51.3 19.9 198 3700 # ℹ 58 more rows # ℹ 2 more variables: sex <fct>, year <int>
penguins_gentoo
# A tibble: 124 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Gentoo Biscoe 46.1 13.2 211 4500 2 Gentoo Biscoe 50 16.3 230 5700 3 Gentoo Biscoe 48.7 14.1 210 4450 4 Gentoo Biscoe 50 15.2 218 5700 5 Gentoo Biscoe 47.6 14.5 215 5400 6 Gentoo Biscoe 46.5 13.5 210 4550 7 Gentoo Biscoe 45.4 14.6 211 4800 8 Gentoo Biscoe 46.7 15.3 219 5200 9 Gentoo Biscoe 43.3 13.4 209 4400 10 Gentoo Biscoe 46.8 15.4 215 5150 # ℹ 114 more rows # ℹ 2 more variables: sex <fct>, year <int>
One common operation I have seen when my students use dplyr is a variation of this:
penguins_adelie %>% bind_rows(penguins_chinstrap) %>% bind_rows(penguins_gentoo)
# A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 334 more rows # ℹ 2 more variables: sex <fct>, year <int>
bind_rows(penguins_adelie, penguins_chinstrap, penguins_gentoo)
# A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 334 more rows # ℹ 2 more variables: sex <fct>, year <int>
A similar pattern appears when they decide to use base R with something like this
rbind(penguins_adelie, penguins_chinstrap, penguins_gentoo)
# A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 334 more rows # ℹ 2 more variables: sex <fct>, year <int>
There are many ways to achieve the same result in R, and one option is to use the “{}” operator:
# either {penguins_adelie; penguins_chinstrap; penguins_gentoo} %>% list() %>% bind_rows()
# A tibble: 124 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Gentoo Biscoe 46.1 13.2 211 4500 2 Gentoo Biscoe 50 16.3 230 5700 3 Gentoo Biscoe 48.7 14.1 210 4450 4 Gentoo Biscoe 50 15.2 218 5700 5 Gentoo Biscoe 47.6 14.5 215 5400 6 Gentoo Biscoe 46.5 13.5 210 4550 7 Gentoo Biscoe 45.4 14.6 211 4800 8 Gentoo Biscoe 46.7 15.3 219 5200 9 Gentoo Biscoe 43.3 13.4 209 4400 10 Gentoo Biscoe 46.8 15.4 215 5150 # ℹ 114 more rows # ℹ 2 more variables: sex <fct>, year <int>
# or { penguins_adelie penguins_chinstrap penguins_gentoo } %>% bind_rows()
# A tibble: 124 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Gentoo Biscoe 46.1 13.2 211 4500 2 Gentoo Biscoe 50 16.3 230 5700 3 Gentoo Biscoe 48.7 14.1 210 4450 4 Gentoo Biscoe 50 15.2 218 5700 5 Gentoo Biscoe 47.6 14.5 215 5400 6 Gentoo Biscoe 46.5 13.5 210 4550 7 Gentoo Biscoe 45.4 14.6 211 4800 8 Gentoo Biscoe 46.7 15.3 219 5200 9 Gentoo Biscoe 43.3 13.4 209 4400 10 Gentoo Biscoe 46.8 15.4 215 5150 # ℹ 114 more rows # ℹ 2 more variables: sex <fct>, year <int>
The previous examples quickly become impractical when dealing with a large number of datasets. In such cases, we can leverage the purrr
package to streamline the process.
One practical example is the UN Comtrade data, which is partitioned by country and year into thousands of CSV files (at some point I saved those as RDS to save disk space). By using map_df
to efficiently merge multiple datasets without repeating code, we can do the following:
# list Canadian data 1990-1999 canada_files <- list.files("~/Documents/uncomtrade/input/data", pattern = "_CAN_199", full.names = TRUE) head(canada_files) canada <- map_df( canada_files, readRDS ) canada > canada # A tibble: 1,971,197 × 34 dataset_code type_code freq_code ref_period_id ref_year ref_month period <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 20124199001052100 C A 19900101 1990 52 1990 2 20124199001052100 C A 19900101 1990 52 1990 3 20124199001052100 C A 19900101 1990 52 1990 4 20124199001052100 C A 19900101 1990 52 1990 5 20124199001052100 C A 19900101 1990 52 1990 6 20124199001052100 C A 19900101 1990 52 1990 7 20124199001052100 C A 19900101 1990 52 1990 8 20124199001052100 C A 19900101 1990 52 1990 9 20124199001052100 C A 19900101 1990 52 1990 10 20124199001052100 C A 19900101 1990 52 1990 # ℹ 1,971,187 more rows # ℹ 27 more variables: reporter_code <chr>, flow_code <chr>, # partner_code <chr>, partner2code <chr>, classification_search_code <chr>, # classification_code <chr>, is_original_classification <chr>, # cmd_code <chr>, customs_code <chr>, mos_code <chr>, mot_code <chr>, # qty_unit_code <chr>, qty <chr>, is_qty_estimated <chr>, # alt_qty_unit_code <chr>, alt_qty <chr>, is_alt_qty_estimated <chr>, … # ℹ Use `print(n = ...)` to see more rows
Then it is possible to conduct diferent operations, such as total bilateral exports reporting FOB and CIF to see the mismatch between the exporter and the importer:
canada %>% select(partner_code, fobvalue, cifvalue) %>% mutate( fobvalue = as.numeric(fobvalue), cifvalue = as.numeric(cifvalue) ) %>% group_by(partner_code) %>% summarise( fobvalue = sum(fobvalue, na.rm = TRUE), cifvalue = sum(cifvalue, na.rm = TRUE) ) > canada %>% + select(partner_code, fobvalue, cifvalue) %>% + mutate( + fobvalue = as.numeric(fobvalue), + cifvalue = as.numeric(cifvalue) + ) %>% + group_by(partner_code) %>% + summarise( + fobvalue = sum(fobvalue, na.rm = TRUE), + cifvalue = sum(cifvalue, na.rm = TRUE) + ) # A tibble: 219 × 3 partner_code fobvalue cifvalue <chr> <dbl> <dbl> 1 0 1.34e13 0 2 100 1.44e 9 0 3 104 4.18e 8 0 4 108 2.58e 7 0 5 112 1.26e 8 0 6 116 1.77e 8 0 7 12 2.22e10 0 8 120 7.34e 8 0 9 132 6.95e 6 0 10 136 2.32e 8 0 # ℹ 209 more rows # ℹ Use `print(n = ...)` to see more rows
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.