Site icon R-bloggers

Efficiently merging multiple datasets

[This article was first published on https://pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html> < charset="utf-8"> < http-equiv="X-UA-Compatible" content="IE=edge"> < name="viewport" content="width=device-width, initial-scale=1.0"> pacha.dev/blog < !-- MathJax Configuration --> < !-- Smart header: libraries detected based on content --> < !-- File: /tmp/tmp.W1GdSunucG/index.html -->
  • < !-- DEBUG: Found sourceCode --> < !-- Load custom CSS after any library CSS to ensure proper precedence -->
  • < header class="site-top">

    Mauricio “Pachá” Vargas Sepúlveda

    Blog with notes about R, Shiny, SQL, Python, Linux and C++. This blog is listed on R-Bloggers.

    HOME 🏠
    < !-- categories are printed below this--> < nav class="sidebar-nav">

    Categories

    < header id="title-block-header" class="quarto-title-block default">

    Efficiently merging multiple datasets

    Using purrr to avoid repeating code.
    Author

    Mauricio “Pachá” Vargas S.

    Published

    August 31, 2025

    Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee that shall be used to continue my Open Source efforts. If you need an R package or Shiny dashboard for your team, you can email me or inquiry on Fiverr. The full explanation is here: A Personal Message from an Open Source Contributor

    You can send me questions for the blog using this form and subscribe to receive an email when there is a new post.

    Updated 2025-09-05: I removed the examples with the curly bracket operator. Those were incorrect. Thanks a lot to Kent Johnson for pointing that out.

    I tend to observe that my students repeat a lot of code when we have partitioned data across different tables. For example, when we have to read multiple CSV files that result in a structure similar to the following tables created by intentionally subsetting the original Palmer’s Penguins dataset:

    library(palmerpenguins)
    library(dplyr)
    library(purrr)
    
    species <- penguins %>%
      distinct(species) %>%
      pull()
    
    map(
      species,
      function(s) {
        assign(
          paste0("penguins_", tolower(s)),
          filter(penguins, species == s),
          envir = .GlobalEnv
        )
      }
    )
    [[1]]
    # A tibble: 152 × 8
       species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
     1 Adelie  Torgersen           39.1          18.7               181        3750
     2 Adelie  Torgersen           39.5          17.4               186        3800
     3 Adelie  Torgersen           40.3          18                 195        3250
     4 Adelie  Torgersen           NA            NA                  NA          NA
     5 Adelie  Torgersen           36.7          19.3               193        3450
     6 Adelie  Torgersen           39.3          20.6               190        3650
     7 Adelie  Torgersen           38.9          17.8               181        3625
     8 Adelie  Torgersen           39.2          19.6               195        4675
     9 Adelie  Torgersen           34.1          18.1               193        3475
    10 Adelie  Torgersen           42            20.2               190        4250
    # ℹ 142 more rows
    # ℹ 2 more variables: sex <fct>, year <int>
    
    [[2]]
    # A tibble: 124 × 8
       species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
     1 Gentoo  Biscoe           46.1          13.2               211        4500
     2 Gentoo  Biscoe           50            16.3               230        5700
     3 Gentoo  Biscoe           48.7          14.1               210        4450
     4 Gentoo  Biscoe           50            15.2               218        5700
     5 Gentoo  Biscoe           47.6          14.5               215        5400
     6 Gentoo  Biscoe           46.5          13.5               210        4550
     7 Gentoo  Biscoe           45.4          14.6               211        4800
     8 Gentoo  Biscoe           46.7          15.3               219        5200
     9 Gentoo  Biscoe           43.3          13.4               209        4400
    10 Gentoo  Biscoe           46.8          15.4               215        5150
    # ℹ 114 more rows
    # ℹ 2 more variables: sex <fct>, year <int>
    
    [[3]]
    # A tibble: 68 × 8
       species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
     1 Chinstrap Dream            46.5          17.9               192        3500
     2 Chinstrap Dream            50            19.5               196        3900
     3 Chinstrap Dream            51.3          19.2               193        3650
     4 Chinstrap Dream            45.4          18.7               188        3525
     5 Chinstrap Dream            52.7          19.8               197        3725
     6 Chinstrap Dream            45.2          17.8               198        3950
     7 Chinstrap Dream            46.1          18.2               178        3250
     8 Chinstrap Dream            51.3          18.2               197        3750
     9 Chinstrap Dream            46            18.9               195        4150
    10 Chinstrap Dream            51.3          19.9               198        3700
    # ℹ 58 more rows
    # ℹ 2 more variables: sex <fct>, year <int>

    One common operation I have seen when my students use dplyr is a variation of this:

    penguins_adelie %>%
      bind_rows(penguins_chinstrap) %>% 
      bind_rows(penguins_gentoo)
    # A tibble: 344 × 8
       species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
     1 Adelie  Torgersen           39.1          18.7               181        3750
     2 Adelie  Torgersen           39.5          17.4               186        3800
     3 Adelie  Torgersen           40.3          18                 195        3250
     4 Adelie  Torgersen           NA            NA                  NA          NA
     5 Adelie  Torgersen           36.7          19.3               193        3450
     6 Adelie  Torgersen           39.3          20.6               190        3650
     7 Adelie  Torgersen           38.9          17.8               181        3625
     8 Adelie  Torgersen           39.2          19.6               195        4675
     9 Adelie  Torgersen           34.1          18.1               193        3475
    10 Adelie  Torgersen           42            20.2               190        4250
    # ℹ 334 more rows
    # ℹ 2 more variables: sex <fct>, year <int>
    bind_rows(penguins_adelie, penguins_chinstrap, penguins_gentoo)
    # A tibble: 344 × 8
       species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
     1 Adelie  Torgersen           39.1          18.7               181        3750
     2 Adelie  Torgersen           39.5          17.4               186        3800
     3 Adelie  Torgersen           40.3          18                 195        3250
     4 Adelie  Torgersen           NA            NA                  NA          NA
     5 Adelie  Torgersen           36.7          19.3               193        3450
     6 Adelie  Torgersen           39.3          20.6               190        3650
     7 Adelie  Torgersen           38.9          17.8               181        3625
     8 Adelie  Torgersen           39.2          19.6               195        4675
     9 Adelie  Torgersen           34.1          18.1               193        3475
    10 Adelie  Torgersen           42            20.2               190        4250
    # ℹ 334 more rows
    # ℹ 2 more variables: sex <fct>, year <int>

    A similar pattern appears when they decide to use base R with something like this

    rbind(penguins_adelie, penguins_chinstrap, penguins_gentoo)
    # A tibble: 344 × 8
       species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
       <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
     1 Adelie  Torgersen           39.1          18.7               181        3750
     2 Adelie  Torgersen           39.5          17.4               186        3800
     3 Adelie  Torgersen           40.3          18                 195        3250
     4 Adelie  Torgersen           NA            NA                  NA          NA
     5 Adelie  Torgersen           36.7          19.3               193        3450
     6 Adelie  Torgersen           39.3          20.6               190        3650
     7 Adelie  Torgersen           38.9          17.8               181        3625
     8 Adelie  Torgersen           39.2          19.6               195        4675
     9 Adelie  Torgersen           34.1          18.1               193        3475
    10 Adelie  Torgersen           42            20.2               190        4250
    # ℹ 334 more rows
    # ℹ 2 more variables: sex <fct>, year <int>

    The previous examples quickly become impractical when dealing with a large number of datasets. In such cases, we can leverage the purrr package to streamline the process.

    One practical example is the UN Comtrade data, which is partitioned by country and year into thousands of CSV files (at some point I saved those as RDS to save disk space). By using map_df to efficiently merge multiple datasets without repeating code, we can do the following:

    # list Canadian data 1990-1999
    canada_files <- list.files("~/Documents/uncomtrade/input/data", pattern = "_CAN_199",
      full.names = TRUE)
    
    head(canada_files)
    
    canada <- map_df(
      canada_files,
      readRDS
    )
    
    canada
    
    > canada
    # A tibble: 1,971,197 × 34
       dataset_code      type_code freq_code ref_period_id ref_year ref_month period
       <chr>             <chr>     <chr>     <chr>         <chr>    <chr>     <chr> 
     1 20124199001052100 C         A         19900101      1990     52        1990  
     2 20124199001052100 C         A         19900101      1990     52        1990  
     3 20124199001052100 C         A         19900101      1990     52        1990  
     4 20124199001052100 C         A         19900101      1990     52        1990  
     5 20124199001052100 C         A         19900101      1990     52        1990  
     6 20124199001052100 C         A         19900101      1990     52        1990  
     7 20124199001052100 C         A         19900101      1990     52        1990  
     8 20124199001052100 C         A         19900101      1990     52        1990  
     9 20124199001052100 C         A         19900101      1990     52        1990  
    10 20124199001052100 C         A         19900101      1990     52        1990  
    # ℹ 1,971,187 more rows
    # ℹ 27 more variables: reporter_code <chr>, flow_code <chr>,
    #   partner_code <chr>, partner2code <chr>, classification_search_code <chr>,
    #   classification_code <chr>, is_original_classification <chr>,
    #   cmd_code <chr>, customs_code <chr>, mos_code <chr>, mot_code <chr>,
    #   qty_unit_code <chr>, qty <chr>, is_qty_estimated <chr>,
    #   alt_qty_unit_code <chr>, alt_qty <chr>, is_alt_qty_estimated <chr>, …
    # ℹ Use `print(n = ...)` to see more rows

    Then it is possible to conduct diferent operations, such as total bilateral exports reporting FOB and CIF to see the mismatch between the exporter and the importer:

    canada %>%
      mutate(
        fobvalue = as.numeric(fobvalue),
        cifvalue = as.numeric(cifvalue)
      ) %>%
      group_by(ref_year, partner_code) %>%
      summarise(
        fobvalue = sum(fobvalue, na.rm = TRUE),
        cifvalue = sum(cifvalue, na.rm = TRUE)
      )
    
    > canada %>%
    +   mutate(
    +     fobvalue = as.numeric(fobvalue),
    +     cifvalue = as.numeric(cifvalue)
    +   ) %>%
    +   group_by(ref_year, partner_code) %>%
    +   summarise(
    +     fobvalue = sum(fobvalue, na.rm = TRUE),
    +     cifvalue = sum(cifvalue, na.rm = TRUE)
    +   )
    # A tibble: 1,982 × 4
    # Groups:   ref_year [10]
       ref_year partner_code     fobvalue cifvalue
       <chr>    <chr>               <dbl>    <dbl>
     1 1990     0            973632412956        0
     2 1990     100             128401820        0
     3 1990     104              14749536        0
     4 1990     108               1677908        0
     5 1990     116                 62152        0
     6 1990     12             1210977900        0
     7 1990     120             113571240        0
     8 1990     132                882144        0
     9 1990     136               9460544        0
    10 1990     140                166824        0
    # ℹ 1,972 more rows
    # ℹ Use `print(n = ...)` to see more rows
    < footer>

    Loading…

  • < !-- Load shared sidebar -->
    To leave a comment for the author, please follow the link and comment on their blog: https://pacha.dev/blog.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Exit mobile version