
File Management With The {fs} Package

[This article was first published on Albert Rapp, and kindly contributed to R-bloggers.]

    As data scientists, we often have to deal with tedious tasks. One of them is interacting with the file system on our computer or on the remote machine we’re working with. Thankfully, the {fs} package has a bunch of convenience functions that make our lives a whole lot easier.

    Let’s check out a few examples. And if videos are more your thing, you can also watch the video version of this blog post on YouTube.

    Assemble paths

    Check out this data set.

    library(tidyverse)
    library(fs)
    original_tib <- tibble(
      dir = c('some/path/blub', 'bla/here/', 'direct/'),
      file_names = c('file_a.csv', 'file_b.csv', 'file_c.txt')
    )
    original_tib
    ## # A tibble: 3 × 2
    ##   dir            file_names
    ##   <chr>          <chr>     
    ## 1 some/path/blub file_a.csv
    ## 2 bla/here/      file_b.csv
    ## 3 direct/        file_c.txt

    Here, assembling a path of the form directory/file_name.ext is tricky because some directories have a trailing / and some don’t. Plain string concatenation with paste0() or glue::glue() would either drop the separator or double it. Thankfully, the path() function from the {fs} package doesn’t care whether a trailing / is there or not.

    original_tib |>
      mutate(path = path(dir, file_names))
    ## # A tibble: 3 × 3
    ##   dir            file_names path                     
    ##   <chr>          <chr>      <fs::path>               
    ## 1 some/path/blub file_a.csv some/path/blub/file_a.csv
    ## 2 bla/here/      file_b.csv bla/here/file_b.csv      
    ## 3 direct/        file_c.txt direct/file_c.txt
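
    To make the trailing-slash problem concrete, here is a minimal sketch that compares plain concatenation with path(). The vectors dirs and files simply repeat the values from the tibble above; the names pasted, pasted_sep and built are only for illustration.

```r
library(fs)

dirs  <- c("some/path/blub", "bla/here/", "direct/")
files <- c("file_a.csv", "file_b.csv", "file_c.txt")

# Without a separator, the first path loses its slash
pasted <- paste0(dirs, files)
pasted[1]      # "some/path/blubfile_a.csv"

# With a separator, the other two paths get a doubled slash
pasted_sep <- paste0(dirs, "/", files)
pasted_sep[2]  # "bla/here//file_b.csv"

# path() joins the pieces with a single separator in every case
built <- path(dirs, files)
built
```

    Either concatenation variant gets at least one row wrong, which is why path() is the more comfortable choice for mixed input like this.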

    Remove and set extensions

    We can also modify file extensions really easily. That’s convenient when we want to read input from CSV files and then turn the data into images that use the same file names.

    original_tib |>
      mutate(
        path = path(dir, file_names),
        out_path = path_ext_set(path, 'png')
      )
    ## # A tibble: 3 × 4
    ##   dir            file_names path                      out_path                 
    ##   <chr>          <chr>      <fs::path>                <fs::path>               
    ## 1 some/path/blub file_a.csv some/path/blub/file_a.csv some/path/blub/file_a.png
    ## 2 bla/here/      file_b.csv bla/here/file_b.csv       bla/here/file_b.png      
    ## 3 direct/        file_c.txt direct/file_c.txt         direct/file_c.png
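
    Besides path_ext_set(), {fs} has a few sibling helpers for taking a path apart. The following is a small sketch on a single made-up path (the variable p is only for illustration); it touches no files, since these functions work on the path string alone.

```r
library(fs)

p <- path("some/path/blub", "file_a.csv")

path_ext(p)             # the extension without the dot
path_ext_remove(p)      # the path without its extension
path_file(p)            # the file name only
path_dir(p)             # the directory only
path_ext_set(p, "png")  # the same path with a new extension
```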

    Get directory info

    You can get information on a directory as a tree in the console. Here, I’m using a directory called raw-input inside my working directory to demonstrate that.

    dir_tree('raw-input')
    ## raw-input
    ## ├── a
    ## │   └── dat.csv
    ## ├── b
    ## │   └── dat.csv
    ## └── c
    ##     └── dat.csv

    You can also get lots of information on these files.

    dir_info('raw-input')
    ## # A tibble: 3 × 18
    ##   path        type    size permissions modification_time   user  group device_id
    ##   <fs::path>  <fct>  <fs:> <fs::perms> <dttm>              <chr> <chr>     <dbl>
    ## 1 raw-input/a direc…    4K rwxrwxr-x   2025-03-29 09:02:24 albe… albe…     66307
    ## 2 raw-input/b direc…    4K rwxrwxr-x   2025-03-29 09:04:33 albe… albe…     66307
    ## 3 raw-input/c direc…    4K rwxrwxr-x   2025-03-29 09:04:35 albe… albe…     66307
    ## # ℹ 10 more variables: hard_links <dbl>, special_device_id <dbl>, inode <dbl>,
    ## #   block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
    ## #   access_time <dttm>, change_time <dttm>, birth_time <dttm>
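
    Since dir_info() returns a data frame, you can subset it like any other table. Here is a minimal sketch that builds a throwaway directory in a temporary location (the root path and the file contents are assumptions made up for this example) and keeps only regular files and a few columns. It uses base R subsetting so that it runs without the tidyverse, but filter() and select() would work just as well.

```r
library(fs)

# Hypothetical scratch directory so the example does not touch real data
root <- path(tempfile("fs-demo-"), "raw-input")
dir_create(path(root, c("a", "b")))
writeLines("col_a,col_b\n1,2", path(root, "a", "dat.csv"))
writeLines("col_a,col_b\n3,4", path(root, "b", "dat.csv"))

info <- dir_info(root, recurse = TRUE)

# Keep regular files only, and just a handful of the 18 columns
files_only <- info[info$type == "file", c("path", "size", "modification_time")]
files_only

# The size column is of class fs_bytes, so it can be compared to fs_bytes("1KB")
small <- files_only[files_only$size < fs_bytes("1KB"), ]
small
```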

    But in a lot of cases, it will probably suffice to just get the file paths.

    dir_ls('raw-input')
    ## raw-input/a raw-input/b raw-input/c

    Note that dir_ls() only lists the top level by default. Set recurse = TRUE to descend into nested directories.

    dir_ls('raw-input', recurse = TRUE)
    ## raw-input/a         raw-input/a/dat.csv raw-input/b         raw-input/b/dat.csv 
    ## raw-input/c         raw-input/c/dat.csv
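
    dir_ls() can also filter what it returns. The sketch below recreates a raw-input-like structure in a temporary location (the root path and the extra notes.txt file are made up for this example) and uses the type and glob arguments to narrow the listing.

```r
library(fs)

# Hypothetical scratch directory that mimics the raw-input structure
root <- path(tempfile("fs-demo-"), "raw-input")
dir_create(path(root, c("a", "b", "c")))
file_create(path(root, c("a", "b", "c"), "dat.csv"))
file_create(path(root, "a", "notes.txt"))

# Top level only (the default)
top_level <- dir_ls(root)

# Everything below root, but regular files only
only_files <- dir_ls(root, recurse = TRUE, type = "file")

# Only paths that match a glob pattern
only_csv <- dir_ls(root, recurse = TRUE, glob = "*.csv")

length(top_level)
length(only_files)
length(only_csv)
```

    The glob argument is a shorthand for simple wildcard patterns; for anything more involved, the regexp argument takes a regular expression instead.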

    Iterate over file paths

    Usually, finding the desired paths is only the first step, and you then want to iterate over them. To do so, you can save the output of dir_ls() in a vector and iterate over it with the map() or walk() function from {purrr}. Here, the function I use inside of walk() will

    • load the data using the specified path,
    • create a ggplot from it, and
    • save the image.

    The tricky part is that I want to save the images in an output directory that mirrors the structure of the raw-input directory. That’s why I also need to create the necessary paths and directories inside the function.

    csv_files <- dir_ls(
      'raw-input',
      recurse = TRUE,
      regexp = '\\.csv$'
    )
    
    csv_files |>
      walk(
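        \(file_path) {
          # Build the scatter plot from the CSV at this path
          plt <- read_csv(file_path) |>
            ggplot(aes(col_a, col_b)) +
            geom_point(size = 10, col = 'dodgerblue4')
    
          # Mirror the input path under output/ and swap the extension
          out_path <- file_path |>
            path_ext_set('.png') |>
            str_replace('^raw-input', 'output')
    
          # Create the target directory if needed, then save this plot
          dir_create(path_dir(out_path))
          ggsave(filename = out_path, plot = plt)
        }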
      )
    ## Rows: 3 Columns: 3
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## dbl (3): col_a, col_b, col_c
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    ## Saving 6 x 4 in image
    ## Rows: 3 Columns: 3
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## dbl (3): col_a, col_b, col_c
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    ## Saving 6 x 4 in image
    ## Rows: 3 Columns: 3
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## dbl (3): col_a, col_b, col_c
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    ## Saving 6 x 4 in image

    Splendid. This should have worked and you can now see the output directory and the plots in the file tree.

    dir_tree()
    ## .
    ## ├── index.qmd
    ## ├── index.rmarkdown
    ## ├── output
    ## │   ├── a
    ## │   │   └── dat.png
    ## │   ├── b
    ## │   │   └── dat.png
    ## │   └── c
    ## │       └── dat.png
    ## └── raw-input
    ##     ├── a
    ##     │   └── dat.csv
    ##     ├── b
    ##     │   └── dat.csv
    ##     └── c
    ##         └── dat.csv
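
    As an alternative to the str_replace() step used above, the output paths can also be built with {fs} functions alone. This is only a sketch of one possible variant, not the approach from the walkthrough: it uses path_rel() to strip the input root and path() to prepend the output root. The variables in_root and out_root are names made up for this example, and the paths are constructed by hand, so nothing has to exist on disk.

```r
library(fs)

in_root  <- "raw-input"
out_root <- "output"

# Same paths as in the walkthrough, built by hand instead of via dir_ls()
csv_files <- path(in_root, c("a", "b", "c"), "dat.csv")

# Path of each file relative to the input root, e.g. a/dat.csv
rel_paths <- path_rel(csv_files, start = in_root)

# Re-root the relative paths under the output directory and swap the extension
out_paths <- path(out_root, rel_paths) |>
  path_ext_set("png")
out_paths
```

    The upside of this variant is that it does not depend on a regular expression matching the start of the path; the downside is one extra step compared to a single str_replace() call.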