Comparing performances of CSV to RDS, Parquet, and Feather file formats in R

Posted on May 8, 2022 by tomaztsql in R bloggers | 0 Comments

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

From the previous blogpost:
– CSV or alternatives? Exporting data from SQL Server data to ORC, AVRO, Parquet, Feather files and store them into Azure data lake

we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.

Creating a random dataset (10K, 1MIO rows with 30 columns) we have tested the speed of writing and speed of reading from local machine. For the benchmark, we have used package microbenchmark. File formats were: CSV, Feather, Parquet and Rdata (or RDS).

This is the result of running 10 times each operation (writing and reading) for the 10K rows.

The test was performed with R version 4.1.2 and M1 MacOS. For the 10.000 rows (with 30 columns), I have the average CSV size 3,3MiB and Feather and Parquet circa 1,4MiB, and less than 1MiB for RData and rds R format. By far the best reads and write times the tests show a big advantage with Feather file format, following with Parquet and data.table package (reading from CSV file).

Test was conducted using a sample dataframe:

#file names
file_csv <- 'test_df.csv'
file_csv_readr <- 'test_df_readr.csv'
file_csv_datatable <- 'test_df_datatable.csv'
file_feather <- 'test_df.feather'
file_rdata <- 'test_df.RData'
file_rds <- 'test_df.rds'
file_parquet <- 'test_df.parquet'

files <- file.info(c('test_df.csv','test_df_readr.csv','test_df_datatable.csv', 'test_df.feather', 'test_df.RData', 'test_df.rds', 'test_df.parquet'))

test_df <- data.frame(
                replicate(10, sample(0:10000, nof_rows, rep = TRUE)),
                replicate(10, runif(nof_rows, 0.0, 1.0)),
                replicate(10, stri_rand_strings(1000, 10))
                 )

and running benchmark:

benchmark_write <- data.frame(summary(microbenchmark(
          "test_df.csv"     = write.csv(test_df, file = file_csv),
          "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
          "test_df_datatable.csv"     = data.table::fwrite(test_df, file = file_csv_datatable),
          "test_df.feather" = write_feather(test_df, file_feather),
          "test_df.parquet" = write_parquet(test_df, file_parquet),
          "test_df.rds"     = save(test_df, file = file_rdata),
          "test_df.RData"   = saveRDS(test_df, file_rds), 
  times = nof_repeat)))

Same was repeated for reading operation and results were merged:

benchmark_read <- data.frame(summary(microbenchmark(
                            "test_df.csv" = read.csv(file_csv),
                            "test_df_readr.csv" = readr::read_csv(file_csv_readr),
                            "test_df_datatable.csv" = data.table::fread(file_csv_datatable),
                            "test_df.rds"  = load(file_rdata),
                            "test_df.RData" = readRDS(file_rds),
                            "test_df.feather" = arrow::read_feather(file_feather),
                            "test_df.parquet" = read_parquet(file_parquet), times = nof_repeat)))

colnames(benchmark_read) <- c("names", "read_min", "read_lq", "read_mean", "read_median", "read_uq", "read_max", "read_repeat")


#merge results and create factors
results <- inner_join(inner_join(benchmark_read, files, by = "names"), benchmark_write, by = "names")
results <- results[,c("names","size_mb","read_min", "read_max", "read_median","write_min","write_max", "write_median")]

And finally, have plotted the results for the graph above:

ibrary(glue)
library(ggtext)

title_lab_adjusted <- glue::glue(
  "File types comparison on<br><span style = 'color:red;'>read operation</span> and <br><span style='color:darkgreen';>write operation</span>") 

ggplot(results, aes(x=names, y=size_mb)) + 
     geom_bar(stat="identity", fill="lightblue") +
     geom_text(aes(label=paste0(format(round(size_mb, 2), nsmall = 2), " MiB", collapse=NULL)), vjust=-0.3, size=3.5)+
     theme(axis.text.x = element_text(angle = 45, hjust = 1.3)) +
     coord_cartesian(ylim = c(0, 5), expand = F) +
     scale_y_continuous(breaks = seq(0, 5, 1),labels = scales::label_comma(accuracy = 1)) +
    theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(size = 0.1, linetype = "solid", colour = "grey50")) +
    ylab(label = 'Time (sec.) + File_Size') +  xlab(label = 'Files') +
  labs(title = title_lab_adjusted) +
  theme(
    plot.title = element_markdown(),
    panel.background = element_rect(color = NA, fill = 'white')) +

  geom_point (aes(y=write_median/100, group=names),
        col = "darkgreen",
        size = 2,
        stat ="identity",
        alpha=.8) +
  geom_point(aes(y=read_median/100, group=names),
           col = "red",
           size = 2,
           stat ="identity",
           alpha=.8 )

To do a little bit more stress on the CPU, I have created a data.frame with 1Mio rows.

When comparing file size, we can see that CSV files are all around 330MiB in size, RData and RDS around 80MiB and Feather/Parquet around 140MiB. So CSV is already 2x bigger than Parquet and 4x bigger than RData file.

Looking into performance (median for write/read), we can see Feather is by far the most efficient file format. out of 10 runs, reading the complete dataset (1Mio rows), took only 0.4 seconds, whereas CSV was lagging behind significantly. With write operations, Feather’s file format was also the fastest.

Exploring the CSV file format, I have used three different R packages (base, readr and data.table), and data.table outperformed the other two significantly.

To conclude, storing data in feather format will yield performance statistics. If you want to save on storage space, use RData or rds, these two will save the most space. On the other side, if you are stuck with CSV file format, use data.table, as it is by far the fastest for data wrangling and data manipulation tasks.

As always, complete code is available at Github, in the benchmark repository. In the same repository, you will find also python code from the previous blog post, but if you are looking only for code, that is described in this blog post, click here. Follow up for more updates.

Happy R coding! And stay healthy!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Comparing performances of CSV to RDS, Parquet, and Feather file formats in R

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)