we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.
We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.
Creating a random dataset (10K, 1MIO rows with 30 columns) we have tested the speed of writing and speed of reading from local machine. For the benchmark, we have used package microbenchmark. File formats were: CSV, Feather, Parquet and Rdata (or RDS).
This is the result of running 10 times each operation (writing and reading) for the 10K rows.
The test was performed with R version 4.1.2 and M1 MacOS. For the 10.000 rows (with 30 columns), I have the average CSV size 3,3MiB and Feather and Parquet circa 1,4MiB, and less than 1MiB for RData and rds R format. By far the best reads and write times the tests show a big advantage with Feather file format, following with Parquet and data.table package (reading from CSV file).
Test was conducted using a sample dataframe:
#file names file_csv <- 'test_df.csv' file_csv_readr <- 'test_df_readr.csv' file_csv_datatable <- 'test_df_datatable.csv' file_feather <- 'test_df.feather' file_rdata <- 'test_df.RData' file_rds <- 'test_df.rds' file_parquet <- 'test_df.parquet' files <- file.info(c('test_df.csv','test_df_readr.csv','test_df_datatable.csv', 'test_df.feather', 'test_df.RData', 'test_df.rds', 'test_df.parquet')) test_df <- data.frame( replicate(10, sample(0:10000, nof_rows, rep = TRUE)), replicate(10, runif(nof_rows, 0.0, 1.0)), replicate(10, stri_rand_strings(1000, 10)) )
and running benchmark:
benchmark_write <- data.frame(summary(microbenchmark( "test_df.csv" = write.csv(test_df, file = file_csv), "test_df_readr.csv" = readr::write_csv(test_df, file = file_csv_readr), "test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable), "test_df.feather" = write_feather(test_df, file_feather), "test_df.parquet" = write_parquet(test_df, file_parquet), "test_df.rds" = save(test_df, file = file_rdata), "test_df.RData" = saveRDS(test_df, file_rds), times = nof_repeat)))
Same was repeated for reading operation and results were merged:
benchmark_read <- data.frame(summary(microbenchmark( "test_df.csv" = read.csv(file_csv), "test_df_readr.csv" = readr::read_csv(file_csv_readr), "test_df_datatable.csv" = data.table::fread(file_csv_datatable), "test_df.rds" = load(file_rdata), "test_df.RData" = readRDS(file_rds), "test_df.feather" = arrow::read_feather(file_feather), "test_df.parquet" = read_parquet(file_parquet), times = nof_repeat))) colnames(benchmark_read) <- c("names", "read_min", "read_lq", "read_mean", "read_median", "read_uq", "read_max", "read_repeat") #merge results and create factors results <- inner_join(inner_join(benchmark_read, files, by = "names"), benchmark_write, by = "names") results <- results[,c("names","size_mb","read_min", "read_max", "read_median","write_min","write_max", "write_median")]
And finally, have plotted the results for the graph above:
ibrary(glue) library(ggtext) title_lab_adjusted <- glue::glue( "File types comparison on<br><span style = 'color:red;'>read operation</span> and <br><span style='color:darkgreen';>write operation</span>") ggplot(results, aes(x=names, y=size_mb)) + geom_bar(stat="identity", fill="lightblue") + geom_text(aes(label=paste0(format(round(size_mb, 2), nsmall = 2), " MiB", collapse=NULL)), vjust=-0.3, size=3.5)+ theme(axis.text.x = element_text(angle = 45, hjust = 1.3)) + coord_cartesian(ylim = c(0, 5), expand = F) + scale_y_continuous(breaks = seq(0, 5, 1),labels = scales::label_comma(accuracy = 1)) + theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(size = 0.1, linetype = "solid", colour = "grey50")) + ylab(label = 'Time (sec.) + File_Size') + xlab(label = 'Files') + labs(title = title_lab_adjusted) + theme( plot.title = element_markdown(), panel.background = element_rect(color = NA, fill = 'white')) + geom_point (aes(y=write_median/100, group=names), col = "darkgreen", size = 2, stat ="identity", alpha=.8) + geom_point(aes(y=read_median/100, group=names), col = "red", size = 2, stat ="identity", alpha=.8 )
To do a little bit more stress on the CPU, I have created a data.frame with 1Mio rows.
When comparing file size, we can see that CSV files are all around 330MiB in size, RData and RDS around 80MiB and Feather/Parquet around 140MiB. So CSV is already 2x bigger than Parquet and 4x bigger than RData file.
Looking into performance (median for write/read), we can see Feather is by far the most efficient file format. out of 10 runs, reading the complete dataset (1Mio rows), took only 0.4 seconds, whereas CSV was lagging behind significantly. With write operations, Feather’s file format was also the fastest.
Exploring the CSV file format, I have used three different R packages (base, readr and data.table), and data.table outperformed the other two significantly.
To conclude, storing data in feather format will yield performance statistics. If you want to save on storage space, use RData or rds, these two will save the most space. On the other side, if you are stuck with CSV file format, use data.table, as it is by far the fastest for data wrangling and data manipulation tasks.
As always, complete code is available at Github, in the benchmark repository. In the same repository, you will find also python code from the previous blog post, but if you are looking only for code, that is described in this blog post, click here. Follow up for more updates.
Happy R coding! And stay healthy!