How fast does a compressed file read in?

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers.]
Introduction

I received an email over the weekend pointing out that my last post did not include reading in gz compressed file(s) in the benchmarking. While this was not an oversight per se, it was a good reminder that people would probably be interested in seeing this as well.

Benchmarking is the process of measuring and comparing the performance of different programs, tools, or configurations in order to identify which one is the most efficient for a specific task. It is a critical step in software development that can help developers identify performance bottlenecks and improve the overall performance of their applications.

In this post I create a square matrix, convert it to a data.frame (2,000 rows by 2,000 columns), and save it as a gz compressed csv file. The benchmark compares different R packages and functions, including base R, data.table, vroom, and readr, and measures their relative speeds based on the time it takes to read in the .csv.gz file.

There are several pros to trying things different ways and properly benchmarking them: developers can optimize resource utilization, avoid premature optimization, keep up with technology, and improve the quality of their code.

In conclusion, benchmarking is an essential tool for software developers that helps them identify the most efficient solutions for their applications by measuring the relative speeds of different programs or tools.
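As a small illustration of what "relative speed" means, here is a minimal sketch in base R only. The workload (row sums of a matrix) and the helper `time_it()` are hypothetical stand-ins chosen for this example, not part of the benchmark in this post.

```r
# Toy workload (an assumption for illustration): two ways to get row sums
set.seed(123)
m <- matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)

# Hypothetical helper: run a function `reps` times and return elapsed seconds
time_it <- function(fun, reps = 10) {
  timing <- system.time(for (i in seq_len(reps)) fun())
  unname(timing["elapsed"])
}

elapsed <- c(
  apply_sum = time_it(function() apply(m, 1, sum)),
  rowSums   = time_it(function() rowSums(m))
)

# Relative speed: each elapsed time divided by the fastest elapsed time
relative <- elapsed / min(elapsed)
print(round(relative, 3))
```

The exact numbers will differ from machine to machine, which is why several replications and a relative comparison are usually more informative than a single raw timing.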

Function

The different functions I use in the benchmarking are as follows:

Base R

read.table() and read.csv()

data.table

fread()

vroom

vroom(), called once with its default arguments and once with altrep = TRUE

readr

read_csv()

Example

Let’s make a 2,000 by 2,000 matrix, convert it to a data.frame, save it out as a .csv file, and then compress that into a .gz file.

library(R.utils)

# create a 2,000 x 2,000 matrix of random numbers
my_matrix <- matrix(rnorm(2000 * 2000), nrow = 2000, ncol = 2000) |>
  as.data.frame()

# Make and save gzipped file
write.csv(my_matrix, "my_matrix.csv")
gzip(filename = "my_matrix.csv", destname = "matrix.csv.gz",
     overwrite = FALSE, remove = TRUE)

OK, now that the data is written, we can benchmark the read-in times of the various packages.
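Before timing anything, it can be worth confirming that a gz compressed csv reads back with the same shape and values that were written. The sketch below is an assumption-laden stand-in: it uses a small hypothetical 50 by 10 data.frame and a temporary file, and it swaps in base R's gzfile() connection for the R.utils gzip() step used above, so it runs without extra packages.

```r
# Small stand-in data (hypothetical; the post uses 2,000 x 2,000)
set.seed(42)
small_df <- as.data.frame(matrix(rnorm(50 * 10), nrow = 50, ncol = 10))

# Write a gzip-compressed CSV directly through a gzfile() connection
gz_path <- tempfile(fileext = ".csv.gz")
write.csv(small_df, gzfile(gz_path), row.names = FALSE)

# read.csv() decompresses gzip files transparently
round_trip <- read.csv(gz_path)

# Dimensions and values should survive the round trip
print(dim(round_trip))
print(isTRUE(all.equal(small_df, round_trip)))
```

Note that row.names = FALSE is set here so the file contains only the data columns; the write.csv() call in the post keeps the default, which also writes a row-name column.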

Benchmarking

library(rbenchmark)
library(data.table)
library(readr)
library(vroom)
library(dplyr)

n <- 30

benchmark(
  # Base R
  "read.table" = {
    a <- read.table("matrix.csv.gz", sep = ",")
  },
  "read.csv" = {
    b <- read.csv("matrix.csv.gz", sep = ",")
  },
  
  # data.table
  "fread" = {
    c <- fread("matrix.csv.gz", sep = ",")
  },
  
  # vroom
  # altrep is not set in this call, so vroom() uses its default value
  "vroom alltrep false" = {
    d <- vroom("matrix.csv.gz", delim = ",")
  },
  "vroom alltrep true" = {
    e <- vroom("matrix.csv.gz", delim = ",", altrep = TRUE)
  },
  
  # readr
  "readr" = {
    f <- read_csv("matrix.csv.gz")
  },
  
  # Replications
  replications = n,
  
  # Columns
  columns = c(
    "test","replications","elapsed","relative","user.self","sys.self")
) |>
  arrange(relative)

                 test replications elapsed relative user.self sys.self
1               fread           30   19.44    1.000     13.56     1.59
2  vroom alltrep true           30   22.06    1.135     10.54     2.63
3 vroom alltrep false           30   24.75    1.273     10.22     2.84
4          read.table           30   94.34    4.853     79.02     0.64
5            read.csv           30  143.28    7.370    115.64     0.74
6               readr           30  177.61    9.136     50.37    10.05

Voila!
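To help read the table, the sketch below recomputes the relative column from the elapsed times. It assumes, consistent with the numbers shown, that relative is each test's elapsed time divided by the fastest elapsed time; the per_read column (average seconds per single read) is an extra added here for illustration and is not part of the original output.

```r
# Elapsed seconds for 30 replications, copied from the table above
elapsed <- c(
  "fread"               = 19.44,
  "vroom alltrep true"  = 22.06,
  "vroom alltrep false" = 24.75,
  "read.table"          = 94.34,
  "read.csv"            = 143.28,
  "readr"               = 177.61
)
replications <- 30

# relative = elapsed divided by the fastest elapsed time
relative <- round(elapsed / min(elapsed), 3)

# Average seconds per single read (illustrative extra column)
per_read <- round(elapsed / replications, 2)

print(data.frame(elapsed, relative, per_read))
```

Read this way, fread averaged well under a second per read of the .csv.gz file in this run, while readr's read_csv took roughly nine times as long.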
