Converting arbitrarily large CSVs to Parquet with R
In this recent post, we used Polars and DuckDB in Python to convert a large CSV file to Parquet in streaming mode.
Several people have contacted me and asked: "And in R?"
Simple answer: We have DuckDB, and we have several Polars bindings. Here, we use {polars}, which is currently being overhauled into {neopolars}.
So let’s not wait any longer!


Run times are on a Windows system with an Intel i7-13700H CPU.
Generate a 2.2 GB CSV file
We use {data.table} to dump a randomly generated dataset with 100 million rows into a CSV file.
library(data.table)

set.seed(1)
n <- 1e8

df <- data.frame(
  X = sample(letters[1:3], n, TRUE),
  Y = runif(n),
  Z = sample(1:5, n, TRUE)
)
fwrite(df, "data.csv")
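As a quick sanity check (not part of any timing), we can confirm the file size from R:

# Size of the generated CSV on disk, in GB (about 2.2)
file.size("data.csv") / 1e9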
DuckDB
Then, we use DuckDB to run a query against the CSV file and stream the result into Parquet.
Threads and RAM can be set on the fly, which is very convenient. Setting a low memory limit (e.g., 500 MB) will work – try it out!
library(duckdb)

con <- dbConnect(duckdb(config = list(threads = "8", memory_limit = "4GB")))

system.time( # 3.5s
  dbSendQuery(
    con,
    "
    COPY (
      SELECT Y, Z
      FROM 'data.csv'
      WHERE X == 'a'
      ORDER BY Y
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd)
    "
  )
)
# Check
dbGetQuery(con, "SELECT COUNT(*) N FROM 'data.parquet'") # 33329488
dbGetQuery(con, "SELECT * FROM 'data.parquet' LIMIT 5")
# Y Z
# 1 5.355105e-09 4
# 2 9.080395e-09 5
# 3 2.258457e-08 2
# 4 3.445894e-08 2
# 5 6.891787e-08 1
3.5 seconds – wow! The resulting file looks good and is only 125 MB in size.
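By the way, both settings can also be changed later on the open connection. A minimal sketch, using the 500 MB limit suggested above:

# Tighten the resource limits on the open connection
dbExecute(con, "SET threads = 4")
dbExecute(con, "SET memory_limit = '500MB'")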
Polars
Let’s do the same with Polars.
# Sys.setenv(NOT_CRAN = "true")
# install.packages("polars", repos = "https://community.r-multiverse.org")
library(polars)
polars_info()
system.time( # 9s
  (
    pl$scan_csv("data.csv")
    $filter(pl$col("X") == "a")
    $drop("X")
    $sort("Y")
    $sink_parquet("data.parquet", row_group_size = 1e5)
  )
)
# Check
pl$scan_parquet("data.parquet")$head()$collect()
# shape: (5, 2)
# ┌───────────┬─────┐
# │ Y ┆ Z │
# │ --- ┆ --- │
# │ f64 ┆ i64 │
# ╞═══════════╪═════╡
# │ 5.3551e-9 ┆ 4 │
# │ 9.0804e-9 ┆ 5 │
# │ 2.2585e-8 ┆ 2 │
# │ 3.4459e-8 ┆ 2 │
# │ 6.8918e-8 ┆ 1 │
# └───────────┴─────┘
At nine seconds, it is slower than DuckDB, but the output looks as expected and has the same size as the DuckDB file.
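Reading the Parquet file back lazily works the same way, and Polars should be able to push a filter down into the scan. A minimal sketch (the filter on Z is just for illustration):

# Lazily read the Parquet file back, filtering during the scan
pl$scan_parquet("data.parquet")$filter(pl$col("Z") == 1)$head()$collect()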
Final words
- With DuckDB or Polars, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
- We can apply filters, selects, sorts, etc. on the fly.
- Let’s keep an eye on Polars in R. It looks really interesting.