Understanding the Parquet file format

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.

Key features of parquet are:

  • it’s cross platform
  • it’s a recognised file format used by many systems
  • it stores data in a column layout
  • it stores metadata

The latter two points allow for efficient storage and querying of data.

Column Storage

Suppose we have a simple data frame:

tibble::tibble(id = 1:3, 
              name = c("n1", "n2", "n3"), 
              age = c(20, 35, 62))
#> # A tibble: 3 × 3
#>      id name    age
#>   <int> <chr> <dbl>
#> 1     1 n1       20
#> 2     2 n2       35
#> 3     3 n3       62

If we stored this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format. This is row storage. This is efficient for file queries such as,

SELECT * FROM table_name WHERE id == 2

We simply go to the 2nd row and retrieve that data. It’s also very easy to append rows to the data set – we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row is related to age, and extract that value.

Parquet uses column storage. In column layouts, column data are stored sequentially.

1 2 3
n1 n2 n3
20 35 62

With this layout, queries such as

SELECT * FROM dd WHERE id == 2

are now inconvenient. But if we want to sum up all ages, we simply go to the third row and add up the numbers.

Reading and writing parquet files

In R, we read and write parquet files using the {arrow} package.

# install.packages("arrow")
#> [1] '5.0.0'

To create a parquet file, we use write_parquet()

# Use the penguins data set
data(penguins, package = "palmerpenguins")
# Create a temporary file for the output
parquet = tempfile(fileext = ".parquet")
write_parquet(penguins, sink = parquet)

To read the file, we use read_parquet(). One of the benefits of using parquet, is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:

  • File compression. This is specified via the compression argument in write_parquet(). The default is snappy.
  • Clever storage of values (the next section).

Do you use RStudio Pro? If so, checkout our our managed RStudio services

Parquet Encoding

Since parquet uses column storage, values of the same type are number stored together. This opens up a whole world of optimisation tricks that aren’t available when we save data as rows, e.g. CSV files.

Run length encoding

Suppose a column just contains a single value repeated on every row. Instead of storing the same number over and over (as a CSV file would), we can just record “value X repeated N times”. This means that even when N gets very large, the storage costs remain small. If we had more than one value in a column, then we can use a simple look-up table. In parquet, this is known as run length encoding. If we have the following column

c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)
#>  [1] 4 4 4 4 4 1 2 2 2 2

This would be stored as

  • value 4, repeated 5 times
  • value 1, repeated once
  • value 2, reported 4 times

To see this in action, lets create a simple example, where the character A is repeated multiple times in a data frame column:

x = data.frame(x = rep("A", 1e6))

We can then create a couple of temporary files for our experiment

parquet = tempfile(fileext = ".parquet")
csv = tempfile(fileext = ".csv")

and write the data to the files

arrow::write_parquet(x, sink = parquet, compression = "uncompressed")
readr::write_csv(x, file = csv)

Using the {fs} package, we extract the size

# Could also use file.info()
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1        1015
#> 2       1.91M

We see that the parquet file is tiny, whereas the CSV file is almost 2MB. This is actually a 500 fold reduction in file space.

Dictionary encoding

Suppose we had the following character vector

c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers")
#> [1] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"

If we want to save storage, then we could replace Jumping Rivers with the number 0 and have a table to map between 0 and Jumping Rivers. This would significantly reduce storage, especially for long vectors.

x = data.frame(x = rep("Jumping Rivers", 1e6))
arrow::write_parquet(x, sink = parquet)
readr::write_csv(x, file = csv)
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1       1.09K
#> 2      14.31M

Delta encoding

This encoding is typically used in conjunction with timestamps. Times are typically stored as Unix times, which is the number of seconds that have elapsed since January 1st, 1970. This storage format isn’t particularly helpful for humans, so typically it is pretty-printed to make it more palatable for us. For example,

(time = Sys.time())
#> [1] "2021-09-21 17:05:08 BST"
#> [1] 1632240309

If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing

c(1628426074, 1628426078, 1628426080)
#> [1] 1628426074 1628426078 1628426080

we would store

c(0, 4, 6)
#> [1] 0 4 6

with the corresponding offset 1628426074.

Other encodings

There are a few other tricks that parquet uses. Their GitHub page gives a complete overview.

If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn’t trivial and does take some time.

Feather vs Parquet

The obvious question that comes to mind when discussing parquet, is how does it compare to the feather format. Feather is optimised for speed, whereas parquet is optimised for storage. It’s also worth noting, that the Apache Arrow file format is feather.

Parquet vs RDS Formats

The RDS file format used by readRDS()/saveRDS() and load()/save(). It is file format native to R and can only be read by R. The main benefit of using RDS is that it can store any R object – environments, lists, and functions.

If we are solely interested in rectangular data structures, e.g. data frames, then reasons for using RDS files are

  • the file format has been around for a long time and isn’t likely to change. This means it is backwards compatible
  • it doesn’t depend on any external packages; just base R.

The advantages of using parquet are

  • the file size of parquet files are slightly smaller. If you want to compare file sizes, make sure you set compression = "gzip" in write_parquet() for a fair comparison.
  • parquet files are cross platform
  • in my experiments, parquet files, as you would expect, are slightly smaller. For some use cases, an additional saving of 5% may be worth it. But, as always, it depends on your particular use cases.


For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)