The reticulate package provides a very clean and concise bridge between R and Python, which makes it handy for working with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can create parquet files natively.
Now, you can create parquet files through R with Apache Drill (and I’ll provide an example of that here, too), but you may need to generate such files without having the ability to run Drill.
The Python parquet process is pretty simple, since you can convert a pandas DataFrame directly to a pyarrow Table, which can then be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:
    library(reticulate)

    # import the Python modules we need
    pd <- import("pandas", "pd")
    pa <- import("pyarrow", "pa")
    pq <- import("pyarrow.parquet", "pq")

    # R data frame -> Python object -> pandas DataFrame -> pyarrow Table
    mtcars_py <- r_to_py(mtcars)
    mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
    mtcars_tab <- pa$Table$from_pandas(mtcars_df)

    # write the Table out as a parquet file
    pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))
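A quick way to check the result is to read the file back through the same bridge. This is only a sketch and not part of the workflow above: it assumes reticulate can find a Python installation with pandas and pyarrow, and it writes to a tempfile() rather than ~/Data so that it runs anywhere. read_table() and to_pandas() are regular pyarrow calls.

```r
library(reticulate)

# Assumption: reticulate can find a Python with pandas and pyarrow installed.
pd <- import("pandas")
pa <- import("pyarrow")
pq <- import("pyarrow.parquet")

# Write mtcars to a temporary parquet file, following the same steps as the post.
tmp <- tempfile(fileext = ".parquet")
mtcars_df  <- pd$DataFrame$from_dict(r_to_py(mtcars))
mtcars_tab <- pa$Table$from_pandas(mtcars_df)
pq$write_table(mtcars_tab, tmp)

# Read the file back; reticulate converts the resulting pandas DataFrame
# into an R data frame.
roundtrip <- pq$read_table(tmp)$to_pandas()

# The row count and the original columns should have survived the trip.
stopifnot(nrow(roundtrip) == nrow(mtcars))
stopifnot(all(names(mtcars) %in% names(roundtrip)))
```

If either stopifnot() call fails, something went wrong on the way to or from disk.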
I wouldn’t want to use this approach for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms to create larger parquet structures). Here’s how we’d do the same thing with Drill via the sergeant package:
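    library(sergeant)

    # write a CSV with a header row (.csvh) that Drill can query
    readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")

    dc <- drill_connection("localhost")

    # CREATE TABLE AS: Drill writes parquet by default
    drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")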
Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also includes an index column, since pandas DataFrames always carry an index (ugh). On the other hand, small-ish data frames end up in a single file, whereas the Drill-created ones land in a directory with an additional CRC file (though they are much smaller by default). NOTE: You can pass preserve_index=False to Table.from_pandas to get rid of that icky index.
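From R, that Python keyword argument is simply passed as a named argument. Here is a minimal sketch, assuming reticulate can find a Python installation with pandas and pyarrow; whether (and under what name) the index shows up as a column in the default case can depend on the reticulate and pyarrow versions in play, so the code only prints the column names rather than promising a particular result.

```r
library(reticulate)

# Assumption: reticulate can find a Python with pandas and pyarrow installed.
pd <- import("pandas")
pa <- import("pyarrow")

mtcars_df <- pd$DataFrame$from_dict(r_to_py(mtcars))

# Default behaviour: the pandas index may be carried along as an extra column.
tab_with_index <- pa$Table$from_pandas(mtcars_df)

# preserve_index = FALSE is handed to Python as preserve_index=False.
tab_no_index <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)

# Compare the column names of the two Tables.
print(tab_with_index$column_names)
print(tab_no_index$column_names)
```

The second Table should contain only the original mtcars columns, and it can be handed to pq$write_table() exactly as before.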
It’s fairly efficient even for something like nycflights13::flights, which has ~330K rows and 19 columns:
    library(magrittr) # for %>%

    system.time(
      r_to_py(nycflights13::flights) %>%
        pd$DataFrame$from_dict() %>%
        pa$Table$from_pandas() %>%
        pq$write_table(where = "/tmp/flights.parquet")
    )
    ##    user  system elapsed
    ##   1.285   0.108   1.398
If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.