R⁶ — Reticulating Parquet Files

August 1, 2017
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet files directly from R using reticulate as a bridge to the pyarrow module, which has the ability to natively create parquet files.

Now, you can create parquet files through R with Apache Drill — and, I’ll provide another example for that here — but, you may have need to generate such files and not have the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")
dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also has an index column since they’re needed in Python DataFrames (ugh), but small-ish data frames will end up in a single file whereas the Drill created ones will be in a directory with an additional CRC file (and, much smaller by default). NOTE: You can use preserve_index=False on the call to Table.from_pandas to get rid of that icky index.

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>% 
  pd$DataFrame$from_dict() %>% 
  pa$Table$from_pandas() %>% 
  pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed 
##   1.285   0.108   1.398 

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)