Feather: fast, interoperable data import/export for R
Unlike most other statistical software packages, R doesn't have a native data file format. You can certainly import and export data in any number of formats, but there's no native “R data file format”. The closest equivalent is the saveRDS/readRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don't hew to a standardized format (it's essentially a dump of R's in-memory representation of the object), and so you can't read the data with any software other than R.
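As a minimal sketch of that round trip, using only base R and the built-in mtcars data frame (the temporary file path is chosen purely for illustration):

```r
# Round-trip a data frame through R's own serialization format (base R only).
path <- tempfile(fileext = ".rds")   # temporary file, used here only for illustration
saveRDS(mtcars, path)                # serialize the object to disk
restored <- readRDS(path)            # load it back, possibly in a later R session
identical(restored, mtcars)          # TRUE: the object comes back unchanged
```

The file at `path` is only meaningful to R, which is the limitation feather is meant to address.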
The goal of the feather project, a collaboration between Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it's a column-oriented file format, which matches R's internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files, R must spend significant time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.
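The size argument and the column-oriented layout can both be checked from the R console. The following is a small base-R sketch, not part of feather itself; the variable names are made up for illustration:

```r
# Size of one integer: native binary vs. ASCII text (base R only).
x <- 2147483647L                              # largest R integer, ten digits long
binary_bytes <- length(writeBin(x, raw()))    # bytes in native binary form
text_bytes   <- nchar(as.character(x))        # characters needed in a text file
c(binary = binary_bytes, text = text_bytes)   # binary = 4, text = 10

# A data frame is held column by column: a list of equal-length vectors.
is.list(mtcars)                               # TRUE
length(mtcars) == ncol(mtcars)                # TRUE: one list element per column
```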
For users of R 3.3.0 and later, the feather package is now available on CRAN. (Users of older versions of R can install feather from GitHub.) With feather installed, you can read and write R data frames to feather files using simple functions:
library(feather)
write_feather(mtcars, "mtcars.feather")
mtcars2 <- read_feather("mtcars.feather")
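To confirm that the data survive the trip, a check along the following lines may help. This is a sketch that assumes the feather package is installed and skips itself otherwise; it also assumes that row names may not be preserved in the file, so attributes are left out of the comparison:

```r
# Round-trip check; the body runs only if the feather package is installed.
if (requireNamespace("feather", quietly = TRUE)) {
  path <- tempfile(fileext = ".feather")      # temporary file for illustration
  feather::write_feather(mtcars, path)
  mtcars2 <- feather::read_feather(path)

  # Compare shape and column names with the original.
  print(dim(mtcars2))
  print(identical(names(mtcars2), names(mtcars)))

  # Compare values only; attributes such as row names are ignored here
  # (assumption: they may not be stored in the file).
  print(all.equal(as.data.frame(mtcars2), mtcars, check.attributes = FALSE))
}
```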
Better yet, the mtcars.feather file can easily be read into Python, using its feather-format package. This example uses the small built-in mtcars data frame, but you should see a significant performance improvement when working with larger data. Eduardo Ariño de la Rubia performed some benchmarking of feather, and found it to be significantly faster for ingesting data than other popular R functions. The chart below compares using feather, the data.table package, and readRDS to import a 508 MB file of 8.5 million rows and 7 columns:
Feather wasn't the fastest function benchmarked for writing data (data.table's fwrite function generally performed a bit better), but given that you typically read a file more often than you write it, the speedups should be very noticeable in day-to-day data science activities.
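If you want to measure import times on your own machine, a timing harness can be sketched with base R alone. The example below does not reproduce the benchmark above: the data frame is synthetic and far smaller, and it compares only read.csv with readRDS; a feather or data.table reader could be timed in the same way.

```r
# Timing sketch: text (CSV) vs. binary (RDS) import, base R only.
set.seed(1)
n  <- 1e5                                   # illustrative size, much smaller than the benchmark file
df <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

# system.time() reports elapsed wall-clock seconds for each import.
csv_time <- system.time(from_csv <- read.csv(csv_path))[["elapsed"]]
rds_time <- system.time(from_rds <- readRDS(rds_path))[["elapsed"]]
c(csv = csv_time, rds = rds_time)           # timings vary by machine and disk
```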
For more on the feather package, check out its announcement from the RStudio blog linked below.
RStudio blog: Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow