Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

March 29, 2016
By

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Wes McKinney, Software Engineer, Cloudera
Hadley Wickham, Chief Scientist, RStudio

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to see if there were some opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model. In both R and Panda’s, data frames are lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Every column can have missing values.

Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data.

In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

  • Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
  • Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
  • High read and write performance. When possible, Feather operations should be bound by local disk performance.

Code examples

The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Analogously, in Python, we have:

import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

How fast is Feather?

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today’s laptop computers. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk.

To give you an idea, here is a Python benchmark writing an approximately 800MB pandas DataFrame to disk:

import feather
import pandas as pd
import numpy as np
arr = np.random.randn(10000000) # 10% nulls
arr[::10] = np.nan
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})
feather.write_dataframe(df, 'test.feather')

On Wes’s laptop (latest-gen Intel processor with SSD), this takes:

In [9]: %time df = feather.read_dataframe('test.feather')
CPU times: user 316 ms, sys: 944 ms, total: 1.26 s
Wall time: 1.26 s

In [11]: 800 / 1.26
Out[11]: 634.9206349206349

This is effective performance of over 600 MB/s. Of course, the performance you see will depend on your hardware configuration.

And in R (on Hadley’s laptop, which is very similar):

library(feather)

x <- runif(1e7)
x[sample(1e7, 1e6)] <- NA # 10% NAs
df <- as.data.frame(replicate(10, x))
write_feather(df, 'test.feather')

system.time(read_feather('test.feather'))
#>   user  system elapsed 
#>  0.731   0.287   1.020 

How can I get Feather?

The Feather source code is hosted at http://github.com/wesm/feather.

Installing Feather for R

Feather is currently available from github, and you can install with:

devtools::install_github("wesm/feather/R")

Feather uses C++11, so if you’re on windows, you’ll need the new gcc 4.93 toolchain. (All going well this will be included in R 3.3.0, which is scheduled for release on April 14. We’ll aim for a CRAN release soon after that).

Installing Feather for Python

For Python, you can install Feather from PyPI like so:

$ pip install feather-format

We will look into providing more installation options, such as conda builds, in the future.

What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Feather, Apache Arrow, and the community

One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google’s Flatbuffers library (github.com/google/flatbuffers) to serialize column types and related metadata in a language-independent way in the file.

The Python interface uses Cython to expose Feather’s C++11 core to users, while the R interface uses Rcpp for the same task.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Mango solutions



plotly webpage

dominolab webpage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training

datasociety

http://www.eoda.de





ODSC

ODSC

CRC R books series





Six Sigma Online Training









Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)