sergeant : An R Boot Camp for Apache Drill

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.

Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.

It also supports reading formats such as:

  • Avro
  • [CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
  • Parquet
  • Hadoop Sequence Files

It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.

Why sergeant?

There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.

The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.

The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs it’s own DBI driver (like is second-cousin Preso has) to avoid collisions withy any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.

SQL vs dplyr

For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:

SELECT  gender ,  marital_status , COUNT(*) AS  n 
FROM  cp.`employee.json` 
GROUP BY  gender ,  marital_status

in a drill-embedded or drill-localhost SQL shell? Or:

library(RJDBC)
library(dplyr)
library(sergeant)

ds <- src_drill("localhost:31010", use_zk=FALSE)

db <- tbl(ds, "cp.`employee.json`") 

count(db, gender, marital_status) %>% collect()

(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)

Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).

sergeant and dplyr

The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).

You’ve said “parquet” alot…

Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:

CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts 
  FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;

Kick the tyres

The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.

File an issue with a feature request or bug report.

I expect to have this CRAN-able in January, 2017.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)