Reading PCAP Files with Apache Drill and the sergeant R Package

July 27, 2017

(This article was first published on R –, and kindly contributed to R-bloggers)

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hie, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. However, it further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual and entire directories full of PCAP files from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that it’s an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it goes the larger it gets.

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with them effectively. I had started creating crafter to work with these files but my use case/project dried up and haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:


db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

## Observations: 25
## Variables: 12
## $ src_ip           "", "", "192.168.10...
## $ src_port         60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session      -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length    129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data             "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address  "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port         443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type             "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip           "", "", "54.159.166...
## $ dst_mac_address  "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp        2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and, I setup a directory that defaults to the pcap type. But after that, it just works.

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill-proper, either). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocols & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)