Reading PCAP Files with Apache Drill and the sergeant R Package

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from Parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. It further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0, and one feature they’ve added is the ability to query individual PCAP files, or entire directories full of them, from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR is that PCAP is an optimized binary file format for recording network activity. If you’re on macOS or a Linux-ish system, go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it runs, the larger the capture file gets.
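To make "binary file format" a bit more concrete, here is a minimal base R sketch that reads the 24-byte global header at the start of a classic PCAP file. The helper name `read_pcap_header()` is made up for illustration, and the demo writes a synthetic header to a temp file rather than touching a real capture. It only handles the classic microsecond-resolution format and stops at the header; the packet records that follow are where the real work is.

```r
# Parse the 24-byte global header at the start of a classic pcap file.
# `read_pcap_header()` is a hypothetical helper, not part of any package.
read_pcap_header <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))

  # The magic number 0xa1b2c3d4 reveals the byte order used by the writer
  magic  <- readBin(con, what = "raw", n = 4)
  little <- as.raw(c(0xd4, 0xc3, 0xb2, 0xa1))
  big    <- as.raw(c(0xa1, 0xb2, 0xc3, 0xd4))
  if (identical(magic, little)) {
    endian <- "little"
  } else if (identical(magic, big)) {
    endian <- "big"
  } else {
    stop("not a classic (microsecond-resolution) pcap file")
  }

  u16 <- function() readBin(con, "integer", n = 1, size = 2, signed = FALSE, endian = endian)
  i32 <- function() readBin(con, "integer", n = 1, size = 4, endian = endian)

  # Fields are read in file order
  version_major <- u16()
  version_minor <- u16()
  thiszone      <- i32()  # GMT offset, normally 0
  sigfigs       <- i32()  # timestamp accuracy, normally 0
  snaplen       <- i32()  # max bytes captured per packet
  linktype      <- i32()  # 1 means Ethernet

  list(version_major = version_major, version_minor = version_minor,
       thiszone = thiszone, sigfigs = sigfigs,
       snaplen = snaplen, linktype = linktype)
}

# Demo against a synthetic header (values chosen for illustration only)
demo <- tempfile(fileext = ".pcap")
out <- file(demo, "wb")
writeBin(as.raw(c(0xd4, 0xc3, 0xb2, 0xa1)), out)                    # magic
writeBin(c(2L, 4L), out, size = 2, endian = "little")               # version 2.4
writeBin(c(0L, 0L, 262144L, 1L), out, size = 4, endian = "little")  # zone, sigfigs, snaplen, linktype
close(out)

hdr <- read_pcap_header(demo)
str(hdr)
```

Pointing `read_pcap_header()` at a real capture such as capture01.pcap should work the same way, assuming tcpdump wrote the classic format rather than pcapng.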

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark, and there are many C[++] libraries and other utilities that can read these files. You can even convert them to CSV or XML, but the PCAP format itself requires custom tools to work with effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:

library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

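# Connect to a Drill instance running on this machine
db <- src_drill("localhost")

# Reference the capture file through the `caps` workspace of the dfs storage plugin
my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")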

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

# Capture duration: compute min/max in Drill, then subtract locally
summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>% 
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>% 
  count(dst_ip, sort=TRUE) %>% 
  collect() -> dst_ips

# Drop RFC 1918 private destinations, then look up the origin AS for each remaining IP
filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), by = c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out), and I set up a directory that defaults to the pcap type. But after that, it just works.
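For reference, a trimmed dfs storage plugin configuration along those lines might look like the fragment below. The workspace name `caps` is inferred from the `dfs.caps` prefix used in the queries, and the `location` path is a placeholder, so treat both as assumptions. Any workspaces and formats already present in your dfs configuration stay alongside these entries.

```json
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "caps": {
      "location": "/path/to/your/pcaps",
      "writable": true,
      "defaultInputFormat": "pcap"
    }
  },
  "formats": {
    "pcap": { "type": "pcap" }
  }
}
```

With `defaultInputFormat` set to `pcap`, files in that directory should be read as PCAP regardless of their extension.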

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill proper). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocols & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.
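As a rough sketch of the parquet idea, the snippet below builds a Drill `CREATE TABLE AS` statement that copies the connection-level columns (everything except `data`) out of a capture. The helper `pcap_ctas_sql()`, the target workspace `dfs.tmp` and the table name are all assumptions for illustration; it also assumes Drill's `store.format` option is still at its default of parquet. The query is only sent if you flip `run_against_drill`, since that needs a running Drill instance.

```r
# Build a CREATE TABLE AS statement that copies connection metadata
# (no raw `data` payload) from a PCAP file into a new table.
# `pcap_ctas_sql()` is a hypothetical helper; `dfs.tmp` is assumed to be
# a writable workspace. `timestamp` and `type` are backtick-quoted to
# avoid clashes with SQL keywords.
pcap_ctas_sql <- function(pcap_path, out_table) {
  sprintf(
    paste(
      "CREATE TABLE dfs.tmp.`%s` AS",
      "SELECT `timestamp`, `type`, src_ip, src_port, dst_ip, dst_port, packet_length",
      "FROM dfs.caps.`%s`",
      sep = "\n"
    ),
    out_table, pcap_path
  )
}

sql <- pcap_ctas_sql("/capture02.pcap", "capture02_summary")
cat(sql, "\n")

# Only attempted when explicitly switched on (requires a running Drill)
run_against_drill <- FALSE
if (run_against_drill) {
  library(sergeant)
  dc <- drill_connection("localhost")
  drill_query(dc, sql)
}
```

Once the table exists, it can be queried with `tbl()` like any other Drill source, and repeated queries no longer have to re-parse the original capture.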

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool, and many thanks go out to the Drill dev team for sneaking this into 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.
