Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The insanely productive elf-lord, @quominus put together a small package (triebeard) that exposes an API for radix/prefix tries at both the R and Rcpp levels. I know he had some personal needs for this and we both kinda need these to augment some functions in our iptools package. Despite triebeard having both a vignette and function-level examples, I thought it might be good to show a real-world use of the package (at least in the cyber real world): fast determination of which autonomous system an IPv4 address is in (if it’s in one at all).

I’m not going to delve to deep into routing (you can find a good primer here and one that puts routing in the context of radix tries here) but there exists, essentially, abbreviated tables of which IP addresses belong to a particular network. These tables are in routers on your local networks and across the internet. Groups of these networks (on the internet) are composed into those autonomous systems I mentioned earlier and these tables are used to get the packets that make up the cat videos you watch routed to you as efficiently as possible.

When dealing with cybersecurity data science, it’s often useful to know which autonomous system an IP address belongs in. The world is indeed full of peril and in it there are many dark places. It’s a dangerous business, going out on the internet and we sometimes find it possible to identify unusually malicious autonomous systems by looking up suspicious IP addresses en masse. These mappings look something like this:

CIDR            ASN
1.0.0.0/24      47872
1.0.4.0/24      56203
1.0.5.0/24      56203
1.0.6.0/24      56203
1.0.7.0/24      38803
1.0.48.0/20     49597
1.0.64.0/18     18144

Each CIDR has a start and end IP address which can ultimately be converted to integers. Now, one could just sequentially compare start and end ranges to see which CIDR an IP address belongs in, but there are (as of the day of this post) 647,563 CIDRs to compare against, which—in the worst case—would mean having to traverse through the entire list to find the match (or discover there is no match). There are some trivial ways to slightly optimize this, but the search times could still be fairly long, especially when you’re trying to match a billion IPv4 addresses to ASNs.

By storing the CIDR mask (the number of bits of the leading IP address specified after the /) in binary form (strings of 1’s and 0’s) as keys for the trie, we get much faster lookups (only a few comparisons at worst-case vs 647,563).

I made an initial, naïve, mostly straight R, implementation as a precursor to a more low-level implementation in Rcpp in our iptools package and to illustrate this use of the triebeard package.

One thing we’ll need is a function to convert an IPv4 address (in long integer form) into a binary character string. We could do this with base R, but it’ll be super-slow and it doesn’t take much effort to create it with an Rcpp inline function:

library(Rcpp)
library(inline)

ip_to_binary_string <- rcpp(signature(x="integer"), "
NumericVector xx(x);

std::vector<double> X(xx.begin(),xx.end());
std::vector<std::string> output(X.size());

for (unsigned int i=0; i<X.size(); i++){

if ((i % 10000) == 0) Rcpp::checkUserInterrupt();

output[i] = std::bitset<32>(X[i]).to_string();

}

return(Rcpp::wrap(output));
")

ip_to_binary_string(ip_to_numeric("192.168.1.1"))
## [1] "11000000101010000000000100000001"

We take a vector from R and use some C++ standard library functions to convert them to bits. I vectorized this in C++ for speed (which is just a fancy way to say I used a for loop). In this case, our short cut will not make for a long delay.

Now, we’ll need a CIDR file. There are historical ones avaialble, and I use one that I generated the day of this post (and, referenced in the code block below). You can use pyasn to make new ones daily (relegating mindless, automated, menial data retrieval tasks to the python goblins, like one should).

library(iptools)
library(stringi)
library(dplyr)
library(purrr)
library(tidyr)

asn_dat_url <- "http://rud.is/dl/asn-20160712.1600.dat.gz"
asn_dat_fil <- basename(asn_dat_url)

rip <- read_tsv(asn_dat_fil, comment=";", col_names=c("cidr", "asn"))
rip %>%

rip_df
## # A tibble: 647,557 x 4
##        <chr> <chr> <int>                    <chr>
## 1    1.0.0.0    24 47872 000000010000000000000000
## 2    1.0.4.0    24 56203 000000010000000000000100
## 3    1.0.5.0    24 56203 000000010000000000000101
## 4    1.0.6.0    24 56203 000000010000000000000110
## 5    1.0.7.0    24 38803 000000010000000000000111
## 6   1.0.48.0    20 49597     00000001000000000011
## 7   1.0.64.0    18 18144       000000010000000001
## 8  1.0.128.0    17  9737        00000001000000001
## 9  1.0.128.0    18  9737       000000010000000010
## 10 1.0.128.0    19  9737      0000000100000000100
## # ... with 647,547 more rows

You can save off that data_frame to an R data file to pull in later (but it’s pretty fast to regenerate).

Now, we create the trie, using the prefix we calculated and a value we’ll piece together for this example:

library(triebeard)

rip_trie <- trie(rip_df$prefix, sprintf("%s/%s|%s", rip_df$ip, rip_df$mask, rip_df$asn))

Yep, that’s it. If you ran this yourself, it should have taken less than 2s on most modern systems to create the nigh 700,000 element trie.

Now, we’ll generate a million random IP addresses and look them up:

set.seed(1492)
data_frame(lkp=ip_random(1000000),
lkp_bin=ip_to_binary_string(ip_to_numeric(lkp)),
long=longest_match(rip_trie, lkp_bin)) -> lkp_df

lkp_df
## # A tibble: 1,000,000 x 3
##               lkp                          lkp_bin                long
##             <chr>                            <chr>               <chr>
## 1   35.251.195.57 00100011111110111100001100111001  35.248.0.0/13|4323
## 2     28.57.78.42 00011100001110010100111000101010                <NA>
## 3   24.60.146.202 00011000001111001001001011001010   24.60.0.0/14|7922
## 4    14.236.36.53 00001110111011000010010000110101                <NA>
## 5   7.146.253.182 00000111100100101111110110110110                <NA>
## 6     2.9.228.172 00000010000010011110010010101100     2.9.0.0/16|3215
## 7  108.111.124.79 01101100011011110111110001001111 108.111.0.0/16|3651
## 8    65.78.24.214 01000001010011100001100011010110   65.78.0.0/19|6079
## 9   50.48.151.239 00110010001100001001011111101111   50.48.0.0/13|5650
## 10  97.231.13.131 01100001111001110000110110000011   97.128.0.0/9|6167
## # ... with 999,990 more rows

On most modern systems, that should have taken less than 3s.

The NA values are not busted lookups. Many IP networks are assigned but not accessible (see this for more info). You can validate this with cymruservices::bulk_origin() on your own, too).

The trie structure for these CIDRs takes up approximately 9MB of RAM, a small price to pay for speedy lookups (and, memory really is not what the heart desires, anyway). Hopefully the triebeard package will help you speed up your own lookups and stay-tuned for a new version of iptools with some new and enhanced functions.