Continuing the blog’s UDF theme of late, there are two new UDF kids in town:
drill-url-toolsfor slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)
drill-domain-toolsfor slicing & dicing internet domain names (IDNs).
Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a
parse_url() function already?” My answer is “Sure, but it’s based on
java.net.URL which is fundamentally broken.”
Slicing & dicing URLs and IDNs is a large part of the
$DAYJOB and they go together pretty well, hence the joint UDF release.
Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.
Counting Lying Lock Icons
SSL/TLS is all the craze these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).
If you go to the aforementioned URL it instructs us that the most current GFG dataset URL can be retrieved by inspecting the contents of this metadata URL
There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.
Let’s get the data:
library(sergeant) library(tidyverse) read_delim( file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", delim = " ", col_names = FALSE, col_types = "ccc" ) -> gfg_update dl_path
Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It’s a tab separated file (TSV) which — while not a great data format — is workable with Drill.
Now we’ll setup a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R &
dbplyr. We’ll use the second column since a significant percentage of the URLs in column 6 are malformed:
db gfg_df gfg_df ## # Database: DrillConnection ## hostname port host subdomain assigned tld ## ## 1 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 2 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 3 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 4 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 5 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 6 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 7 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 8 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 9 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 10 www 80 www.eestikirik.ee NA eestikirik.ee ee ## # ... with more rows
While we could have done it all in SQL, we saved some bits for R:
distinct(gfg_df, assigned, port) %>% count(port) %>% collect() -> port_counts port_counts # A tibble: 2 x 2 port n * 1 80 20648 2 443 22178
You’d think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).
Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R’s
urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.
I’m not 100% settled on the
galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.
Kick the tyres & file issues & PRs as necessary.