R⁶ Series — Random Sampling From Apache Drill Tables With R & sergeant

December 20, 2017
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

(For first-timers, R⁶ tagged posts are short & sweet with minimal expository; R⁶ feed)

At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.

Here’s how to uniformly sample data from Apache Drill using the sergeant package:

library(sergeant)

db <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")

summarise(tbl, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##          n
##      
## 1 19977415

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.01) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##        n
##    
## 1 199808

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.50) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##         n
##     
## 1 9988797

And, for groups (using a different/larger “database”):

fdns <- tbl(db, "dfs.fdns.`201708`")

summarise(fdns, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##            n
##        
## 1 1895133100

filter(fdns, type %in% c("cname", "txt")) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##       
## 1 cname 15389064
## 2   txt 67576750

filter(fdns, type %in% c("cname", "txt")) %>% 
  group_by(type) %>% 
  mutate(r=rand()) %>% 
  ungroup() %>% 
  filter(r <= 0.15) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##       
## 1 cname  2307604
## 2   txt 10132672

I will (hopefully) be better at cranking these bite-sized posts more frequently in 2018.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)