Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:
Said video has a full transcript as well.
T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest
:
There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest
S3 method for stats::quantile()
).
This:
install.packages("tdigest", repos="https://cinc.rud.is/")
will install from source or binaries onto your system(s).
Basic Ops
The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):
set.seed(2019-04-03) td <- td_create() for (i in 1:100000) { td_add(td, sample(100, 1), 1) } quantile(td) ## [1] 1.00000 25.62222 53.09883 74.75522 100.00000
More R-like Ops
Vectorisation is the name of the game in R and we can use tdigest()
to work in a vectorised manner:
set.seed(2019-04-03) x <- sample(100, 1000000, replace=TRUE) td <- tdigest(x) quantile(td) ## [1] 1.00000 25.91914 50.79468 74.76439 100.00000
Need for Speed
The t-digest algorithm was designed for both streaming operations and speed. It’s pretty, darned fast:
microbenchmark::microbenchmark( tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)), r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)) ) ## Unit: microseconds ## expr min lq mean median uq max neval ## tdigest 22.81 26.6525 48.70123 53.355 63.31 151.29 100 ## r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50 100
Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.
FIN
This was a quick (but, complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R-side since that would (IMO) make t-digest ops more useful in an R-context, especially since you can merge two different t-digests.
As always, code/PR where you want to and file issues with any desired functionality/enhancements.
Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.