# Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package

**R – rud.is**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named `tdigest`

:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a `tdigest`

S3 method for `stats::quantile()`

).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

### Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03) td <- td_create() for (i in 1:100000) { td_add(td, sample(100, 1), 1) } quantile(td) ## [1] 1.00000 25.62222 53.09883 74.75522 100.00000

### More R-like Ops

Vectorisation is the name of the game in R and we can use `tdigest()`

to work in a vectorised manner:

set.seed(2019-04-03) x <- sample(100, 1000000, replace=TRUE) td <- tdigest(x) quantile(td) ## [1] 1.00000 25.91914 50.79468 74.76439 100.00000

### Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty, darned fast:

microbenchmark::microbenchmark( tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)), r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)) ) ## Unit: microseconds ## expr min lq mean median uq max neval ## tdigest 22.81 26.6525 48.70123 53.355 63.31 151.29 100 ## r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50 100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.

### FIN

This was a quick (but, complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R-side since that would (IMO) make t-digest ops more useful in an R-context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!

**leave a comment**for the author, please follow the link and comment on their blog:

**R – rud.is**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.