Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")


will install from source or binaries onto your system(s).

### Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
}

quantile(td)
## [1]   1.00000  25.62222  53.09883  74.75522 100.00000


### More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1]   1.00000  25.91914  50.79468  74.76439 100.00000


### Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty, darned fast:

microbenchmark::microbenchmark(
tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq        mean    median       uq       max neval
##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100
##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100


Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.

### FIN

This was a quick (but, complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R-side since that would (IMO) make t-digest ops more useful in an R-context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!