[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I just came across a really interesting and simple algorithm for estimating the number of distinct elements in a stream of data. The paper (Chakraborty et al. 2023) is available on arXiv; see this Quanta article (Reference 2) for a layman’s explanation.

Problem statement

Let’s state the problem formally. Let’s say we are given a stream $\mathcal{A} = \langle a_1, \dots, a_m \rangle$ where each $a_i \in [n]$, and let $\mathsf{F}_0(\mathcal{A})$ denote the number of distinct elements of $\mathcal{A}$. For any input parameters $(\epsilon, \delta)$, we want an $(\epsilon, \delta)$-approximation of $\mathsf{F}_0(\mathcal{A})$, i.e. a number $c$ such that

\begin{aligned} \mathbb{P} \left\{ (1-\epsilon) \cdot \mathsf{F}_0(\mathcal{A}) \leq c \leq (1+\epsilon) \cdot \mathsf{F}_0(\mathcal{A}) \right\} \geq 1 - \delta. \end{aligned}

A naive solution would be to maintain a buffer $\mathcal{X}$ which keeps track of all distinct elements seen so far. As we inspect each new item, we check if the item is in $\mathcal{X}$: if it’s there we discard the new item, and if it’s not we add it to $\mathcal{X}$. After going through the $m$ items, the size of $\mathcal{X}$ is exactly the number of distinct items. However, it has bad worst case memory and checking whether an element is in $\mathcal{X}$ can be costly if $\mathcal{X}$ is large (e.g. think of counting the number of distinct members visiting google.com today).

Algorithm

There is a long line of work seeking to do better than the naive solution (see Bibliographic Remarks in Reference 1); Chakraborty et al. claims to be the first state-of-the-art algorithm whose proof requires just basic probability theory. The algorithm is easy enough to be reproduced here; see Reference 2 for an excellent walkthrough of an example run of the algorithm.

At a high level, the algorithm maintains a buffer $\mathcal{X}$ of seen items. The algorithm proceeds in rounds. In each round, we check whether each incoming item is in $\mathcal{X}$ or not. If it is not in $\mathcal{X}$, add it with probability $p$. If it is in $\mathcal{X}$, keep it in with probability $p$. $p$ is made smaller and smaller as we go through more rounds. Whenever the buffer reaches its threshold number of items (threshold fixed throughout the algorithm), each element is removed from it with probability $0.5$. When we run out of elements, our estimate of the number of distinct items is $|\mathcal{X}| / p$.

Here is the theorem for this algorithm:

Theorem. For any data stream $\mathcal{A}$ and any $0, \epsilon, \delta < 1$, the algorithm above outputs an $(\epsilon, \delta)$-approximation of $\mathsf{F}_0(\mathcal{A})$. It uses $O(\frac{1}{\epsilon^2} \cdot \log n \cdot (\log m + \log \frac{1}{\delta}))$ space in the worst case.

The proof is not too long (about 3 pages) and while requiring some restatement of the problem via alternate algorithms, the most advanced tool it uses is Chernoff’s bound.

To obtain an $(\epsilon, \delta)$-approximation for $m$ items, the algorithm requires a buffer of size $\left\lceil \frac{12}{\epsilon}^2 \log \left( \frac{8m}{\delta}\right) \right\rceil$. This grows logarithmically with $m$. If we want our approximation to be twice as accurate, we need the buffer to be 4 times as large ($\epsilon$). In practice the buffer can probably be much smaller: the authors are only interested in big-O type dependencies and so made the constants bigger wherever it made the proof simpler. (It’s worth noting that the algorithm gives the exact number of distinct items if the buffer happens to be larger than the true number of distinct items.)

Simulations

This algorithm is really easy to implement. Here is a quick implementation in R (definitely can be optimized further, just coded it up quickly for prototyping purposes):

# threshold: number of elements buffer can hold
# trajectory: if TRUE, return the estimate after every item in stream (default FALSE)
estimate_distinct_elements <- function(data_stream, threshold, trajectory = FALSE) {
p <- 1
m <- length(data_stream)
buffer <- c()
hit_fail_event <- FALSE
estimates <- rep(NA, m)
for (i in 1:m) {
buffer <- setdiff(buffer, data_stream[i])
if (runif(1) < p) buffer <- c(buffer, data_stream[i])
if (length(buffer) == threshold) {
buffer <- buffer[runif(threshold) < 0.5]
p <- p / 2
if (length(buffer) == threshold) {  # fail event
hit_fail_event <- TRUE
}
}
estimates[i] <- ifelse(hit_fail_event, -1, length(buffer) / p)
}

if (trajectory) {
return(estimates)
} else {
return(estimates[m])
}
}


Let’s take this for a run with some simulations. All simulation code can be found here. For a start, let’s generate a stream of length 100, where each item is an integer from 1 to 100 (inclusive). For the particular data stream we generated, there were 60 distinct elements. This means that for the naive solution, we would have buffer of length 60.

First, we run the algorithm 1000 times with buffer length 30, half of what the naive solution needed. The algorithm never hit the “fail” event, and here are some summary statistics of the distinct count estimate:

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  34.00   52.00   56.00   60.02   68.00  104.00


The mean and median are pretty close to the true value! Here is a histogram of the estimates:

It’s worth noting that while the estimate is generally pretty good (50% chance of being within 8 of the true number), it can be wildly off as well!

Let’s rerun the simulation for the same data stream but with buffer length 15 (a quarter of the true number of distinct items). Again the algorithm never hit the “fail” event. The summary statistics and histogram are below and as one might expect, while the estimates are still centered at the right place, there is more variance in the estimate. (The histograms might look similar due to different scales on the x-axis.)

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   8.00   48.00   56.00   59.81   72.00  144.00


Since this is a streaming algorithm, it provides an estimate of the number of distinct items after inspecting each item. Let’s look at how good the estimate is over time. Here are the plots for 10 estimate trajectories (true count in blue) for buffer length = 30 and buffer length = 15. Notice that the algorithm gives an exact estimate until the buffer is first filled. As expected, larger buffer length tends to give better estimates.

Let’s do one final simulation: Let’s look at a stream of length 10,000 with each element being a random integer from 1 to 4,000 (inclusive). For our data, the true number of distinct elements is 3,686.

Here is the plot of 10 trajectory estimates with buffer length 500 (about 14% of the true number of uniques) as well as the relative error of the estimate:

That’s not too bad! The estimates are pretty much within 10% of the true values. How about a much smaller buffer length of 50 (1.4% of the true number of uniques):

There’s much wider variation in the estimates now, probably too much to be useful for any real-world application.

References:

1. Chakraborty, S., et al. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book.
2. Nadis, S. (2024). Computer Scientists Invent an Efficient New Way to Count.