Frequentist German Tank Problem

March 20, 2014



The German Tank Problem: The Frequentist Way

Many things are given serial numbers, and often those numbers logically start at 1 and increase by 1 with each new unit. For example, German tanks in World War II had several parts with serial numbers. By collecting these numbers, Allied statisticians could produce estimates of the total number of tanks produced by the Germans in a given time period.

The idea behind this was that the serial numbers encountered in the field were
samples from a discrete uniform distribution starting at 1 and terminating at
some unknown value \( N \), where \( N \) is the true number of tanks built. The problem
then became to use the information from the tanks captured or destroyed in
battle to estimate the value of \( N \).

If you have an urn with balls numbered 1 to \( N \) and draw \( k \) balls with
replacement, the maximum value observed in the sample is a reasonable estimator
of \( N \).

A simple simulation shows that this is the case:

library(ggplot2)

# draw k balls with replacement from an urn numbered 1:N, r times,
# and record the maximum of each draw
sampleUrn <- function(k, N = 10000, r = 100) {
    urn <- 1:N
    sampleMax <- replicate(r, max(sample(urn, k, replace = TRUE)))
    sampleMax
}
k <- round(seq(1, 10000, length.out = 100))
sampleMax <- lapply(k, sampleUrn)
sampleMax <- data.frame(maxValue = do.call(c, sampleMax), k = rep(k, each = 100))
ggplot(sampleMax, aes(x = k, y = maxValue)) + geom_point() + 
    scale_y_continuous("Sample Maximum")

Indeed, as suggested by the plot, the estimator is consistent for \( N \) (a formal
treatment is in Example 5.1.1 in Hogg, McKean and Craig 6th edition).

However, try as they might, the Allies couldn't capture enough tanks to make
an appeal to asymptotic theory. For fairly small sample sizes, the bias of this
estimator is significant. This makes sense: the sample maximum can never exceed
the maximum value of the support, only equal it, and the probability that a
draw contains that maximum depends on the size of the draw (and the sample
space).
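
To put a rough number on that bias, here is a quick check (a minimal sketch of my own; the values \( N = 300 \) and \( k = 10 \) are purely illustrative):

N <- 300
k <- 10
# average of the sample maximum over many draws with replacement
mean(replicate(10000, max(sample(1:N, k, replace = TRUE))))
# typically around 273, well short of the true N of 300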

Furthermore, they weren't sampling the tanks with replacement. The information
about a tank's serial numbers came when the tank was captured or disabled in
combat; once observed, a tank could never be observed again. This complicates
the pmf for \( M \) somewhat, but accepting that complexity provides a more
accurate estimate of \( N \) for a fixed sample size.

Nevertheless, it is possible to produce an estimator \( \hat{N} \) of \( N \) that
is unbiased even for small sample sizes and accounts for the fact that the
sampling is done without replacement. To start simply,

\[
E(M; k) = \sum_{m = 0}^{N} m \text{Pr}(m)
\]

or, in words, the expectation of the sample maximum \( M \) given a draw of size
\( k \) is simply the sum of the products \( m \, \text{Pr}(m) \) over
\( m \in \{0, 1, 2, \ldots, N\} \).

If we recognize that we have observed \( k \) tanks, it is not possible for \( M < k \), because we are sampling without replacement and have seen \( k \) unique tanks. We can therefore split the summation into two parts

\[
E(M; k) = \sum_{m = 0}^{k-1} m \text{Pr}(m) + \sum_{m = k}^{N} m \text{Pr}(m)
\]

and since \( \text{Pr}(M = m) = 0 \) for \( m < k \), this reduces to

\[
E(M; k) = \sum_{m = k}^{N} m \text{Pr}(m)
\]

Since the sampling is without replacement, the maximum equals \( m \) exactly when
tank \( m \) is in the sample and the remaining \( k - 1 \) tanks come from the
\( m - 1 \) tanks with smaller serial numbers. The pmf of \( M \) is therefore the
number of ways to select \( k-1 \) tanks from a set of \( m-1 \) tanks divided by
the total number of ways to select \( k \) tanks from the full set of \( N \) tanks.
After questions involving cards, this is my least favorite part of math stats,
but the expression is given by

\[
\text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
\]
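
If you don't trust the counting argument, a quick check (my own sketch, with small, arbitrary values of \( N \) and \( k \)) compares this pmf to the empirical distribution of the maximum of a sample drawn without replacement:

N <- 20
k <- 5
m <- k:N
# theoretical pmf of the sample maximum under sampling without replacement
pmf <- choose(m - 1, k - 1) / choose(N, k)
# empirical frequencies of the maximum over many draws without replacement
draws <- replicate(100000, max(sample(1:N, k)))
empirical <- as.numeric(table(factor(draws, levels = m))) / length(draws)
round(cbind(m, pmf, empirical), 3)
# the pmf and empirical columns agree to within simulation error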

If we plug this pmf into the summation above for \( \text{Pr}(m) \), we get

\[
E(M; k) = \sum_{m = k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
\]

Even without expansion of the binomial coefficients into a factorial form, this
is ugly. And so expansion is exactly what we do

\[
E(M; k) = \sum_{m = k}^{N} m \frac{\frac{(m-1)!}{(k-1)!((m-1)-(k-1))!}}{\binom{N}{k}}
\]

And if we bring the \( m \) into the expression

\[
E(M; k) = \sum_{m = k}^{N} \frac{\frac{m!}{(k-1)!(m - k)!}}{\binom{N}{k}}
\]

The expression would be much nicer now if we could somehow put the numerator
back into a binomial coefficient. It is close to \( \binom{m}{k} \), but we are
missing a factor of \( k \) in the denominator. If we multiply the top and bottom
of the expansion by \( k \), we do not change the value, but we arrive at

\[
E(M; k) = \sum_{m = k}^{N} \frac{\frac{k \, m!}{k (k-1)! (m - k)!}}{\binom{N}{k}}
\]

which can be rewritten as

\[
E(M; k) = \sum_{m = k}^{N} \frac{k\binom{m}{k}}{\binom{N}{k}}
\]

As far as the summation is concerned, \( k \) and \( \binom{N}{k} \) are constants
and we can rewrite the expression as

\[
E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m}{k}
\]

At first glance, this doesn't look like much of an improvement. However,
now we just have the summation of a single binomial coefficient without any
multiplication inside the summation.
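
The key algebraic step in getting here was the identity \( m \binom{m-1}{k-1} = k \binom{m}{k} \), which is easy to spot-check in R (the particular values of \( m \) and \( k \) below are arbitrary):

m <- 17
k <- 6
m * choose(m - 1, k - 1)
k * choose(m, k)
# both return 74256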

Recall from earlier, the pmf of \( M \) is

\[
\text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
\]

By the definition of a pmf, the sum of \( \text{Pr}(M = m) \) over the support
of \( m \) is 1. As the values of \( \text{Pr}(m) \) are 0 for \( m < k \), this becomes

\[
1 = \sum_{m = k}^{N} \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
\]

The \( \frac{1}{\binom{N}{k}} \) can be pulled out giving

\[
1 = \frac{1}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m-1}{k-1}
\]

In order for this expression to be true,

\[
\binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1}
\]
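
This is the hockey-stick identity for binomial coefficients, and it is easy to verify numerically for any particular \( N \) and \( k \) (the values below are arbitrary):

N <- 25
k <- 4
choose(N, k)
sum(choose((k:N) - 1, k - 1))
# both return 12650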

Now, returning to the expectation, we see something similar with
\( \sum_{m = k}^{N} \binom{m}{k} \); however, we lack the “\( -1 \)” bit. Shifting
the summation index by a constant doesn't change the value of the sum as long
as we shift it everywhere, so let's rewrite the expectation as

\[
E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k+1}^{N+1} \binom{m - 1}{k}
\]

We now have an expression in the right form for the trick based on the pmf to
apply. Taking the relationship \( \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \),
replacing \( N \) with \( N + 1 \) and \( k \) with \( k + 1 \), and substituting,
we arrive at

\[
E(M; k) = \frac{k}{\binom{N}{k}} \binom{N + 1}{k + 1}
\]

No more ugly summations! If we expand the binomial coefficients we get to

\[
E(M; k) = \frac{k \frac{(N + 1)!}{(k+1)!(N-k)!}} {\frac{N!}{k! (N-k)!}}
\]
\[
E(M; k) = \frac{k(N+1)!k!}{(k+1)!N!}
\]

If we realize that \( \frac{k!}{(k+1)!} \) cancels to \( \frac{1}{k+1} \) and that
\( \frac{(N+1)!}{N!} \) reduces to \( N+1 \), we can write this as

\[
E(M; k) = \frac{k (N+1)}{k+1}
\]
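
Before solving for \( N \), it is worth a quick sanity check that this closed form matches the expectation computed directly from the pmf (a small sketch with arbitrary values of \( N \) and \( k \)):

N <- 40
k <- 7
m <- k:N
# expectation of the sample maximum computed directly from the pmf
sum(m * choose(m - 1, k - 1) / choose(N, k))
# the closed form derived above
k * (N + 1) / (k + 1)
# both return 35.875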

By solving this expression for \( N \) and replacing \( E(M; k) \) with the
observed sample maximum \( M \), we get the estimator

\[
\hat{N} = \frac{M k + M}{k} - 1
\]

Therefore,

\[
E(\hat{N}) = E\bigg(\frac{M k + M}{k} - 1\bigg) = \frac{(k + 1) E(M)}{k} - 1 = \frac{k + 1}{k} \cdot \frac{k (N+1)}{k+1} - 1 = N
\]

so \( \hat{N} \) is unbiased for \( N \). Note that the estimator can be reduced to

\[
\hat{N} = M \bigg(1 + \frac{1}{k}\bigg) - 1
\]

for simplicity.
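
Taking the expectation of \( \hat{N} \) against the exact pmf confirms the unbiasedness numerically as well (again a small sketch with arbitrary values of \( N \) and \( k \)):

N <- 40
k <- 7
m <- k:N
pmf <- choose(m - 1, k - 1) / choose(N, k)
nhat <- m * (1 + 1/k) - 1
# E(nhat) equals the true N exactly
sum(nhat * pmf)
# returns 40, up to floating point error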

Since this is an R-centric blog after all, let's check the performance of this
estimator against using just the sample maximum for a collection of fairly
small samples. According to Wikipedia, a reasonable value for \( N \) in the
historical context is about 300.

germanTankSim <- function(k, N) {
    # observe k of the N tanks without replacement and return the adjusted estimate
    tanks <- 1:N
    m <- max(sample(tanks, k))
    nhat <- m * (1 + 1/k) - 1
    nhat
}
sampleMaxEstimator <- function(k, N) {
    # observe k of the N tanks without replacement and return the raw sample maximum
    tanks <- 1:N
    m <- max(sample(tanks, k))
    m
}

N <- 300
k <- round(seq(0.01, 1, 0.01) * N)
k[k == 0] <- 1  # doesn't make sense with k = 0
tanks <- sapply(rep(k, each = 1000), germanTankSim, N)
tanks <- data.frame(adjustedMax = tanks, k = rep(k, each = 1000))
tanks$sampleMax <- sapply(rep(k, each = 1000), sampleMaxEstimator, N)

If you visualize the corrected estimator's estimate as a function of the number
of observed tanks, you see a pretty smooth decline in the variance of the
estimate and a relatively unbiased estimator of \( N \). Note that I've used
both jitter and alpha blending to make the changes in distribution a little
more clear.

ggplot(tanks, aes(x = k)) + 
    geom_jitter(aes(y = adjustedMax), alpha = 0.1, position = position_jitter(width = 1, height = 1)) + 
    geom_hline(aes(yintercept = 300), lty = 2) + 
    scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Unbiased Estimate of Number of Total Tanks")

Turning to the sample maximum, it is clearly biased towards lower values
for the total number of tanks. This makes sense given how unlikely it is that a
draw of a very small size, say 10% of the population, contains the population
maximum. The bias decays as the number of observed tanks increases, and the
estimate converges to the true value when \( k \sim N \).

ggplot(tanks, aes(x = k)) + 
    geom_jitter(aes(y = sampleMax), alpha = 0.1, position = position_jitter(width = 1, height = 1)) + 
    geom_hline(aes(yintercept = 300), lty = 2) + 
    scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Sample Max")

If we calculate the mean squared error at each value of \( k \) for the two
estimators, we see that the adjusted estimator gives a slightly more accurate
estimate of \( N \) than the raw sample maximum.

# mean squared error of an estimator against the true N
mse <- function(value, N) {
    mean((value - N)^2)
}
# MSE at each value of k for both estimators
esd <- aggregate(cbind(tanks$adjustedMax, tanks$sampleMax), by = list(tanks$k), 
    mse, N = 300)
names(esd) <- c("k", "adjustMax", "sampleMax")
esd <- data.frame(k = rep(esd$k, 2), mse = c(esd$adjustMax, esd$sampleMax), 
    estimator = rep(c("Adjusted", "Sample Max"), each = 100))
ggplot(esd, aes(x = k, y = mse, color = estimator)) + geom_line()

While it started out with tanks, this method can be useful for estimating a
large number of different maximum values based on sequential serial numbers.
In the absence of sales data, estimation based on serial numbers observed in the
wild can often provide reasonable estimates of the total number of units. It may
also spell out the end of humanity.

The Bayesian approach gives somewhat different estimates; notably, without a
prior upper limit on the number of units, the posterior mean is not finite for
\( k = 1 \) or \( k = 2 \).

And most importantly, it is fun math, statistics and computing.
