**jacobsimmering.com**, and kindly contributed to R-bloggers)

# The German Tank Problem: The Frequentist Way

Many things are given a serial number and often that serial number, logically,

starts at 1 and for each new unit is increased by 1. For example, German tanks

in World War II had several parts with serial numbers. By collecting the value

of these numbers, Allied statisticians could produce estimates of the total

number of tanks produced by the Germans in a given time period.

The idea behind this was that the serial numbers encountered in the field were

samples from a discrete uniform distribution starting at 1 and terminating at

some unknown value \( N \), where \( N \) is the true number of tanks built. The problem

then became to use the information from the tanks captured/destroyed in

battle estimate the value of \( N \).

If you had an urn with numbered balls from 1 to \( N \) and draw \( k \) balls with

replacement, the maximum value observed in the sample is a reasonable estimator

of \( N \).

A simple simulation shows that this is the case:

```
sampleUrn <- function(k, N = 10000, r = 100) {
urn <- 1:N
sampleMax <- replicate(r, max(sample(urn, k, replace = TRUE)))
sampleMax
}
k <- seq(1, 10000, length.out = 100)
sampleMax <- lapply(k, sampleUrn)
sampleMax <- data.frame(maxValue = do.call(c, sampleMax), k = rep(k, each = 100))
ggplot(sampleMax, aes(x = k, y = maxValue)) + geom_point() + scale_y_continuous("Sample Maximum")
```

Indeed, as suggested by the plot, the estimator is consistent for \( N \) (a formal

treatment is in Example 5.1.1 in Hogg, McKean and Craig 6th edition).

However, try as they might, the Allies couldn't capture enough tanks to make

an appeal to asymptotic theory. For fairly small sample sizes, the bias of this

estimator is significant. This makes sense as the sample max cannot exceed but

only equal the maximum value of the support and the probability of a draw

containing a given sample depends on the size of the draw (and the sample

space).

Furthermore, they weren't sampling the tanks with replacement. The information

about the tank's serial numbers came when the tank was captured or disabled in

combat. Therefore, once observed, the tank can never be observed again. This

complicates the pmf for \( M \) somewhat but provides more accurate estimation

of \( N \) for a fixed sample than without accepting this complexity.

Nevertheless, it is possible to produce an estimator of \( N \), \( \hat{N} \), that

is unbiased for small sample sizes and captures the fact that the sampling is

done without replacement. To start simply,

\[

E(M; k) = \sum_{m = 0}^{N} m \text{Pr}(m)

\]

or the expectation of the sample maximum, \( M \), given a draw of size \( k \) is

simply the sum of the products of \( m \) times \( \text{Pr}(m) \) where

\( m \in {1, 2, \ldots} \).

If we recognize that we have observed \( k \) tanks, it is not possible for \( N < k \) because we are sampling without replacement and have seen \( k \) unique tanks, we can split the summation into two parts

\[

E(M; k) = \sum_{m = 0}^{k-1} m \text{Pr}(m) + \sum_{m = k}^{N} m \text{Pr}(m)

\]

and since \( Pr(N = m) = 0 \) for \( m < k \), this reduces to

\[

E(M; k) = \sum_{m = k}^{N} m \text{Pr}(m)

\]

Since the sampling is without replacement, the pmf of \( M \) is the number of

ways to select \( k-1 \) tanks from a set of \( m-1 \) tanks divided by the total number

of ways to select \( k \) tanks from the total set of \( N \) tanks. After questions

involving cards, this is my least favorite part of math stats, but it appears

that this expression is given by

\[

\text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}

\]

If we plug this in the the summation above for \( \text{Pr}(m) \), we get

\[

E(M; k) = \sum_{m = k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}}

\]

Which, even without expansion of the binomial coefficients into a factorial

form, is ugly. And so expansion is exactly what we do

\[

E(M; k) = \sum_{m = k}^{N} m \frac{\frac{(m-1)!}{(k-1)!((m-1)-(k-1))!}}{\binom{N}{k}}

\]

And if we bring the \( m \) into the expression

\[

E(M; k) = \sum_{m = k}^{N} \frac{\frac{m!}{(k-1)!((m - k))!}}{\binom{N}{k}}

\]

The expansion would be much nicer now if we could somehow put it back into a

binomial coefficient. It is close to \( \binom{m}{k} \) but we would need a

\( \frac{1}{k} \) multiplier. If we multiply the top and bottom of the expansion

by \( k \), we do not change the value but we arrive at

\[

E(M; k) = \sum_{m = k}^{N} \frac{\frac{km!}{k(k-1)!((m - k))!}}{\binom{N}{k}}

\]

which can be rewritten as

\[

E(M; k) = \sum_{m = k}^{N} \frac{k\binom{m}{k}}{\binom{N}{k}}

\]

As far as the summation is concerned, \( k \) and \( \binom{N}{k} \) are constants

and we can rewrite the expression as

\[

E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m}{k}

\]

At first glance, this doesn't look like much of an improvement. However,

now we just have the summation of a single binomial coefficient without any

multiplication inside the summation.

Recall from earlier, the pmf of \( M \) is

\[

\text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}

\]

By the definition of a pmf, the sum of \( \text{Pr}(M = m) \) over the support

of \( m \) is 1. As the values of \( \text{Pr}(m) \) are 0 for \( m < k \), this becomes

\[

1 = \sum_{m = k}^{n} \frac{\binom{m-1}{k-1}}{\binom{N}{k}}

\]

The \( \frac{1}{\binom{N}{k}} \) can be pulled out giving

\[

1 = \frac{1}{\binom{N}{k}} \sum_{m = k}^{n} \binom{m-1}{k-1}

\]

In order for this expression to be true,

\[

\binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1}

\]

Now returning to the expectation, we see something similar with

\( \sum_{m = k}^{N} \binom{m}{k} \), however, we lack the “\( -1 \)” bit. If we realize

that playing with the indices by adding or subtracting a constant doesn't

change the value of the expression as long as we change all the indices, lets

rewrite the expectation as

\[

E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k+1}^{N+1} \binom{m - 1}{k - 1}

\]

we now have an expression in the right form for this trick based on the pmf to

apply. Taking relationship \( \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \)

and substituting, we arrive at

\[

E(M; k) = \frac{k}{\binom{N}{k}} \binom{N + 1}{k + 1}

\]

No more ugly summations! If we expand the binomial coefficients we get to

\[

E(M; k) = \frac{k \frac{(N + 1)!}{(k+1)!(N-k)!}} {\frac{N!}{k! (N-k)!}}

\]

\[

E(M; k) = \frac{k(N+1)!k!}{(k+1)!N!}

\]

If we realize that \( \frac{k!}{(k+1)!} \) cancels to \( \frac{1}{k+1} \) and that

\( \frac{(n+1)!}{n!} \) reduces to \( n+1 \), we can write this as

\[

E(M; k) = \frac{k (N+1)}{k+1}

\]

By solving this expression for \( N \),

\[

N = \frac{M k + M}{k} - 1

\]

Therefore,

\[

E(N) = E\bigg(\frac{M k + M}{k} - 1\bigg) = \frac{M k + M}{k} - 1

\]

Note that this can be reduced to

\[

E(N) = M \bigg(1 + \frac{1}{k}\bigg) - 1

\]

for simplicity.

Since this is an R-centric blog after all, lets check the performance of this

estimator against using just the sample maximum for a collection of fairly

small samples. According to

Wikipedia,

a reasonable number for \( N \) under the historical context is about 300.

```
germanTankSim <- function(k, N) {
tanks <- 1:N
m <- max(sample(tanks, k))
nhat <- m * (1 + 1/k) - 1
nhat
}
sampleMaxEstimator <- function(k, N) {
tanks <- 1:N
m <- max(sample(tanks, k))
m
}
N <- 300
k <- round(seq(0.01, 1, 0.01) * N)
k[k == 0] <- 1 # doesn't make sense with k = 0
tanks <- sapply(rep(k, each = 1000), germanTankSim, N)
tanks <- data.frame(adjustedMax = tanks, k = rep(k, each = 1000))
tanks$sampleMax <- sapply(rep(k, each = 1000), sampleMaxEstimator, N)
```

If you visualize the corrected estimator's estimate as a function of the number

of observed tanks, you see a pretty smooth decline in the variance of the

estimate and a relatively unbiased estimator of \( N \). Note that I've used

both jitter and alpha blending to make the changes in distribution a little

more clear.

```
ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = adjustedMax), alpha = 0.1, position = position_jitter(width = 1,
height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") +
scale_y_continuous("Unbiased Estimate of Number of Total Tanks")
```

Turning towards the sample maximum, it is clearly biased towards lower values

for the total number of tanks. Given that the probability for a draw of a very

small size, say 10% of the population, to contain the population maximum value,

this makes sense. The bias decays with increased numbers of observed tanks and

converges to the true value when \( k \sim N \).

```
ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = sampleMax), alpha = 0.1, position = position_jitter(width = 1,
height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") +
scale_y_continuous("Sample Max")
```

If we calculate the mean square error at each value of \( k \) for the two

estimators, we see that the adjusted estimator has a slightly more accurate

estimate of \( N \) than the raw sample mean.

```
mse <- function(value, N) {
mean((value - N)^2)
}
esd <- aggregate(cbind(tanks$adjustedMax, tanks$sampleMax), by = list(tanks$k),
mse, N = 300)
names(esd) <- c("k", "adjustMax", "sampleMax")
esd <- data.frame(k = rep(esd$k, 2), mse = c(esd$adjustMax, esd$sampleMax),
estimator = rep(c("Adjusted", "Sample Max"), each = 100))
ggplot(esd, aes(x = k, y = mse, color = estimator)) + geom_line()
```

While it started out with tanks, this method can be useful for estimating a

large number of different maximum values based on sequential serial numbers.

In the absence of sales data, estimation based on serial numbers observed in the

wild can often provide reasonable estimates of the total number of units. It may

also spell out the end of humanity.

It also has somewhat different estimates using Bayesian approaches, specifically

a non-finite mean for \( k = 1 \) or \( k = 2 \) without setting a prior limit on the

number of units.

And most importantly, it is fun math, statistics and computing.

**leave a comment**for the author, please follow the link and comment on their blog:

**jacobsimmering.com**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...