# German Tanks, Statistical Intelligence

May 25, 2010

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In World War II, the Allies had a problem: German tanks were often captured, but how many more did the Nazis have in reserve? Allied intelligence estimated that around 1400 Panther tanks were being produced a month: a formidable arsenal, and perhaps an insurmountable one given the much smaller numbers being captured or destroyed. But those captured tanks provided exactly the clue needed to get a more realistic assessment of German production rates: serial numbers. Assuming that the serial numbers on the tanks were assigned in sequence, and looking at the serial numbers of those captured, Allied statisticians used a simple arithmetic formula to arrive at a less daunting estimate of the production rate: 256 per month. (Production data released after the war showed the true figure to be 255 per month.)

Suppose N tanks were actually produced (numbered 1 through N), and k were captured, with the highest serial number observed being m. The formula (rather intuitively) estimates the actual number of tanks as the maximum serial number plus the average gap between the serial numbers observed. The k captured tanks leave m − k unobserved serial numbers below m, so the average gap is (m − k)/k = m/k − 1, and the estimate is m + m/k − 1. But how well does this work in practice for different values of N (actual tanks)? In a statistical version of a World War II reenactment, the Statistics Blog has recreated the German Tank Problem using R, simulating captured serial numbers for various values of N and comparing each "true" value with the estimate obtained from the sample. The estimates were then plotted against the true values:

[Figure: estimated number of tanks plotted against the true value of N]
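To make the formula concrete, here is a minimal sketch of the same kind of experiment in Python. It is not the Statistics Blog's R code: the function names `estimate_total` and `simulate`, the number of trials, and the seed are all assumptions chosen for illustration. Only the formula m + m/k − 1 and k = 20 come from the article.

```python
import random


def estimate_total(serials):
    """Estimate N as m + m/k - 1 (m = largest serial seen, k = sample size)."""
    k = len(serials)
    m = max(serials)
    return m + m / k - 1


def simulate(n_true, k=20, trials=2000, seed=0):
    """Average the estimate over repeated random captures of k tanks.

    Assumes serials run 1..n_true and every tank is equally likely to be
    captured; trials and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        captured = rng.sample(range(1, n_true + 1), k)  # draw without replacement
        total += estimate_total(captured)
    return total / trials


if __name__ == "__main__":
    # Compare the true N with the average estimate for a few values of N.
    for n_true in (100, 500, 1000, 5000):
        print(n_true, round(simulate(n_true), 1))
```

As a quick hand check, capturing tanks with serial numbers 19, 40, 42 and 60 gives m = 60 and k = 4, so the estimate is 60 + 60/4 − 1 = 74.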

As evidenced by history, the formula is startlingly accurate, even though k, the number of captured tanks, is set at just 20! I’d never come across the German Tank Problem before, so thanks to Statistics Blog for providing such a great illustration of it. I also liked the commentary on Maximum Likelihood Estimation, as this provides one example where the MLE fails:

> The MLE for the number of German tanks is the highest serial number observed. This is because MLE works backwards, finding the parameter which makes our observation most likely in terms of joint conditional probability. As a result, the MLE for this problem is not only biased (since it will always be less than or equal to the true number of tanks), but dumb as well.

Statistics Blog: How many tanks? MC testing the GTP
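The bias mentioned in the quoted passage can be checked exactly rather than by simulation. The sketch below assumes k serials are drawn without replacement from 1..N, each sample equally likely; the function name `expected_max` and the choice N = 1000 are mine, not the Statistics Blog's. It uses the fact that the chance the largest serial equals m is C(m−1, k−1)/C(N, k).

```python
from fractions import Fraction
from math import comb


def expected_max(n_true, k):
    """Exact expected value of the largest of k serials drawn from 1..n_true."""
    total_ways = comb(n_true, k)
    # P(max = m) = C(m-1, k-1) / C(n_true, k): the other k-1 serials lie below m.
    return sum(
        Fraction(m * comb(m - 1, k - 1), total_ways)
        for m in range(k, n_true + 1)
    )


n_true, k = 1000, 20
e_mle = expected_max(n_true, k)                 # expected value of the MLE (the maximum)
e_adjusted = e_mle * (1 + Fraction(1, k)) - 1   # expected value of m + m/k - 1

print(float(e_mle))       # roughly 953.3: the MLE falls short on average
print(float(e_adjusted))  # 1000.0: the gap-adjusted estimate is right on average
```

With 20 captured tanks out of 1000, the maximum alone undershoots by about 47 tanks on average, while adding the average gap removes that shortfall.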
