In World War II, the Allies had a problem: German tanks were often captured, but how many more did the Nazis have in reserve? Allied intelligence estimated that around 1400 Panther tanks were being produced a month: a formidable arsenal, and perhaps an insurmountable one given the much smaller numbers being captured or destroyed. But those captured tanks provided exactly the clue needed to get a more realistic assessment of the German production rate: serial numbers. Assuming that the serial numbers on the tanks were assigned in sequence, and looking at the serial numbers of those captured, Allied statisticians devised a simple arithmetic formula that gave a less daunting estimate of the production rate: 256 per month. (Production data released after the war showed the true figure to be 255 per month.)
Suppose N tanks were actually produced, and k were captured, with the highest serial number observed being m. The formula (rather intuitively) estimates the actual number of tanks as the maximum serial number plus the average gap between the serial numbers observed. With k serial numbers observed up to m, that average gap is (m - k)/k, so in formula form the estimate is m + m/k - 1. But how well does this work in practice for different values of N (actual tanks)? In a statistical version of a World War II reenactment, the Statistics Blog has recreated the German Tank Problem using R, simulating captured serial numbers for various values of N, comparing each "true" value with the estimate obtained, and plotting the two against each other.
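As a quick worked example of the formula m + m/k - 1, here is a minimal Python sketch. The function name `estimate_total` and the sample serial numbers are made up for illustration; they do not come from the Statistics Blog post or from wartime data.

```python
def estimate_total(serials):
    """Estimate the number of tanks produced as m + m/k - 1.

    serials: observed serial numbers, assumed to be drawn without
    replacement from 1..N (an assumption of the problem setup).
    """
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)          # highest serial number observed
    return m + m / k - 1      # maximum plus the average gap


# Hypothetical sample of four captured tanks.
captured = [19, 40, 42, 60]
print(estimate_total(captured))  # 60 + 60/4 - 1 = 74.0
```

With a single captured tank the estimate is roughly twice its serial number (m + m - 1), which matches the intuition that one observation lands, on average, in the middle of the range.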
The MLE (maximum likelihood estimate) for the number of German tanks is the highest serial number observed. This is because MLE works backwards, finding the parameter which makes our observation most likely in terms of joint conditional probability. As a result, the MLE for this problem is not only biased (since it will always be less than or equal to the true number of tanks), but dumb as well.
Statistics Blog: How many tanks? MC testing the GTP
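The bias claim in the quoted passage can be checked with a small Monte Carlo run. The sketch below is in Python rather than the R used in the original post, and it is not the Statistics Blog's code: the function name `simulate`, the choice of N = 1000 and k = 10, and the number of trials are all assumptions picked for illustration. It assumes serial numbers run from 1 to N and that captures are a uniform random sample without replacement.

```python
import random


def simulate(n_true, k, trials=20000, seed=0):
    """Average the MLE (m) and the adjusted estimate (m + m/k - 1)
    over many simulated captures of k tanks out of n_true."""
    rng = random.Random(seed)             # fixed seed for repeatability
    population = range(1, n_true + 1)     # serial numbers 1..N
    mle_sum = 0.0
    adjusted_sum = 0.0
    for _ in range(trials):
        m = max(rng.sample(population, k))   # highest serial captured
        mle_sum += m
        adjusted_sum += m + m / k - 1
    return mle_sum / trials, adjusted_sum / trials


mle_mean, adjusted_mean = simulate(n_true=1000, k=10)
print(f"true N:                 1000")
print(f"mean MLE (max serial):  {mle_mean:.1f}")
print(f"mean adjusted estimate: {adjusted_mean:.1f}")
```

Under these assumptions the expected maximum is k(N + 1)/(k + 1), so the MLE should average near 910 here, while the adjusted estimate should average close to the true value of 1000.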