[This article was first published on R [english] – NC233, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
tl;dr -I don’t remember how many games of Clue I’ve played but I do remember being surprised by Mrs White being the murderer in only 2 of those games. Can you give an estimate and an upper bound for the number of games I have played? We solve this problem by using Bayes theorem and discussing the data generation mechanism, and illustrate the solution with R.
Making use of external information with Bayes theorem
Having been raised a frequentist, I first tried to solve this using a max likelihood method, but quickly gave up when I realized how intractable it actually was, especially for the upper bound.
This is a problem where conditioning on external knowledge is key, so the most natural way to tackle it is to use Bayes theorem. This will directly yield an interpretable probability for what we’re looking for (most probable number of and uncertainty interval)
Denote an integer n>3 and:
What we want writes as a simple product of quantities that we can compute, thanks to Bayes:
Notice that there is an “proportional to” sign instead of an equal. This is because the denominator is just a normalization constant, which we can take care of easily after computing the likelihood and the prior.
The likelihood indicates the odds of us observing the data (in this case, that k_Mrs_White = 2) given the value of the unknown parameter (here the number of games played). Since at the beginning of each game the murderer is chosen at uniform random between 6 characters, the number of times Mrs White ends up being the culprit can be modeled as a binomial distribution with parameters n and 1/6.
This will be easily obtained using the dbinom function, which gives directly the exact value of P(x = k), for any x and a binomial distribution of parameters n and p. Let’s first import a few useful functions that I put in our GitHub repo to save some space on this post, and set a few useful parameters:
k_mrs_white <- 2 # Number of times Mrs. White was the murderer
prob <- 1/6 # Probability of Mrs. White being the murderer for one game
Note that we can’t exactly obtain the distribution for any game from 1 to infinity, so we actually compute the distribution for 1 to 200 games (this doesn’t matter much in practice):
x <- 1:200 # Reduction of the problem to a finite number of games
dlikelihood <- dbinom(k_mrs_white, x, prob)
Side note: when I was a student, I kept forgetting that the distribution functions existed in R and whenever I needed them I used to re-generate them using the random generation function (rbinom in this case)
There are a lot of possible choices for the prior but here I’m going to consider that I don’t have any real reason to believe and assume a uniform probability for any number of games between 3 and 200:
I find this result very unsatisfying. It doesn’t “feel” right to me that someone would be surprised by only 2 occurrences of Mrs White being guilty in such a small number of games! For example, I simulated 40 games, a number that was supposedly suspiciously high according to the model:
We observe that characters #4 and #5 are the murderers in respectively only 2 and 3 games!
In the end I think what really counts is not the likelihood that *Mrs White* was the murderer 2 times, but that the *minimum* number of times one of the characters was the culprit was 2 times!
I think it’s a cool example of a problem where just looking at the data available is not enough to make good inference – you also have to think about *how* the data was generated (in a way, it’s sort of a twist of the Monty Hall paradox, which is one of the most famous examples of problems where the data generation mechanism is critical to understand the result).
I wrote a quick and dirty function based on simulations to generate this likelihood, given a certain number of games. I saved the distribution directly in the GitHub (and called it Gumbel since it kinda looks like an extreme value problem) so that we can call it and do the same thing as above: