Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

tl;dr -I don’t remember how many games of Clue I’ve played but I do remember being surprised by Mrs White being the murderer in only 2 of those games. Can you give an estimate and an upper bound for the number of games I have played?
We solve this problem by using Bayes theorem and discussing the data generation mechanism, and illustrate the solution with R. The characters in the original game of Clue. Mrs White is third from the left on the first row (and is now retired!)

#### Making use of external information with Bayes theorem

Having been raised a frequentist, I first tried to solve this using a max likelihood method, but quickly gave up when I realized how intractable it actually was, especially for the upper bound.
This is a problem where conditioning on external knowledge is key, so the most natural way to tackle it is to use Bayes theorem. This will directly yield an interpretable probability for what we’re looking for (most probable number of and uncertainty interval)

Denote an integer n>3 and:

What we want writes as a simple product of quantities that we can compute, thanks to Bayes:
Notice that there is an “proportional to” sign instead of an equal. This is because the denominator is just a normalization constant, which we can take care of easily after computing the likelihood and the prior.
###### Likelihood
The likelihood indicates the odds of us observing the data (in this case, that k_Mrs_White = 2) given the value of the unknown parameter (here the number of games played). Since at the beginning of each game the murderer is chosen at uniform random between 6 characters, the number of times Mrs White ends up being the culprit can be modeled as a binomial distribution with parameters n and 1/6.

This will be easily obtained using the dbinom function, which gives directly the exact value of P(x = k), for any x and a binomial distribution of parameters n and p. Let’s first import a few useful functions that I put in our GitHub repo to save some space on this post, and set a few useful parameters:
library(tidyverse)
source("clue/clue_functions.R")

## Parameters
k_mrs_white <- 2 # Number of times Mrs. White was the murderer
prob <- 1/6 # Probability of Mrs. White being the murderer for one game

Note that we can’t exactly obtain the distribution for any game from 1 to infinity, so we actually compute the distribution for 1 to 200 games (this doesn’t matter much in practice):
x <- 1:200 # Reduction of the problem to a finite number of games

## Likelihood
dlikelihood <- dbinom(k_mrs_white, x, prob)

easy enough Side note: when I was a student, I kept forgetting that the distribution functions existed in R and whenever I needed them I used to re-generate them using the random generation function (rbinom in this case) ###### Prior
There are a lot of possible choices for the prior but here I’m going to consider that I don’t have any real reason to believe and assume a uniform probability for any number of games between 3 and 200:
dprior1 <- dunifdisc(x,3,100)
plot_clue_prior(x, dprior1)

###### First posterior
Using the likelihood and the prior, we can easily compute the posterior, normalize it and plot it:
dposterior1 <- dlikelihood * dprior1
dposterior1 <- dposterior1 / sum(dposterior1)
plot_clue_posterior(x, dposterior1)

We can also compute directly the estimates we’re looking for. The most probable number of games played is 11:
> which.max(dposterior1)
 11

And there is a 95% chance that the number of games is less than 40:
> threshold_val <- 0.975
> which(cumsum(dposterior1) > (threshold_val))
 40


#### A more realistic data generation mechanism

I find this result very unsatisfying. It doesn’t “feel” right to me that someone would be surprised by only 2 occurrences of Mrs White being guilty in such a small number of games! For example, I simulated 40 games, a number that was supposedly suspiciously high according to the model:
set.seed(9)
N_sim_games <- 40
sim_murderer <- runifdisc(N_sim_games, 6)

plot_murderer <- ggplot(tibble(x=1:N_sim_games, y=sim_murderer), aes(y, stat(count))) +
geom_histogram(aes(y =..count..),
bins=6, fill="white",colour="black") +
ylab("Frequency - murderer") +
xlab("Character #") +
scale_x_continuous(breaks = 1:6)
print(plot_murderer) Simulating 40 games of Clue. Character #4 and #5 were the murderer in respectively only 2 and 3 games
We observe that characters #4 and #5 are the murderers in respectively only 2 and 3 games!

In the end I think what really counts is not the likelihood that *Mrs White* was the murderer 2 times, but that the *minimum* number of times one of the characters was the culprit was 2 times!

I think it’s a cool example of a problem where just looking at the data available is not enough to make good inference – you also have to think about *how* the data was generated (in a way, it’s sort of a twist of the Monty Hall paradox, which is one of the most famous examples of problems where the data generation mechanism is critical to understand the result).

I wrote a quick and dirty function based on simulations to generate this likelihood, given a certain number of games. I saved the distribution directly in the GitHub (and called it Gumbel since it kinda looks like an extreme value problem) so that we can call it and do the same thing as above:
gumbelclue_2 <- readRDS("clue/dcluegumbel_2.rds")
gumbelclue_2 <- gumbelclue_2[x]

dposterior_gen <- gumbelclue_2 * dprior1
dposterior_gen <- dposterior_gen / sum(dposterior_gen)

plot_clue_posterior(x, dposterior_gen)

The new posterior has the same shape but appears shifted to the right. For example N_games = 50 seems much more likely now! The estimates are now 23 for the number of games:
> which.max(dposterior_gen)
 23

And 51 for the max bound of the uncertainty interval
> threshold_val <- 0.975
> which(cumsum(dposterior_gen) > (threshold_val))
 51

Credits for title image: Yeonsang