Monty Hall problem, with Thompson sampling

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We all know the Monty Hall problem. Recently, Jason Rosenhouse published a book on that topic (entitled The Monty Hall Problem, The Remarkable Story of Math’s Most Contentious Brain Teaser). The game is more or less described by the following question

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

While I was preparing some slides for a lecture on Bayesian modeling and thinking, I wanted to find an illustration of what is sometimes called the Bayesian brain, that can be related to updates of beliefs, when we experience. And I was looking for examples of Thompson sampling. And actually, it is possible to learn that switching is the optimal strategy, in the Monty Hall problem, just by playing sequentially the game, and learning from previous strategies. The following code is used, to choose the door with the price (the car), and the one we first select

n = 5000
listdoor = matrix(1:3,3,n)
door = listdoor
win = sample(1:3,size=n,replace=TRUE)
pick1 = sample(1:3,size=n,replace=TRUE)

Then, the presenter picks one, that is neither the car, nor the one we chose initially. The following trick can be used, to get the list of available choices

door[win+(0:(n-1))*3] = NA
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA NA NA 1 NA NA 1 NA 1 NA
[2,] 2 2 NA NA 2 2 2 NA NA 2
[3,] 3 NA 3 3 NA NA NA 3 NA NA
door[pick1+(0:(n-1))*3] = NA
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA NA NA 1 NA NA 1 NA 1 NA
[2,] 2 2 NA NA 2 2 2 NA NA 2
[3,] 3 NA 3 3 NA NA NA 3 NA NA

Then, the presenter picks one

presenter = apply(door,2, function(x) sample(x[!],size=1))
> presenter[win != pick1] = apply(door,2,function(x) x[!])[win != pick1] 
presenter = unlist(presenter)
[1] 3 2 3 1 2 2 2 3 1 2

Now, let us consider the  Monty Hall problem. We have two possible strategies. The first one is to keep the door we chose, initially

pick2a = pick1
gaina = (pick2a==win)
[1] 0.3392

As expected, on average, we win with (about) 1 chance out of 3. The second one is to (always) pick the other door (the one left). The code is close to the one we used before

door = listdoor
door[pick1+(0:(n-1))*3] = NA
door[presenter+(0:(n-1))*3] = NA
pick2b = apply(door,2,function(x) x[!])
gainb = (pick2b==win)
[1] 0.6608

If you know Monty Hall problem the probability to win is now 2 chance out of 3 (which is what the maths tells us). That is what we have with simulations.

Now, what if we don’t know how to do the maths, and we don’t want to compute it? We can use Thompson sampling to explore, and exploit. In a general context, we have to choose among On a le choix entre \(K\) alternatives (here \(K=2\), since we can either keep our initial choice, or pick the other one), and the output is \(\boldsymbol{X}=(X_1,\cdots, X_K)\), where \(X_k\sim\mathcal{B}(\theta_k)\), but \(\theta_k\) is unknow, and we will play the game, and learn. From previous computations, we know that \(\theta_1=1/3\) while \(\theta_2=2/3\).

We use some prior distribution, \(\theta_k\sim\mathcal{B}eta(\alpha_k,\beta_k)\), since the Beta distribution is the conjugate of the Bernoulli. At time \(t\), we draw \(K\) (independent) Beta variables \(B_k\sim\mathcal{B}eta(\alpha_k,\beta_k)\), and pick \(k^\star = \displaystyle{\underset{k=1,\cdots,K}{\text{argmax}}\{B_k\}}\).  Here the code will be

X = cbind(pick2a == win,pick2b == win)*1
AB1 = AB2 = tirage = matrix(NA,n,2)
choix = rep(NA,n)
AB1[k,] = AB2[k,] = c(1,1)
for(k in 1:(n-1)){
tirage[k,] = c(rbeta(1,AB1[k,1],AB1[k,2]),
choix[k] = which.max(tirage[k,])
if(choix[k] == 1){
AB1[k+1,] = AB1[k,] + c(X[k,1],1-X[k,1])
AB2[k+1,] = AB2[k,] 
if(choix[k] == 2){
AB1[k+1,] = AB1[k,] 
AB2[k+1,] = AB2[k,] + c(X[k,2],1-X[k,2])

Before showing some graphs, let us check that indeed, we select more the second strategy (which is here to select the other door)

[1] 5 13
[1] 3292 1693

Indeed, since the average of a Beta distribution, \(\mathcal{B}eta(\alpha,\beta)\) is \(\alpha/(\alpha+\beta)\)

[1] 0.6603811

i.e. the probability to win, with this second strategy is about \(2/3\) (as obtained previously). We can visualize this on the animation below, with, in red the first strategy (keep your initial choice), in green the second one (select the other door), 0 and 1 respectively if we win, or not. Then we can visualize the evolution of \(\alpha_2\) and \(\beta_2\) on topc, and \(\alpha_1\) and \(\beta_1\) below (the index is time \(t\)). Finallly, we have the two variables \(B_1\) and \(B_2\) drawn,

Of course, another simulation would have given different \(B_1\)‘s and \(B_2\)‘s, but finally, we learn that the second strategy is better, and we learn it quite fast…

Here is another one (just to confirm)

So clearly, even if we don’t know which is the optimal strategy (keep our initial choice, or switch), a player who played that game about 30 times should be able to understand that switching should be a better strategy.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)