(This article was first published on **Freakonometrics » R-english**, and kindly contributed to R-bloggers)

Quite frequently, someone on the internet discovers the Monty Hall paradox, and becomes so enthusiastic that it becomes urgent to publish an article (or a post) about it. The latest example can be found at http://www.bbc.co.uk/news/magazine-24045598. I won’t blame them, I did the same a few years ago (see http://freakonometrics.hypotheses.org/776, or http://freakonometrics.hypotheses.org/775, in French).

My point today is that the Monty Hall paradox raises an important question about information: how come something that sounds non-informative can actually be extremely informative? I will not go back to the blue eyes paradox (see http://freakonometrics.hypotheses.org/1963, in French) or the exam paradox (see http://freakonometrics.hypotheses.org/2328, in French once more), which are related to information, but not with a probabilistic approach. I will stay close to the Monty Hall paradox today.

This morning, in my probability class, we were looking at a simple exercise (I say *simple* because it is only the second class of the session). The problem was the following:

Consider an urn $U_1$, with 15 blue balls and 10 red balls, and an urn $U_2$, with 10 blue balls and 15 red balls. We randomly select one urn (with probability 50% for each urn).

We draw a ball, which turns out to be blue, and we put it back in the urn. Now, we draw a (second) ball. What is the probability that this (second) ball is blue?

Please, take your time to read that carefully…

Ready? Your first thought should be that since we put the ball back after the first draw, it does not change the probabilities, right? So, why did we mention it? Is it necessary? (About that last question: yes, when something is mentioned in an exercise, we should use it.)

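Let’s forget about this second ball story, as an introduction to this problem. What was, actually, the probability for the first ball to be blue? Trivially, writing $B_1$ for the event that the first ball is blue and $U_i$ for the event that urn $i$ was selected, it was

$$\mathbb{P}(B_1)=\mathbb{P}(B_1\mid U_1)\,\mathbb{P}(U_1)+\mathbb{P}(B_1\mid U_2)\,\mathbb{P}(U_2)$$

i.e.

$$\mathbb{P}(B_1)=\frac{15}{25}\cdot\frac{1}{2}+\frac{10}{25}\cdot\frac{1}{2}=\frac{1}{2}.$$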

Let us run some code to check that, using simulations:

> n=1000000
> set.seed(1)

First, let us draw the urn, randomly

> urn=sample(1:2,size=n,replace=TRUE)

Then, let us draw the first, and the second ball,

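> urns=matrix(c(15,10,10,15),2,2)
> colnames(urns)=c("blue","red")
> sample.urn=(urns[urn,])
> prob.urn=sample.urn/apply(sample.urn,1,sum)
> # note: TRUE (which has probability prob.urn[,1]) picks index 2, i.e. "red",
> # so the labels are swapped; this is harmless here because the two urns mirror each other
> u1=c("blue","red")[1+(runif(n)<prob.urn[,1])]
> u2=c("blue","red")[1+(runif(n)<prob.urn[,1])]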

The probability that the first ball was blue is here

> sum(u1=="blue")/n
[1] 0.499953

and for the second one

> sum(u2=="blue")/n
[1] 0.499221

So, indeed, the probability of getting a blue ball is 50%. Now, what was the question? Given that the first ball was blue, what is the probability that the second one is blue? Here, in our simulations, it is

> sum(u2[u1=="blue"]=="blue")/sum(u1=="blue")
[1] 0.5194088

which is close to 52%. And if you run more simulations, you get

> f=function(seed){
+ set.seed(seed)
+ urns=matrix(c(15,10,10,15),2,2)
+ colnames(urns)=c("blue","red")
+ sample.urn=(urns[urn,])
+ prob.urn=sample.urn/apply(sample.urn,1,sum)
+ u1=c("blue","red")[1+(runif(n)<prob.urn[,1])]
+ u2=c("blue","red")[1+(runif(n)<prob.urn[,1])]
+ return(sum(u2[u1=="blue"]=="blue")/
+ sum(u1=="blue"))
+ }
> Vectorize(f)(1:20)
 [1] 0.5194088 0.5200931 0.5203338 0.5192104 0.5196960 0.5206121 0.5195453
 [8] 0.5184580 0.5203755 0.5200154 0.5196557 0.5179276 0.5188652 0.5204724
[15] 0.5197437 0.5209244 0.5205770 0.5208725 0.5206228 0.5190711

The probability is always close to 52%, and is (significantly) different from 50%.
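To put a rough number on "significantly", here is a minimal sketch that takes the 20 frequencies reported above and tests them against 50%. It assumes the 20 replications can be treated as independent draws, which is only approximately true since they share the same `urn` vector; the variable names `est`, `est_mean`, `est_sd` and `tt` are introduced here for illustration.

```r
# The 20 conditional frequencies reported above, one per seed
est <- c(0.5194088, 0.5200931, 0.5203338, 0.5192104, 0.5196960,
         0.5206121, 0.5195453, 0.5184580, 0.5203755, 0.5200154,
         0.5196557, 0.5179276, 0.5188652, 0.5204724, 0.5197437,
         0.5209244, 0.5205770, 0.5208725, 0.5206228, 0.5190711)

# Location and spread of the estimates across seeds
est_mean <- mean(est)
est_sd   <- sd(est)

# One-sample t-test of the null hypothesis "the true value is 0.5"
# (assumption: replications are treated as independent)
tt <- t.test(est, mu = 0.5)

c(mean = est_mean, sd = est_sd, p.value = tt$p.value)
```

The spread across seeds is tiny compared with the two-point gap between the estimates and 50%, which is why the difference cannot be blamed on simulation noise.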

Still not convinced that we have some information here that should be used? Imagine that in the first urn, we put 1 blue ball and 24 red balls, and the opposite in the second one. In that case, if we say that the first ball was blue, it means that it is very likely that the urn chosen was the second one. Let’s look at it by running some simulations

> set.seed(1)
> urns=matrix(c(1,24,24,1),2,2)
> colnames(urns)=c("blue","red")
> sample.urn=(urns[urn,])
> prob.urn=sample.urn/apply(sample.urn,1,sum)
> u1=c("blue","red")[1+(runif(n)<prob.urn[,1])]
> u2=c("blue","red")[1+(runif(n)<prob.urn[,1])]

As before, the probability that the second ball is blue is 50% (because of the symmetry actually)

> sum(u2=="blue")/n
[1] 0.500362

But if I tell you that the first one was blue, the probability that the second one is blue becomes

> sum(u2[u1=="blue"]=="blue")/sum(u1=="blue")
[1] 0.9236433
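The statement that a blue ball makes one urn "very likely" can be made exact with Bayes’ rule. The sketch below is a hypothetical helper (`posterior_urn` is a name chosen here, not taken from the original code) that returns the probability of each urn once a blue ball has been observed, assuming 25 balls per urn and equal prior probabilities.

```r
# Hypothetical helper: probability of each urn given that one blue ball was drawn
# blue  : number of blue balls in urn 1 and urn 2
# total : number of balls per urn (assumed equal in both urns)
# prior : prior probability of selecting each urn
posterior_urn <- function(blue, total = 25, prior = c(0.5, 0.5)) {
  lik <- blue / total              # P(blue | urn)
  lik * prior / sum(lik * prior)   # Bayes' rule
}

posterior_urn(c(15, 10))  # 15 and 10 blue balls: 0.60 and 0.40
posterior_urn(c(1, 24))   # 1 and 24 blue balls:  0.04 and 0.96
```

In the extreme setting, a single blue ball moves the probability of the second urn from 50% to 96%, so the second draw is almost surely made from the urn that is full of blue balls.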

So even if, somehow, we do not change much by putting the ball back in its urn, we do have some information here, since it was mentioned that the ball was blue. And we should use it. Again, the important point is that the sentence was not “we draw a ball and we put it back”, but “we draw a blue ball, and we put it back”. Now, if we do the maths, everything becomes simple, and clear (as usual).

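The question here is to compute

$$\mathbb{P}(B_2\mid B_1),$$

where $B_1$ and $B_2$ denote the events that the first and the second ball are blue. According to Bayes’ formula, it is

$$\mathbb{P}(B_2\mid B_1)=\frac{\mathbb{P}(B_1\cap B_2)}{\mathbb{P}(B_1)}.$$

Now, to compute those two probabilities, we have to condition on the urn (with $U_i$ the event that urn $i$ was selected),

$$\mathbb{P}(B_1\cap B_2)=\mathbb{P}(B_1\cap B_2\mid U_1)\,\mathbb{P}(U_1)+\mathbb{P}(B_1\cap B_2\mid U_2)\,\mathbb{P}(U_2).$$

Given the urn, since we replace the ball, the two draws are independent,

$$\mathbb{P}(B_1\cap B_2\mid U_i)=\mathbb{P}(B_1\mid U_i)\,\mathbb{P}(B_2\mid U_i)$$

i.e.

$$\mathbb{P}(B_1\cap B_2\mid U_i)=\mathbb{P}(B\mid U_i)^2,$$

where $\mathbb{P}(B\mid U_i)$ is the probability of drawing a blue ball from urn $i$. So if we substitute the numerical probabilities of getting a blue ball in the previous formula (the factors $\mathbb{P}(U_i)=1/2$ cancel), we get

$$\mathbb{P}(B_2\mid B_1)=\frac{\mathbb{P}(B\mid U_1)^2+\mathbb{P}(B\mid U_2)^2}{\mathbb{P}(B\mid U_1)+\mathbb{P}(B\mid U_2)}$$

which is not the same as

$$\mathbb{P}(B_2)=\frac{\mathbb{P}(B\mid U_1)+\mathbb{P}(B\mid U_2)}{2}.$$

Here, we get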

> {(15/25)^2+(10/25)^2}/((15/25)+(10/25))
[1] 0.52

which confirms our empirical 52%. Note that in the second case (where there was only 1 blue ball in one urn, and 24 in the other one), we get

> {(24/25)^2+(1/25)^2}/((24/25)+(1/25))
[1] 0.9232

which again is close to the empirical 92.4% we got.
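For readers who want to replay both settings in one go, here is a compact sketch that puts the closed-form value and a Monte Carlo estimate side by side. The function names `exact_cond` and `simulate_cond` are hypothetical, and the code assumes two equally likely urns of 25 balls each; it stores each draw as a logical (TRUE meaning blue) rather than as a colour label.

```r
# Hypothetical helpers (names are illustrative, not from the original code)

# Exact P(second blue | first blue) for two equally likely urns
exact_cond <- function(blue, total = 25) {
  p <- blue / total        # P(blue | urn) for each urn
  sum(p^2) / sum(p)        # the 1/2 prior weights cancel
}

# Monte Carlo estimate of the same conditional probability
simulate_cond <- function(blue, total = 25, n = 1e6, seed = 1) {
  set.seed(seed)
  p   <- blue / total
  urn <- sample(1:2, size = n, replace = TRUE)  # urn used in each trial
  b1  <- runif(n) < p[urn]                      # TRUE when the first ball is blue
  b2  <- runif(n) < p[urn]                      # TRUE when the second ball is blue
  mean(b2[b1])                                  # share of blue second balls among blue first balls
}

exact_cond(c(15, 10))      # 0.52
exact_cond(c(1, 24))       # 0.9232
simulate_cond(c(15, 10))   # should land near 0.52
simulate_cond(c(1, 24))    # should land near 0.9232
```

Both draws reuse the same `urn` vector, which is exactly where the dependence between the first and the second ball comes from.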

I strongly believe that the mis-intuition we might have here is close to the one we can observe in the Monty Hall paradox. And unless you write things down properly, it is difficult to conclude anything…
