# Simulating the Rule of Five

**Data Driven Security**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

SIRAcon, the annual conference for the Society of Information Risk
Analysts was held September 9th and 10th. I was fortunate enough to not
only attend, but I also spoke (on how to stop worrying and love math)
and moderated a fantastic panel with
Ali-Samad-Khan, Jack
Jones and Doug
Hubbard. It was a
fantastic 2-day conference and not just because of the speakers. The
small venue and schedule enabled multiple networking opportunities and
side conversations. One of my conversation was with Doug Hubbard where
the *Rule of Five* came up from his first book.

The “Rule of Five” states, “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.” And even though the math on it is fairly easy, I found it much more fun to think through how to simulate it. There may be times when the math is either too complex (relative term, I know), but creating a simulation may be the only option.

Let’s think through this, first we will need a population and the median of that population. We will draw our population from a normal distribution, but it could be any type of distribution, since we are working with the median, it doesn’t matter.

set.seed(1) # make it repeatable pop <- rnorm(1000) # generate the "population" med <- median(pop) # calculate the median

Next, we will want to repeatedly draw 5 random samples from the population and check if the median from the population is in the range (between the minimum and maximum) of the 5 samples. According to to setup, we should expect 93.75% of the samples to contain the median.

ssize <- 100000 # how many trials to run matched <- sapply(seq(ssize), function(i) { rg <- range(sample(pop, 5)) # get the range for 5 random samples ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE) # test if median in range }) sum(matched)/ssize # proportion matched ## [1] 0.9383

That’s pretty close to 93.75%, but what if we broke this up across multiple populations (and sizes) and used many iterations? Let’s create a simluation where we generate populations of varying sizes, capture a whole bunch of these and then plot the output.

# first set up a function to do the sampling pickfive <- function(popsize, ssize) { pop <- rnorm(popsize) med <- median(pop) matched <- sapply(seq(ssize), function(i) { rg <- range(sample(pop, 5)) ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE) }) sum(matched)/ssize } # test it pickfive(popsize=1000, ssize=100000) ## [1] 0.9381

Now we can create a loop to call the function over and over again.

# 1,000 to 1 million by a thousand set.seed(1) possible <- seq(1000, 1000000, by=1000) output <- sapply(possible, pickfive, 5000) # takes a while print(mean(output)) ## [1] 0.9377

And we can visualize the variations in the samples:

There are so many interesting concepts being touched on here. First,
this is a great example of the Law of Large
Numbers — the more
iterations we do, the closer to reality our estimation will be. Also,
the results will form a normal distribution around the true answer, no
matter what the underlying population looks like, that’s why taking the
mean works. Finally, this also hints at a concept underlying many
machine learning algorithms: combining multiple ‘weak predictors’ is
more accurate than one (or handful) of strong predictors. By taking the
mean of all our outputs (even the samples that are *way* off) we are
using all the samples to derive a more accurate estimate of true value.

### Oh yeah, the math…

The math here is so much simpler than all the stuff I did above (but so
much less exciting). Since we talking about the median, 50% of the
samples should be above and below the median value. This sets a
coin-toss analogy: if we say heads is above the median and tails is
below, what is the probability of flipping either 5 heads or 5 tails in
a row? It is the same thing, we want to know the probability that all 5 of
our samples will be above (or below) the median. The math is simply 50%
to the power of 5 for getting either 5 tails or heads in row.
So we can calculate it once and double it (once for heads, twice for tails).
Then we want to know what
the probability is of **not** getting heads, so we subtract it from 1.

# probability of getting 5 heads in a row prob_of_heads <- 0.5 ^ 5 # == 0.03125 # probability of getting 5 heads or 5 tails in a row prob_of_heads_or_tails <- 2 * prob_of_heads # == 0.0625 # probability of NOT getting 5 heads or tails in a row rule_of_five <- (1 - prob_of_heads_or_tails) # yup, 0.9375 or 93.75%

Or, if we want to wrap this whole discussion up into a single line of code:

print(1 - 2 * 0.5^5) ## [1] 0.9375

I told ya the math was easy! But simluations likes this are fun to do
and come in handy if you don’t know (or can’t remember) the math. Plus
with this, you should be to able to now go simulate the “Urn of Mystery”
(chapter 3, 3rd edition of *How to Measure
Anything*).
Or better yet, use this skill of simulation to finally prove to yourself
how the infamous Monte Hall
problem actually
works
out
like the math says it should.

**leave a comment**for the author, please follow the link and comment on their blog:

**Data Driven Security**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.