Simulating the Rule of Five

[This article was first published on Data Driven Security, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

SIRAcon, the annual conference for the Society of Information Risk
Analysts was held September 9th and 10th. I was fortunate enough to not
only attend, but I also spoke (on how to stop worrying and love math)
and moderated a fantastic panel with
Ali-Samad-Khan, Jack
and Doug
. It was a
fantastic 2-day conference and not just because of the speakers. The
small venue and schedule enabled multiple networking opportunities and
side conversations. One of my conversation was with Doug Hubbard where
the Rule of Five came up from his first book.

The “Rule of Five” states, “There is a 93.75% chance that the median of
a population is between the smallest and largest values in any random
sample of five from that population.” And even though the math on it is
fairly easy, I found it much more fun to think through how to simulate
it. There may be times when the math is either too complex (relative
term, I know), but creating a simulation may be the only option.

Let’s think through this, first we will need a population and the median
of that population. We will draw our population from a normal
distribution, but it could be any type of distribution, since we are
working with the median, it doesn’t matter.

set.seed(1)  # make it repeatable
pop <- rnorm(1000)  # generate the "population"
med <- median(pop) # calculate the median

Next, we will want to repeatedly draw 5 random samples from the
population and check if the median from the population is in the range
(between the minimum and maximum) of the 5 samples. According to to
setup, we should expect 93.75% of the samples to contain the median.

ssize <- 100000 # how many trials to run
matched <- sapply(seq(ssize), function(i) {
  rg <- range(sample(pop, 5)) # get the range for 5 random samples
  ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE) # test if median in range
sum(matched)/ssize # proportion matched

## [1] 0.9383

That’s pretty close to 93.75%, but what if we broke this up across
multiple populations (and sizes) and used many iterations? Let’s create
a simluation where we generate populations of varying sizes, capture a
whole bunch of these and then plot the output.

# first set up a function to do the sampling
pickfive <- function(popsize, ssize) {
  pop <- rnorm(popsize)
  med <- median(pop)
  matched <- sapply(seq(ssize), function(i) {
     rg <- range(sample(pop, 5))
     ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE)
# test it
pickfive(popsize=1000, ssize=100000)

## [1] 0.9381

Now we can create a loop to call the function over and over again.

# 1,000 to 1 million by a thousand
possible <- seq(1000, 1000000, by=1000)
output <- sapply(possible, pickfive, 5000) # takes a while

## [1] 0.9377

And we can visualize the variations in the samples:

ruleof5 plot

There are so many interesting concepts being touched on here. First,
this is a great example of the Law of Large
— the more
iterations we do, the closer to reality our estimation will be. Also,
the results will form a normal distribution around the true answer, no
matter what the underlying population looks like, that’s why taking the
mean works. Finally, this also hints at a concept underlying many
machine learning algorithms: combining multiple ‘weak predictors’ is
more accurate than one (or handful) of strong predictors. By taking the
mean of all our outputs (even the samples that are way off) we are
using all the samples to derive a more accurate estimate of true value.

Oh yeah, the math…

The math here is so much simpler than all the stuff I did above (but so
much less exciting). Since we talking about the median, 50% of the
samples should be above and below the median value. This sets a
coin-toss analogy: if we say heads is above the median and tails is
below, what is the probability of flipping either 5 heads or 5 tails in
a row? It is the same thing, we want to know the probability that all 5 of
our samples will be above (or below) the median. The math is simply 50%
to the power of 5 for getting either 5 tails or heads in row.
So we can calculate it once and double it (once for heads, twice for tails).
Then we want to know what
the probability is of not getting heads, so we subtract it from 1.

# probability of getting 5 heads in a row
prob_of_heads <- 0.5 ^ 5  # == 0.03125

# probability of getting 5 heads or 5 tails in a row
prob_of_heads_or_tails <- 2 * prob_of_heads # == 0.0625

# probability of NOT getting 5 heads or tails in a row
rule_of_five <- (1 - prob_of_heads_or_tails)  # yup, 0.9375 or 93.75%

Or, if we want to wrap this whole discussion up into a single line of code:

print(1 - 2 * 0.5^5)

## [1] 0.9375

I told ya the math was easy! But simluations likes this are fun to do
and come in handy if you don’t know (or can’t remember) the math. Plus
with this, you should be to able to now go simulate the “Urn of Mystery”
(chapter 3, 3rd edition of How to Measure
Or better yet, use this skill of simulation to finally prove to yourself
how the infamous Monte Hall

like the math says it should.

To leave a comment for the author, please follow the link and comment on their blog: Data Driven Security. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)