Everybody is talking about big data but the real skill lies in the art of inferring useful information from only a handful of values!

If you want to learn how to determine the range of the typical value of a dataset (i.e. the median) with just five values and why this works, read on!

This blog post is inspired by a chapter from the wonderful book “Alles kein Zufall! Liebe, Geld, Fußball” (“No coincidence! Love, Money, Football”, only available in German at the moment) by my colleague Professor Christian Hesse from the University of Stuttgart, Germany.

Let us dive directly into the matter. The Small Data Rule states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

The median is the “middle value” and thereby a good representation of a population’s “typical value”. It is calculated by sorting all of the values and then dividing them into two halves of the same size. The value that lies exactly between those two halves is the median. In contrast to the mean (often simply called the “average”), the median is robust with regard to outliers:

x <- 0:10
median(x)
## [1] 5

mean(x)
## [1] 5

x <- c(0:9, 10000)
median(x)
## [1] 5

mean(x)
## [1] 913.1818


Obviously, the median is quite useful for getting a quick overview of a large dataset. So, it seems almost magical that you could bracket it with just five randomly drawn values. Yet, the rationale is quite straightforward:

The probability that a value drawn at random from a population lies above the median is 50 percent, or 1/2. The probability that all five independently drawn values lie above the median is therefore 1/2 x 1/2 x 1/2 x 1/2 x 1/2 = 1/32. By symmetry, this is also the probability that all five values lie below the median. To cover both cases, just add those two probabilities.

But we are interested in the complementary event, i.e. that at least one value lies on each side of the median so that we get an interval that encloses it. We get that by subtracting the above probability from one:

1-2*(0.5^5)
## [1] 0.9375


The result is a high degree of certainty of nearly 94% that this will indeed be the case!
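The same argument carries over to any sample size n: the chance that all n values fall on one side of the median is 2 x (1/2)^n. As a minimal sketch (the helper name `coverage` is made up for illustration, and independent draws from a continuous population are assumed), the certainty for different sample sizes can be tabulated like this:

```r
# Hypothetical helper: probability that the range of n independent draws
# encloses the population median (continuous population assumed)
coverage <- function(n) 1 - 2 * 0.5^n

n <- 2:10
data.frame(n = n, coverage = coverage(n))
```

With only two values the certainty is 50 percent, with five it is the 93.75 percent from above, and from eight values onwards it exceeds 99 percent.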

If you don’t believe this, let us conduct a little experiment for illustrative purposes. We enumerate all possibilities of drawing five distinct values from the integers zero to one hundred and see how often the median (= 50) falls strictly between the minimum and the maximum of the sample (to understand how to do this, this post might be helpful: Learning R: Permutations and Combinations with Base R).

Beware, the following code will run for quite a while (about three to four minutes on an average computer) because there are nearly 80 million possibilities that have to be created and after that evaluated:

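# all combinations of five distinct values from 0..100, one per column
M <- combn(0:100, 5)
# TRUE where the sample range strictly encloses the median 50 (\(x) needs R >= 4.1)
between <- apply(M, 2, \(x) min(x) < 50 && max(x) > 50)
# share of combinations whose range encloses the median
sum(between) / ncol(M)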
## [1] 0.9406869


As you can see: 94% indeed! (The result is not exactly the 0.9375 from above because here we draw distinct values from a small, finite population. The larger the underlying population, the closer the proportion gets to 0.9375.)
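If you do not want to wait several minutes, the same proportion can also be obtained by counting instead of enumerating. The following is only a sketch under the assumptions of the experiment (population 0, 1, ..., N with N even, values drawn without replacement); the function name `exact_coverage` is made up for illustration:

```r
# Hypothetical helper: exact coverage for the population 0, 1, ..., N (N even),
# whose median is N / 2, when k distinct values are drawn
exact_coverage <- function(N, k = 5) {
  # choose(N / 2 + 1, k) counts the samples lying entirely at or below the median;
  # by symmetry the same number of samples lie entirely at or above it
  1 - 2 * choose(N / 2 + 1, k) / choose(N + 1, k)
}

exact_coverage(100)                        # the enumerated case from above
exact_coverage(c(100, 1000, 10000, 1e6))   # approaches 0.9375 as N grows
```

The first call should reproduce the enumerated proportion, and the second illustrates how the value moves towards 1 - 2 * 0.5^5 as the population gets larger.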

Professor Hesse gives a nice example of how to use the small data rule in practice:

The manager of a company is interested in the distance his employees have to commute to work. He plans to open another branch if the distances are too long for many. He could, of course, ask his entire staff about the distance to their place of residence. That would be costly, generate a lot of data and provide more information than the manager actually needs. Instead, he surveys only five randomly selected employees. They live 7, 19, 13, 18, and 9 km away from the company. Thus, the manager can be 94 percent sure that the median commuting distance of his employees lies between 7 and 19 kilometres. He considers this acceptable and decides against an additional location.

As an aside, not many people know the range function, which comes in handy in contexts like these:

range(c(7, 19, 13, 18, 9))
## [1]  7 19
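To see the rule at work in a setting like the manager's, one can simulate many such mini-surveys. The following is a rough sketch only: the population of 2000 commuting distances is entirely made up (drawn from a right-skewed gamma distribution) and all variable names are chosen for illustration.

```r
set.seed(42)  # for reproducibility

# Hypothetical population: commuting distances (km) of 2000 employees,
# drawn from a right-skewed gamma distribution (an assumption, not real data)
population <- rgamma(2000, shape = 2, scale = 6)
pop_median <- median(population)

# Survey five random employees many times and check whether
# the sample range encloses the population median
n_sim <- 50000
hits <- replicate(n_sim, {
  s <- sample(population, 5)
  min(s) < pop_median && max(s) > pop_median
})

mean(hits)  # share of surveys whose range encloses the median
```

The share should come out close to 0.94 even though the population is strongly skewed, because the argument relies only on each value being equally likely to fall above or below the median.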


So you see, small data can help you determine the big picture!

For another handy tool to infer whether something unusual is going on see this post: 3.84 or: How to Detect BS (Fast).