Gosset part 2: small sample statistics

[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Simulation was the key to to achieve world beer dominance.

‘Scientific’ Brewing at scale in the early 1900s

Beer bottles cheers

Beer bottles cheers

This post is an explainer about the small sample experiments performed by William S. Gosset. This post contains some R code that simulates his simulations1 and the resulting determination of the ideal sample size for inference.

If you brew your own beer, or if you want to know how many samples you need to say something useful about your data, this post is for you.

I am a big fan of Gosset, look only at my work room:

Yes that is the wikipedia portret of William Gosset! and so I am absolutely fascinated with the work that Gosset did. In fact I think he is the first data scientist.

If you are not interested in how beer is made, and just in the simulation go to the heading simulation.

Brewing beer (at scale)

One of the problems William S. Gosset worked on was determining the quality of Malt. To brew beer you need at least 3 ingredients, yeast, hops and a cereal grain2. On a high level it is very simple: You start with extracting the starch from the grain into water. You then flavor the resulting sweet liquid with hops and ferment that mixture with yeast. Put it in a barrel or bottles and you are done! This is quite doable for small batches, but boy does it get difficult at scale!

Gosset’s work touched on all three ingredients of beer, but in this post I will look into the cereal grain.

Beer, malts and beer technique

Now, beer brewing has been done since at least the 6th century BCE3, and all the steps in beer brewing have their own specialized names4 , which are different in all languages. So I will be talking about malt, but remember that it is just a source of starch, a source of sugars (I’m sorry chemists/ biologist, I have to simplify this a bit).

The source of starch in beer is barley, this grain is first dried, cleaned and then wetted. This starts the biological process of germination in the grain (it gets ready to start a new barley plant). In that process some enzymes are made that will break down starch to smaller sugars. The grains are now ready to start a new plant and, then we sort of kill that process by forced drying. So we end up with sad half germinated barley. This half germinated barley is called malt or malted barley.

Grain, Barley, transforming into malt, as seen here. Malt and grain, detail of individual sprouted malt

Remember that we use the barley for its starch, which is converted into sugar. Sugar is converted by yeast into alcohol. Therefore the amount of sugar in the mixture determines the maximum amount of alcohol the yeast can create. Higher alcohol content keeps the beer for longer periods, but it also changes the taste of the beer. The Guinness brewers wanted consistent taste and consistent alcohol levels.

Too high alcohol levels would increase the tax Guinness was paying the British government. A lower level of alcohol would make the beer go bad earlier and crucially, customers would go to a competitor, and the customers might hate you.

Customers would maybe go to a competitor when I lower the alcohol content?

Customers would maybe go to a competitor when I lower the alcohol content?

So consistency is key if you want people to keep drinking your beer5.

Malt and sugars

Barley in the sun Guinness Malt in Gosset’s time, came from Irish and English Barley stock. Since the sugar content in malt can vary from batch to batch (more or less sun, different amounts of rain etc.), and we determined that you want the exact same taste in all your beer, brewers need to check the sugar content of the barley. In this time brewers were checking the sugar content per batch manually by sniffing the barley, crumbling the material and visually checking it.

However there are only so many brewers and checking every batch is not scalable. The Guinness brewery was one of the largest in the world and there were simply not enough master brewers.

The sugar content of malt extract was measured by degrees saccharine per barrel of 168 pounds malt weight. In earlier experiments the brewers determined that an extract around 133 degrees saccharine gave the desired alcohol level.

So you better make sure the malt extract sugar content is close to 133 degrees, if you want consistent beer. In Gosset’s view, 0.5 degrees was a difference or error in malt extract level which Guinness and its customers could accept.

“It might be maintained,” he said, that malt extract “should be [estimated] within .5 of the true result with a probability of 10 to 1” – Gosset, according to Guinessometrics – Stephen T. Ziliak (see full reference below)

However how can we be certain that a barrel has a sugar content of 133 degrees?

They could take samples, and average those, but how many samples should you average to give enough certainty (that is with odds of 10 to 1 that the sample average is within 0.5 degree of the real value)? And every sample takes time and destroys a that sample from the barrel.


Malt, green, a handful of green malt So how do we find out how many samples you need, to have an accurate estimation of your real value of the barrel? Gosset and his co workers actually used simulation; From one representative barrel of malt extract they had taken a lot of samples. Gosset6 simulated what would happen if they drew and averaged multiple samples.

  • What if we take 2 samples and average that out, is that close enough to the real value?
  • What if we take 3?
  • etc?

By simulating these draws from one barrel7 they generalized this pattern to what would happen if they sampled from the all of the barrels.

And that is what I will show in R code.

So we are talking about sampling, but maybe not this kind of sampling… Randy Marshal sampling beer and wine (South Park)

Simulating drawing from a barrel

First the goal in clearer language. We want to know how many samples you have to draw to know what the real degree saccharine level of the barrel is. With a 10 to 1 odds of being within 0.5 degree.

In economic terms: You are running a large brewery and you want to make sure that the beer you produce has the same alcohol content in every batch. Therefore you have to check the barrels of malt extract. How many samples do you have to get from every barrel to be certain enough of its actual sugar content?

Let’s first transform that odds to probabilities, because I don’t know what those odds mean8. A 10 to 1 odds means ten successes and one failure, so it is really 10 out of 11 successes, 10/11 gives us a probability of approximately 0.909.

A coding example

Let’s create a vector of samples from the same barrel of malt extract, sample from those samples, take the average, see if we are within the range that Gosset defined, calculate how many times the sample was a correct representation of the barrel and finally determine how many samples are enough.

library(tidyverse) # I am just lazy, purrr, and tibble and dplyr are enough

First we create a dataset.
Say we have a large amount of Malt extract Like the brewers at Guinness we have taken many many many many samples from one barrel, and so we know what the actual saccharine level of the barrel is. This is relevant for simulation but don’t forget that we don’t know the truth when we are working with actual data.

So this degrees__sacharine vector represents 3000 samples from one barrel of malt extract.

degrees_sacharine = rnorm(3000,mean = 133, sd = 0.6) # this is really just a guess

Then I create some functions to take a sample, a function to determine if that value is within the range, and finally I combine those functions.

take_sample_average <- function(bag= degrees_sacharine, sample_size){
  mean(sample(bag, sample_size, replace= TRUE))
within_range <- function(value){
  ifelse(value > 132.5 & value < 133.5, TRUE, FALSE)
is_sample_within_range <- function(sample_size){
  take_sample_average(bag = degrees_sacharine, sample_size = sample_size) %>%

For example: So now we take 2 samples out of the bag, and get (142.2745, 119.4484)/2 = 142.2745. Is this within the range of 132.5 and 133.5? no.

#  and what if we take 3 samples?
take_sample_average(bag = degrees_sacharine, 3)
## [1] 132.8

But how often, on average am I within the real value of the bag? We simulate taking 2 to 15 samples from the barrel and averaging per sample.

sampling_experiments <-  
  N = seq_len(2500)
) %>%
  mutate( # there is probably a more concise way to do this
    sample_02 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 2)),
    sample_03 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 3)),
    sample_04 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 4)),
    sample_05 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 5)),
    sample_06 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 6)),
    sample_07 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 7)),
    sample_08 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 8)),
    sample_09 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 9)),
    sample_10 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 10)),
    sample_15 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 15)),
    sample_20 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 20)),

So how many times are the samples within the range?

sampling_experiments %>%
  summarize_all(.funs = ~sum(.)/n()) %>% # this doesn't read that well
  # So I reshape the data, remove the N column and remove the sample_ part
  gather(key = "sample_size", value = "prob_in_range", sample_02:sample_20) %>%
  select(-N) %>%
  mutate(sample_size = str_remove(sample_size,pattern =  "sample_"))
## # A tibble: 11 x 2
##    sample_size prob_in_range
##    <chr>               <dbl>
##  1 02                  0.755
##  2 03                  0.854
##  3 04                  0.901
##  4 05                  0.932
##  5 06                  0.954
##  6 07                  0.966
##  7 08                  0.982
##  8 09                  0.986
##  9 10                  0.991
## 10 15                  0.999
## 11 20                  0.999

Gosset found in his experiments that he needed at least 4 samples for a estimation with an odds of at least 10 to 1, which is a probability of approximately 0.909.

In our case for our bag of estimations we would say we need at least 5 samples to get these odds or better.

Practical results

Armed with this knowledge the Guinness brewery knew it could test the malt extract barrels by taking 4 samples out of every batch to get an approximation of the true sugar content of a batch that would be correct in 10 out of 11 times.

That meant that the brewery could use this technique to check the barrels of malt extract in a chemical way, in stead of a master brewer sampling, and manually investigating the malt or barley. You can scale the number of tests, but not the amount of brewers / checkers.

From beer to general statistics

The Guinness owners were happy, Gosset was probably too. But he realized there must be a systematic way to determine how sure we can be about the values in the sample compared with the true value. He took a year sabbatical to work with statistician Karl Pierson on this problem. He found a relation that we can approximate the ‘true’ mean and standard deviation based on the sample mean and sample standard deviation as a function of the sample size.

And that is what we today, call the t-distribution.

Image credits

State of the machine

At the moment of creation (when I knitted this document ) this was the state of my machine: click here to expand

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       macOS Mojave 10.14.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Amsterdam            
##  date     2019-10-11                  
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
##  blogdown      0.16    2019-10-01 [1] CRAN (R 3.6.0)
##  bookdown      0.14    2019-10-01 [1] CRAN (R 3.6.0)
##  broom         0.5.2   2019-04-07 [1] CRAN (R 3.6.0)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.0)
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  digest        0.6.21  2019-09-20 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.6.0)
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.6.0)
##  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.6.0)
##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.0)
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.1.1   2019-07-04 [1] CRAN (R 3.6.0)
##  hms           0.5.1   2019-08-23 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.0)
##  knitr         1.25    2019-09-18 [1] CRAN (R 3.6.0)
##  lattice       0.20-38 2018-11-04 [1] CRAN (R 3.6.1)
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.0)
##  lifecycle     0.1.0   2019-08-01 [1] CRAN (R 3.6.0)
##  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.6.0)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  modelr        0.1.5   2019-08-08 [1] CRAN (R 3.6.0)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-141 2019-08-01 [1] CRAN (R 3.6.0)
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.2   2019-03-15 [1] CRAN (R 3.6.0)
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.0)
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.6.0)
##  rmarkdown     1.16    2019-10-01 [1] CRAN (R 3.6.0)
##  rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)
##  rvest         0.3.4   2019-05-15 [1] CRAN (R 3.6.0)
##  scales        1.0.0   2018-08-09 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
##  tidyr       * 1.0.0   2019-09-11 [1] CRAN (R 3.6.0)
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.6.0)
##  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.0)
##  vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.6.0)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
##  xfun          0.10    2019-10-01 [1] CRAN (R 3.6.0)
##  xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.0)
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
##  zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.6.0)
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

  1. see what I did there?

  2. If you are fancy you can add other cool ingredients, if you are purist and you keep to the 1516 ‘reinheidsgebod’ you only use barley, water and hops, yeast is necessary but not mentioned in the rules, because it’s existence wasn’t known, maybe the thought it was generatia spontanae or something?

  3. The secret to create alcohol out of literally everything, has been rediscovered again and again!

  4. from the Wikipedia article I found the words: wort=the sugary liquid, mashing = mixing malted barley with hot water, liquor = hot water with sugar in it, grist =crushed malt, sparging = washing of grains, lautering = separation of wort with grain itself

  5. Heineken beer may taste like piss to some people but at least it consistently tastes like piss everywhere you drink it, I think I’m allowed to say that about a Dutch product right?

  6. And coworkers, because boy this must have taken some time, and think about the calculations! all done by hand.

  7. simulated because the samples had already been done so it was more a choose one of the values at random from this piece of paper

  8. maybe this is a country specific thing, in the UK everyone seems to know about odds and betting, but I can’t get my head around it

To leave a comment for the author, please follow the link and comment on their blog: Category R on Roel's R-tefacts.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)