Gosset part 2: small sample statistics
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Simulation was the key to to achieve world beer dominance.
‘Scientific’ Brewing at scale in the early 1900s
This post is an explainer about the small sample experiments performed by
William S. Gosset. This post contains some R code that simulates his simulations^{1}
and the resulting determination of the ideal sample size for inference.
If you brew your own beer, or if you want to know how many samples you need to
say something useful about your data, this post is for you.
I am a big fan of Gosset, look only at my work room:
and so I am absolutely fascinated with the work that Gosset did. In fact I think
he is the first data scientist.
If you are not interested in how beer is made, and just in the simulation go to the heading simulation.
Brewing beer (at scale)
One of the problems William S. Gosset worked on was determining the quality
of Malt. To brew beer you need at least 3
ingredients, yeast, hops and a cereal grain^{2}. On a high level it is very simple: You start with extracting the starch
from the grain into water. You then flavor the resulting sweet liquid with
hops and ferment that mixture with yeast. Put it in a barrel or bottles and you are done!
This is quite doable for small batches, but boy does it get difficult at scale!
Gosset’s work touched on all three ingredients of beer, but in this post I will look into
the cereal grain.
Beer, malts and beer technique
Now, beer brewing has been done since at least the 6th century BCE^{3}, and all the steps in beer brewing
have their own specialized names^{4} , which are different in all languages. So I will
be talking about malt, but remember that it is just a source of starch, a
source of sugars (I’m sorry chemists/ biologist, I have to simplify this a bit).
The source of starch in beer is barley,
this grain is first dried, cleaned and then wetted. This starts the biological
process of germination in the grain (it gets ready to start a new barley plant).
In that process some enzymes are made that will break down starch to smaller
sugars. The grains are now ready to start a new plant and, then we sort of kill that process by forced drying. So we end up with sad half germinated barley. This half germinated barley is called malt or
malted barley.
Grain, Barley, transforming into malt, as seen here.
Remember that we use the barley for its starch, which is converted into sugar. Sugar is converted by yeast into alcohol. Therefore the amount of sugar in the mixture determines the maximum amount of alcohol the
yeast can create. Higher alcohol content keeps the beer for longer
periods, but it also changes the taste of the beer.
The Guinness brewers wanted consistent taste and consistent alcohol levels.
Too high alcohol levels would increase the tax Guinness was paying the British
government. A lower level of alcohol would make
the beer go bad earlier and crucially, customers would go to a competitor, and
the customers might hate you.
So consistency is key if you want people to keep drinking your beer^{5}.
Malt and sugars
Guinness Malt in Gosset’s time, came from Irish and English Barley stock.
Since the sugar content in malt can vary from batch to batch (more or less sun, different amounts of rain etc.), and we determined that you want the exact same taste in all your beer, brewers need to
check the sugar content of the barley. In this time brewers were checking the sugar
content per batch manually by sniffing the barley, crumbling the material and visually
checking it.
However there are only so many brewers and checking every batch is not
scalable. The Guinness brewery was one of the
largest in the world and there were simply not enough master brewers.
The sugar content of malt extract was measured by degrees saccharine per barrel
of 168 pounds malt weight. In earlier experiments the brewers determined that
an extract around 133 degrees saccharine gave the desired alcohol level.
So you better make sure the malt extract sugar content is close to 133 degrees, if you want consistent beer.
In Gosset’s view, 0.5 degrees was a difference or error in malt extract level
which Guinness and its customers could accept.
“It might be maintained,” he said, that malt extract “should be [estimated]
within .5 of the true result with a probability of 10 to 1” – Gosset, according to Guinessometrics – Stephen T. Ziliak (see full reference below)
However how can we be certain that a barrel has a sugar content of 133 degrees?
They could take samples, and average those, but how many samples should you average
to give
enough certainty (that is with odds of 10 to 1 that the sample average is
within 0.5 degree of the real value)?
And every sample takes time and destroys a that sample from the barrel.
Simulation
So how do we find out how many samples you need, to have an accurate estimation
of your real value of the barrel?
Gosset and his co workers actually used simulation;
From one representative barrel of malt extract they had taken a lot of samples.
Gosset^{6} simulated what would
happen if they drew and averaged multiple samples.
 What if we take 2 samples and average that out, is that close enough to the real value?
 What if we take 3?
 etc?
By simulating these draws from one barrel^{7} they generalized this pattern to
what would happen if they sampled from the all of the barrels.
And that is what I will show in R code.
So we are talking about sampling, but maybe not this kind of sampling…
Simulating drawing from a barrel
First the goal in clearer language.
We want to know how many samples you have to draw to know what the real degree saccharine level of the barrel is. With a 10 to 1 odds of being within 0.5 degree.
In economic terms: You are running a large brewery and you want to make sure that the
beer you produce has the same alcohol content in every batch. Therefore you have
to check the barrels of malt extract. How many samples do you have to get from every barrel to be certain enough of its actual sugar content?
Let’s first transform that odds to probabilities, because I don’t know what
those odds mean^{8}. A 10 to 1 odds means ten successes and one failure, so it is really
10 out of 11 successes, 10/11 gives us a probability of approximately 0.909.
A coding example
Let’s create a vector of samples from the same barrel of malt extract,
sample from those samples, take the average, see if we are within the
range that Gosset defined, calculate how many times the sample was a correct
representation of the barrel and finally determine how many samples are enough.
library(tidyverse) # I am just lazy, purrr, and tibble and dplyr are enough
First we create a dataset.
Say we have a large amount of Malt extract
Like the brewers at Guinness we have taken many many many many samples from one barrel, and so we know
what the actual saccharine level of the barrel is. This is relevant for simulation but
don’t forget that we don’t know the truth when we are working with actual data.
So this degrees__sacharine
vector represents 3000 samples from one barrel of malt extract.
set.seed(7334) degrees_sacharine = rnorm(3000,mean = 133, sd = 0.6) # this is really just a guess
Then I create some functions to take a sample, a function to determine if that value is within the range, and finally I combine those functions.
take_sample_average < function(bag= degrees_sacharine, sample_size){ mean(sample(bag, sample_size, replace= TRUE)) } within_range < function(value){ ifelse(value > 132.5 & value < 133.5, TRUE, FALSE) } is_sample_within_range < function(sample_size){ take_sample_average(bag = degrees_sacharine, sample_size = sample_size) %>% within_range() }
For example:
So now we take 2 samples out of the bag, and get (142.2745, 119.4484)/2 = 142.2745.
Is this within the range of 132.5 and 133.5? no.
# and what if we take 3 samples? take_sample_average(bag = degrees_sacharine, 3) ## [1] 132.8
But how often, on average am I within the real value of the bag?
We simulate taking 2 to 15 samples from the barrel and averaging per sample.
sampling_experiments < tibble( N = seq_len(2500) ) %>% mutate( # there is probably a more concise way to do this sample_02 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 2)), sample_03 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 3)), sample_04 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 4)), sample_05 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 5)), sample_06 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 6)), sample_07 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 7)), sample_08 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 8)), sample_09 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 9)), sample_10 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 10)), sample_15 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 15)), sample_20 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 20)), )
So how many times are the samples within the range?
sampling_experiments %>% summarize_all(.funs = ~sum(.)/n()) %>% # this doesn't read that well # So I reshape the data, remove the N column and remove the sample_ part gather(key = "sample_size", value = "prob_in_range", sample_02:sample_20) %>% select(N) %>% mutate(sample_size = str_remove(sample_size,pattern = "sample_")) ## # A tibble: 11 x 2 ## sample_size prob_in_range #### 1 02 0.755 ## 2 03 0.854 ## 3 04 0.901 ## 4 05 0.932 ## 5 06 0.954 ## 6 07 0.966 ## 7 08 0.982 ## 8 09 0.986 ## 9 10 0.991 ## 10 15 0.999 ## 11 20 0.999
Gosset found in his experiments that he needed at least 4 samples for a
estimation with an odds of at least 10 to 1, which is a probability of
approximately 0.909.
In our case for our bag of estimations we would say we need at least 5 samples to
get these odds or better.
Practical results
Armed with this knowledge the Guinness brewery knew it could test the malt extract barrels
by taking 4 samples out of every batch to get an approximation of the true sugar
content of a batch that would be correct in 10 out of 11 times.
That meant that the brewery could use this technique to check the barrels of malt
extract in a chemical way, in stead of a master brewer sampling, and manually investigating the malt or barley.
You can scale the number of tests, but not the amount of brewers / checkers.
From beer to general statistics
The Guinness owners were happy, Gosset was probably too. But he realized there must be a
systematic way to determine how sure we can be about the values in the sample
compared with the true value. He took a year sabbatical to work with statistician
Karl Pierson on this problem. He found a relation that we can approximate the
‘true’ mean and standard deviation based on the sample mean and sample standard
deviation as a function of
the sample size.
And that is what we today, call the tdistribution.
Read more:
 Wikipedia page about Gosset
 The Guinness Brewer who revolutionized Statistics – Dan Kopf
 Student’s Collected Papers – Pearson E. S. 1943s
 Retrospectives: Guinnessometrics: The Economic Foundation of “Student’s” t – Stephen T. Ziliak
 Fascinating read about Malt extract
 wikipedia article about brewing beer
 wikipedia article about Student’s ttest
Image credits
 Hand full of sprouted malt, wikipedia commons, photographer Peter Schill
 detail of malt, author Pierrealain dorange
 Beer image from unsplash
 another barley image from unsplash
State of the machine
At the moment of creation (when I knitted this document ) this was the state of my machine: click here to expand
sessioninfo::session_info() ## ─ Session info ────────────────────────────────────────────────────────── ## setting value ## version R version 3.6.1 (20190705) ## os macOS Mojave 10.14.6 ## system x86_64, darwin15.6.0 ## ui X11 ## language (EN) ## collate en_US.UTF8 ## ctype en_US.UTF8 ## tz Europe/Amsterdam ## date 20191011 ## ## ─ Packages ────────────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 20190321 [1] CRAN (R 3.6.0) ## backports 1.1.5 20191002 [1] CRAN (R 3.6.0) ## blogdown 0.16 20191001 [1] CRAN (R 3.6.0) ## bookdown 0.14 20191001 [1] CRAN (R 3.6.0) ## broom 0.5.2 20190407 [1] CRAN (R 3.6.0) ## cellranger 1.1.0 20160727 [1] CRAN (R 3.6.0) ## cli 1.1.0 20190319 [1] CRAN (R 3.6.0) ## colorspace 1.41 20190318 [1] CRAN (R 3.6.0) ## crayon 1.3.4 20170916 [1] CRAN (R 3.6.0) ## digest 0.6.21 20190920 [1] CRAN (R 3.6.0) ## dplyr * 0.8.3 20190704 [1] CRAN (R 3.6.0) ## ellipsis 0.3.0 20190920 [1] CRAN (R 3.6.0) ## evaluate 0.14 20190528 [1] CRAN (R 3.6.0) ## fansi 0.4.0 20181005 [1] CRAN (R 3.6.0) ## forcats * 0.4.0 20190217 [1] CRAN (R 3.6.0) ## generics 0.0.2 20181129 [1] CRAN (R 3.6.0) ## ggplot2 * 3.2.1 20190810 [1] CRAN (R 3.6.0) ## glue 1.3.1 20190312 [1] CRAN (R 3.6.0) ## gtable 0.3.0 20190325 [1] CRAN (R 3.6.0) ## haven 2.1.1 20190704 [1] CRAN (R 3.6.0) ## hms 0.5.1 20190823 [1] CRAN (R 3.6.0) ## htmltools 0.4.0 20191004 [1] CRAN (R 3.6.0) ## httr 1.4.1 20190805 [1] CRAN (R 3.6.0) ## jsonlite 1.6 20181207 [1] CRAN (R 3.6.0) ## knitr 1.25 20190918 [1] CRAN (R 3.6.0) ## lattice 0.2038 20181104 [1] CRAN (R 3.6.1) ## lazyeval 0.2.2 20190315 [1] CRAN (R 3.6.0) ## lifecycle 0.1.0 20190801 [1] CRAN (R 3.6.0) ## lubridate 1.7.4 20180411 [1] CRAN (R 3.6.0) ## magrittr 1.5 20141122 [1] CRAN (R 3.6.0) ## modelr 0.1.5 20190808 [1] CRAN (R 3.6.0) ## munsell 0.5.0 20180612 [1] CRAN (R 3.6.0) ## nlme 3.1141 20190801 [1] CRAN (R 3.6.0) ## pillar 1.4.2 20190629 [1] CRAN (R 3.6.0) ## pkgconfig 2.0.3 20190922 [1] CRAN (R 3.6.0) ## purrr * 0.3.2 20190315 [1] CRAN (R 3.6.0) ## R6 2.4.0 20190214 [1] CRAN (R 3.6.0) ## Rcpp 1.0.2 20190725 [1] CRAN (R 3.6.0) ## readr * 1.3.1 20181221 [1] CRAN (R 3.6.0) ## readxl 1.3.1 20190313 [1] CRAN (R 3.6.0) ## rlang 0.4.0 20190625 [1] CRAN (R 3.6.0) ## rmarkdown 1.16 20191001 [1] CRAN (R 3.6.0) ## rstudioapi 0.10 20190319 [1] CRAN (R 3.6.0) ## rvest 0.3.4 20190515 [1] CRAN (R 3.6.0) ## scales 1.0.0 20180809 [1] CRAN (R 3.6.0) ## sessioninfo 1.1.1 20181105 [1] CRAN (R 3.6.0) ## stringi 1.4.3 20190312 [1] CRAN (R 3.6.0) ## stringr * 1.4.0 20190210 [1] CRAN (R 3.6.0) ## tibble * 2.1.3 20190606 [1] CRAN (R 3.6.0) ## tidyr * 1.0.0 20190911 [1] CRAN (R 3.6.0) ## tidyselect 0.2.5 20181011 [1] CRAN (R 3.6.0) ## tidyverse * 1.2.1 20171114 [1] CRAN (R 3.6.0) ## utf8 1.1.4 20180524 [1] CRAN (R 3.6.0) ## vctrs 0.2.0 20190705 [1] CRAN (R 3.6.0) ## withr 2.1.2 20180315 [1] CRAN (R 3.6.0) ## xfun 0.10 20191001 [1] CRAN (R 3.6.0) ## xml2 1.2.2 20190809 [1] CRAN (R 3.6.0) ## yaml 2.2.0 20180725 [1] CRAN (R 3.6.0) ## zeallot 0.1.0 20180128 [1] CRAN (R 3.6.0) ## ## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

see what I did there?↩

If you are fancy you can add other cool ingredients, if you are purist and you keep to the 1516 ‘reinheidsgebod’ you only use barley, water and hops, yeast is necessary but not mentioned in the rules, because it’s existence wasn’t known, maybe the thought it was generatia spontanae or something?↩

The secret to create alcohol out of literally everything, has been rediscovered again and again!↩

from the Wikipedia article I found the words: wort=the sugary liquid, mashing = mixing malted barley with hot water, liquor = hot water with sugar in it, grist =crushed malt, sparging = washing of grains, lautering = separation of wort with grain itself↩

Heineken beer may taste like piss to some people but at least it consistently tastes like piss everywhere you drink it, I think I’m allowed to say that about a Dutch product right?↩

And coworkers, because boy this must have taken some time, and think about the calculations! all done by hand.↩

simulated because the samples had already been done so it was more a choose one of the values at random from this piece of paper↩

maybe this is a country specific thing, in the UK everyone seems to know
about odds and betting, but I can’t get my head around it↩
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.