Yet another visualization of the Bayesian Beta-Binomial model



The Beta-Binomial model is the “hello world” of Bayesian statistics. That is, it’s the first model you get to run, often before you even know what you are doing. There are many reasons for this:

  • It only has one parameter, the underlying proportion of success, so it’s easy to visualize and reason about.
  • It’s easy to come up with a scenario where it can be used, for example: “What is the proportion of patients that will be cured by this drug?”
  • The model can be computed analytically (no need for any messy MCMC); see the short sketch right after this list.
  • It’s relatively easy to come up with an informative prior for the underlying proportion.
  • Most importantly: It’s fun to see some results before diving into the theory! 😁
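
To make the "computed analytically" point concrete, here is a minimal sketch of the conjugate update for the data used below (two successes and four failures) with a uniform Beta(1, 1) prior. The variable names are just for illustration:

prior_alpha <- 1  # Beta(1, 1), the uniform prior
prior_beta  <- 1
successes   <- 2
failures    <- 4

# Conjugacy: the posterior is Beta(prior_alpha + successes, prior_beta + failures)
post_alpha <- prior_alpha + successes
post_beta  <- prior_beta + failures

post_alpha / (post_alpha + post_beta)          # posterior mean: 0.375
qbeta(c(0.025, 0.975), post_alpha, post_beta)  # 95% interval, roughly [0.10, 0.71]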

That’s why I also introduced the Beta-Binomial model as the first model in my DataCamp course Fundamentals of Bayesian Data Analysis in R, and quite a lot of people have asked me for the code I used to visualize it. Scroll to the bottom of this post if that’s what you want; otherwise, here is how I visualized the Beta-Binomial in my course, given two successes and four failures:

The function that produces these plots is called prop_model (prop as in proportion) and takes a vector of TRUEs and FALSEs representing successes and failures. The visualization is created using the excellent ggridges package (previously called ggjoy). Here’s how you would use prop_model to produce the last plot in the animation above:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">data <span style="color: #666666"><-</span> c(<span style="color: #008000; font-weight: bold">FALSE</span>, <span style="color: #008000; font-weight: bold">TRUE</span>, <span style="color: #008000; font-weight: bold">FALSE</span>, <span style="color: #008000; font-weight: bold">FALSE</span>, <span style="color: #008000; font-weight: bold">FALSE</span>, <span style="color: #008000; font-weight: bold">TRUE</span>)
prop_model(data)
</pre></div>

The result is, I think, a quite nice visualization of how the model’s knowledge about the parameter changes as data arrives. At n=0 the model doesn’t know anything and — as the default prior states that it’s equally likely the proportion of success is anything from 0.0 to 1.0 — the result is a big, blue, and uniform square. As more data arrives the probability distribution becomes more concentrated, with the final posterior distribution at n=6.

Some added features of prop_model are that it also plots larger datasets somewhat gracefully and that it returns a random sample from the posterior that can be further explored. For example:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">big_data <span style="color: #666666"><-</span> sample(c(<span style="color: #008000; font-weight: bold">TRUE</span>, <span style="color: #008000; font-weight: bold">FALSE</span>), prob <span style="color: #666666">=</span> c(<span style="color: #666666">0.75</span>, <span style="color: #666666">0.25</span>),
                   size <span style="color: #666666">=</span> <span style="color: #666666">100</span>, replace <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)
posterior <span style="color: #666666"><-</span> prop_model(big_data)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">quantile(posterior, c(<span style="color: #666666">0.025</span>, <span style="color: #666666">0.5</span>, <span style="color: #666666">0.975</span>))
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## 2.5%  50%  98% </span>
<span style="color: #408080; font-style: italic">## 0.68 0.77 0.84</span>
</pre></div>

So here we calculated that the underlying proportion of success is most likely 0.77, with a 95% credible interval of [0.68, 0.84] (which nicely includes the correct value of 0.75 that we used to simulate big_data).
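
Because prop_model returns plain draws from the posterior, any other summary is a one-liner. For example (the 0.7 threshold below is picked purely for illustration):

mean(posterior)        # posterior mean of the proportion
mean(posterior > 0.7)  # posterior probability that the proportion exceeds 0.7
hist(posterior)        # quick look at the whole posterior distribution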

To be clear, prop_model is not intended as anything serious; it’s just meant as a nice way of exploring the Beta-Binomial model when learning Bayesian statistics, maybe as part of a workshop exercise.

The prop_model function

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># This function takes a number of successes and failuers coded as a TRUE/FALSE</span>
<span style="color: #408080; font-style: italic"># or 0/1 vector. This should be given as the data argument.</span>
<span style="color: #408080; font-style: italic"># The result is a visualization of the how a Beta-Binomial</span>
<span style="color: #408080; font-style: italic"># model gradualy learns the underlying proportion of successes </span>
<span style="color: #408080; font-style: italic"># using this data. The function also returns a sample from the</span>
<span style="color: #408080; font-style: italic"># posterior distribution that can be further manipulated and inspected.</span>
<span style="color: #408080; font-style: italic"># The default prior is a Beta(1,1) distribution, but this can be set using the</span>
<span style="color: #408080; font-style: italic"># prior_prop argument.</span>

<span style="color: #408080; font-style: italic"># Make sure the packages tidyverse and ggridges are installed, otherwise run:</span>
<span style="color: #408080; font-style: italic"># install.packages(c("tidyverse", "ggridges"))</span>

<span style="color: #408080; font-style: italic"># Example usage:</span>
<span style="color: #408080; font-style: italic"># data <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)</span>
<span style="color: #408080; font-style: italic"># prop_model(data)</span>
prop_model <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(data <span style="color: #666666">=</span> c(), prior_prop <span style="color: #666666">=</span> c(<span style="color: #666666">1</span>, <span style="color: #666666">1</span>), n_draws <span style="color: #666666">=</span> <span style="color: #666666">10000</span>) {
  library(tidyverse)
  data <span style="color: #666666"><-</span> as.logical(data)
  <span style="color: #408080; font-style: italic"># data_indices decides what densities to plot between the prior and the posterior</span>
  <span style="color: #408080; font-style: italic"># For 20 datapoints and less we're plotting all of them.</span>
  data_indices <span style="color: #666666"><-</span> round(seq(<span style="color: #666666">0</span>, length(data), length.out <span style="color: #666666">=</span> min(length(data) <span style="color: #666666">+</span> <span style="color: #666666">1</span>, <span style="color: #666666">20</span>)))

  <span style="color: #408080; font-style: italic"># dens_curves will be a data frame with the x & y coordinates for the </span>
  <span style="color: #408080; font-style: italic"># denities to plot where x = proportion_success and y = probability</span>
  proportion_success <span style="color: #666666"><-</span> c(<span style="color: #666666">0</span>, seq(<span style="color: #666666">0</span>, <span style="color: #666666">1</span>, length.out <span style="color: #666666">=</span> <span style="color: #666666">100</span>), <span style="color: #666666">1</span>)
  dens_curves <span style="color: #666666"><-</span> map_dfr(data_indices, <span style="color: #008000; font-weight: bold">function</span>(i) {
    value <span style="color: #666666"><-</span> ifelse(i <span style="color: #666666">==</span> <span style="color: #666666">0</span>, <span style="color: #BA2121">"Prior"</span>, ifelse(data[i], <span style="color: #BA2121">"Success"</span>, <span style="color: #BA2121">"Failure"</span>))
    label <span style="color: #666666"><-</span> paste0(<span style="color: #BA2121">"n="</span>, i)
    probability <span style="color: #666666"><-</span> dbeta(proportion_success,
                         prior_prop[<span style="color: #666666">1</span>] <span style="color: #666666">+</span> sum(data[seq_len(i)]),
                         prior_prop[<span style="color: #666666">2</span>] <span style="color: #666666">+</span> sum(<span style="color: #666666">!</span>data[seq_len(i)]))
    probability <span style="color: #666666"><-</span> probability <span style="color: #666666">/</span> max(probability)
    data_frame(value, label, proportion_success, probability)
  })
  <span style="color: #408080; font-style: italic"># Turning label and value into factors with the right ordering for the plot</span>
  dens_curves<span style="color: #666666">$</span>label <span style="color: #666666"><-</span> fct_rev(factor(dens_curves<span style="color: #666666">$</span>label, levels <span style="color: #666666">=</span>  paste0(<span style="color: #BA2121">"n="</span>, data_indices )))
  dens_curves<span style="color: #666666">$</span>value <span style="color: #666666"><-</span> factor(dens_curves<span style="color: #666666">$</span>value, levels <span style="color: #666666">=</span> c(<span style="color: #BA2121">"Prior"</span>, <span style="color: #BA2121">"Success"</span>, <span style="color: #BA2121">"Failure"</span>))

  p <span style="color: #666666"><-</span> ggplot(dens_curves, aes(x <span style="color: #666666">=</span> proportion_success, y <span style="color: #666666">=</span> label,
                               height <span style="color: #666666">=</span> probability, fill <span style="color: #666666">=</span> value)) <span style="color: #666666">+</span>
    ggridges<span style="color: #666666">::</span>geom_density_ridges(stat<span style="color: #666666">=</span><span style="color: #BA2121">"identity"</span>, color <span style="color: #666666">=</span> <span style="color: #BA2121">"white"</span>, alpha <span style="color: #666666">=</span> <span style="color: #666666">0.8</span>,
                                  panel_scaling <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>, size <span style="color: #666666">=</span> <span style="color: #666666">1</span>) <span style="color: #666666">+</span>
    scale_y_discrete(<span style="color: #BA2121">""</span>, expand <span style="color: #666666">=</span> c(<span style="color: #666666">0.01</span>, <span style="color: #666666">0</span>)) <span style="color: #666666">+</span>
    scale_x_continuous(<span style="color: #BA2121">"Underlying proportion of success"</span>) <span style="color: #666666">+</span>
    scale_fill_manual(values <span style="color: #666666">=</span> hcl(<span style="color: #666666">120</span> <span style="color: #666666">*</span> <span style="color: #666666">2:0</span> <span style="color: #666666">+</span> <span style="color: #666666">15</span>, <span style="color: #666666">100</span>, <span style="color: #666666">65</span>), name <span style="color: #666666">=</span> <span style="color: #BA2121">""</span>, drop <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">FALSE</span>,
                      labels <span style="color: #666666">=</span>  c(<span style="color: #BA2121">"Prior   "</span>, <span style="color: #BA2121">"Success   "</span>, <span style="color: #BA2121">"Failure   "</span>)) <span style="color: #666666">+</span>
    ggtitle(paste0(
      <span style="color: #BA2121">"Binomial model - Data: "</span>, sum(data),  <span style="color: #BA2121">" successes, "</span> , sum(<span style="color: #666666">!</span>data), <span style="color: #BA2121">" failures"</span>)) <span style="color: #666666">+</span>
    theme_light() <span style="color: #666666">+</span>
    theme(legend.position <span style="color: #666666">=</span> <span style="color: #BA2121">"top"</span>)
  print(p)

  <span style="color: #408080; font-style: italic"># Returning a sample from the posterior distribution that can be further </span>
  <span style="color: #408080; font-style: italic"># manipulated and inspected</span>
  posterior_sample <span style="color: #666666"><-</span> rbeta(n_draws, prior_prop[<span style="color: #666666">1</span>] <span style="color: #666666">+</span> sum(data), prior_prop[<span style="color: #666666">2</span>] <span style="color: #666666">+</span> sum(<span style="color: #666666">!</span>data))
  invisible(posterior_sample)
}
</pre></div>
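
As the comments above note, the prior_prop argument sets the Beta prior. Here is a quick sketch of running the same six data points with an informative prior instead of the uniform default; Beta(9, 3) is just an illustrative choice, centered around 0.75:

# Illustrative informative prior: Beta(9, 3), centered around 0.75
data <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
posterior <- prop_model(data, prior_prop = c(9, 3))
quantile(posterior, c(0.025, 0.5, 0.975))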
