# A Stan case study, sort of: The probability my son will be stung by a bumblebee

**Publishable Stuff**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Stan project for statistical computation has a great collection of curated case studies which anybody can contribute to, maybe even me, I was thinking. But I don’t have time to worry about that right now because I’m on vacation, being on the yearly visit to my old family home in the north of Sweden.

What I *do* worry about is that my son will be stung by a bumblebee. His name is Torsten, he’s almost two years old, and he loves running around barefoot on the big lawn. Which has its fair share of bumblebees. Maybe I should put shoes on him so he wont step on one, but what are the chances, really.

Well, what *are* the chances? I guess if I only had

- Data on the bumblebee density of the lawn.
- Data on the size of Torsten’s feet and how many steps he takes when running around.
- A reasonable Bayesian model, maybe implemented in Stan.

I could figure that out. “How hard can it be?”, I thought. And so I made an attempt.

## Getting the data

To get some data on bumblebee density I marked out a 1 m² square on a representative part of the lawn. During the course of the day, now and then, I counted up how many bumblebees sat in the square.

Most of the time I saw zero bumblebees, but 1 m² is not that large of an area. Let’s put the data into R:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">bumblebees <span style="color: #666666"><-</span> c(<span style="color: #666666">1</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">1</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">1</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">1</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">1</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>, <span style="color: #666666">0</span>) </pre></div>

During the same day I kept an eye on Torsten, and sometimes when he was engaged in *active play* I started a 60 s. timer and counted how many steps he took while running around. Here are the counts:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">toddler_steps <span style="color: #666666"><-</span> c(<span style="color: #666666">26</span>, <span style="color: #666666">16</span>, <span style="color: #666666">37</span>, <span style="color: #666666">101</span>, <span style="color: #666666">12</span>, <span style="color: #666666">122</span>, <span style="color: #666666">90</span>, <span style="color: #666666">55</span>, <span style="color: #666666">56</span>, <span style="color: #666666">39</span>, <span style="color: #666666">55</span>, <span style="color: #666666">15</span>, <span style="color: #666666">45</span>, <span style="color: #666666">8</span>) </pre></div>

Finally I needed to know the area of Torsten’s feet. At a moment when he was *not* running around I put his foot on a piece of graph paper and traced its edge, which made him giggle. I then calculated how many of the 0.5 cm² squares were fully covered and how many were partially covered.

To estimate of the area of his foot I took the average of the number of full squares and the number partial squares, and then converted to m²:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">full_squares <span style="color: #666666"><-</span> <span style="color: #666666">174</span> partial_squares <span style="color: #666666"><-</span> <span style="color: #666666">251</span> squares <span style="color: #666666"><-</span> (full_squares <span style="color: #666666">+</span> partial_squares) <span style="color: #666666">/</span> <span style="color: #666666">2</span> foot_cm2 <span style="color: #666666"><-</span> squares <span style="color: #666666">/</span> <span style="color: #666666">4</span> foot_m2 <span style="color: #666666"><-</span> foot_cm2 <span style="color: #666666">/</span> <span style="color: #666666">100^2</span> foot_m2 </pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] 0.0053125</span> </pre></div>

Turns out my son’s feet are about 0.0053 m² each.

## Specifying the model

The idea I had here was that *if* you know the average number of steps Torsten takes per minute while playing, and *if* you know the average number of bumblebees per m², *then* you could calculate the average area covered by Torsten’s tiny feet while running around, and then *finally* you could calculate the probability of a bumblebee being in that area. This would give you a probability for how likely Torsten is to be stung by a Bumblebee. However, while we have data on all of the above, we don’t know any of the above for certain.

So, to capture some of this uncertainty I thought I would whip together a small Bayesian model and, even though it’s a bit overkill in this case, why not fit it using Stan? (If you are unfamiliar with Bayesian data analysis and Stan I’ve made a short video tutorial you can find here.)

Let’s start our Stan program by declaring what data we have:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">data { int n_bumblebee_observations; int bumblebees[n_bumblebee_observations]; int n_toddler_steps_observations; int toddler_steps[n_toddler_steps_observations]; real foot_m2; } </pre></div>

Now we need to decide on a model for the number of `bumblebees`

in a m² and a model for the number of `toddler_steps`

in 60 s. For the bees I’m going with a simple *Poisson* model, that is, the model assumes there is an average number of bees per m² ($\mu_\text{bees}$) and that this is all there is to know about bees.

For the number of steps I’m going to step it up a notch with a *negative binomial* model. The negative binomial distribution can be viewed in a number of ways, but one particularly useful way is as an extension of the Poisson distribution where the mean number of steps for each outcome is not fixed, but instead is varying, where this is controlled by the precision parameter $\phi_\text{steps}$. The *larger* $\phi_\text{steps}$ is the *less* the mean number of steps for each outcome is going to vary around the overall mean $\mu_\text{steps}$. That is, when $\phi_\text{steps} \rightarrow \infty$ then the negative binomial becomes a Poisson distribution. The point with using the negative binomial is to capture that Torsten’s activity level when playing on the lawn is definitely *not* constant. Sometimes he can run full speed for minutes, but sometimes he spends a long time being fascinated by the same little piece of rubbish and then he’s not taking many steps.

So we almost have a Bayesian model for our data, we just need to put priors on the parameters $\mu_\text{bees}$, $\mu_\text{steps}$, and $\phi_\text{steps}$. All three parameters are positive and a quick n’ dirty prior when you have positive parameters is a *half-Cauchy* distribution centered at zero: It’s just like half a normal distribution centered at zero, but with a much fatter tail. It has one free parameter, the scale, which also happens to be the median of the half-Cauchy. Set the scale/median to a good guess for the parameter in question and you are ready to go! If your guess *is* good then this prior will give the parameter the slightest nudge in the right direction, if your guess is bad then, as long as you have enough data, the fat tail of the half-Cauchy distribution will help you save face.

My guess for $\mu_\text{bees}$ is (4.0 + 4.0) / (10 × 10) = 0.08. I base this on that while crossing the lawn I usually scan an area of about 10×10 m², I then usually see around four bees, and I assume I’m missing half of the bees. So eight bees on a 100 m² lawn gives 0.08 bees per m². My guess for the number of steps per minute is going to be 60. For the precision parameter $\phi_\text{steps}$ I have no clue, but it shouldn’t be too large nor to small, so my guess is going to be 1.0. Here are the priors:

With the priors specified we now have all the required parts and here is the resulting Bayesian model:

$$%

And here is the Stan code implementing the model above:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">parameters { real<span style="color: #666666"><</span>lower<span style="color: #666666">=0></span> mu_bees; real<span style="color: #666666"><</span>lower<span style="color: #666666">=0></span> mu_steps; real<span style="color: #666666"><</span>lower<span style="color: #666666">=0></span> precision_steps; } model { <span style="color: #408080; font-style: italic"># Since we have contrained the parameters to be positive we get implicit</span> <span style="color: #408080; font-style: italic"># half-cauchy distributions even if we declare them to be 'full'-cauchy.</span> mu_bees <span style="color: #666666">~</span> cauchy(<span style="color: #666666">0</span>, (<span style="color: #666666">4.0</span> <span style="color: #666666">+</span> <span style="color: #666666">4.0</span>) <span style="color: #666666">/</span> (<span style="color: #666666">10.0</span> <span style="color: #666666">*</span> <span style="color: #666666">10.0</span>) ); mu_steps <span style="color: #666666">~</span> cauchy(<span style="color: #666666">0</span>, <span style="color: #666666">60</span>); precision_steps <span style="color: #666666">~</span> cauchy(<span style="color: #666666">0</span>, <span style="color: #666666">1</span>); bumblebees <span style="color: #666666">~</span> poisson(mu_bees); toddler_steps <span style="color: #666666">~</span> neg_binomial_2(mu_steps, precision_steps); } </pre></div>

The final step is to predict how many bumblebees Torsten will step on during, say, one hour of active play. We do this in the `generated quantities`

code block in Stan. The code below will step through 60 minutes (a.k.a. one hour) and for each minute: (1) Sample `pred_steps`

from the negative binomial, (2) calculate the area (m²) covered by these steps, and (3) sample the number of bees in this area from the Poisson giving a predicted `stings_by_minute`

. Finally we sum these 60 minutes worth of strings into `stings_by_hour`

.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">generated quantities { int stings_by_minute[<span style="color: #666666">60</span>]; int stings_by_hour; int pred_steps; real stepped_area; <span style="color: #008000; font-weight: bold">for</span>(minute <span style="color: #008000; font-weight: bold">in</span> <span style="color: #666666">1:60</span>) { pred_steps <span style="color: #666666">=</span> neg_binomial_2_rng(mu_steps, precision_steps); stepped_area <span style="color: #666666">=</span> pred_steps <span style="color: #666666">*</span> foot_m2; stings_by_minute[minute] <span style="color: #666666">=</span> poisson_rng(mu_bees <span style="color: #666666">*</span> stepped_area); } stings_by_hour <span style="color: #666666">=</span> sum(stings_by_minute); } </pre></div>

## Looking at the result

We have data, we have a model, and now all we need to do is fit it. After we’ve put the whole Stan model into `bee_model_code`

as a string we do it like this using the `rstan`

interface:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># First we put all the data into a list</span> data_list <span style="color: #666666"><-</span> list( n_bumblebee_observations <span style="color: #666666">=</span> length(bumblebees), bumblebees <span style="color: #666666">=</span> bumblebees, n_toddler_steps_observations <span style="color: #666666">=</span> length(toddler_steps), toddler_steps <span style="color: #666666">=</span> toddler_steps, foot_m2 <span style="color: #666666">=</span> foot_m2 ) <span style="color: #408080; font-style: italic"># Then we fit the model which description is in bee_model_code .</span> library(rstan) fit <span style="color: #666666"><-</span> stan(model_code <span style="color: #666666">=</span> bee_model_code, data <span style="color: #666666">=</span> data_list) </pre></div>

With the model fitted, let’s take a look at the posterior probability distribution of our parameters.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">stan_hist(fit, c(<span style="color: #BA2121">"mu_bees"</span>, <span style="color: #BA2121">"mu_steps"</span>, <span style="color: #BA2121">"precision_steps"</span>)) print(fit, c(<span style="color: #BA2121">"mu_bees"</span>, <span style="color: #BA2121">"mu_steps"</span>, <span style="color: #BA2121">"precision_steps"</span>)) </pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat</span> <span style="color: #408080; font-style: italic">## mu_bees 0.12 0.00 0.05 0.04 0.08 0.11 0.15 0.23 3551 1</span> <span style="color: #408080; font-style: italic">## mu_steps 50.88 0.22 11.65 32.56 43.09 49.34 57.13 78.00 2801 1</span> <span style="color: #408080; font-style: italic">## precision_steps 1.80 0.01 0.68 0.78 1.32 1.70 2.17 3.41 2706 1</span> </pre></div>

This seems reasonable, *but* note that the distributions are pretty wide, which means there is a lot of uncertainty! For example, looking at `mu_bees`

it’s credible that there is everything from 0.04 bees/m² to 0.2 bees/m².

Anyway, let’s take a look at what we are really interested in, the predictive distribution over how many stings per hour Torsten will suffer during active play:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># Getting the sample representing the prob. distribution over stings/hour .</span> stings_by_hour <span style="color: #666666"><-</span> as.data.frame(fit)<span style="color: #666666">$</span>stings_by_hour <span style="color: #408080; font-style: italic"># Now actually calculating the prob of 0 stings, 1 stings, 2 stiongs, etc.</span> strings_probability <span style="color: #666666"><-</span> prop.table( table(stings_by_hour) ) <span style="color: #408080; font-style: italic"># Plotin' n' printin'</span> barplot(strings_probability, col <span style="color: #666666">=</span> <span style="color: #BA2121">"yellow"</span>, main <span style="color: #666666">=</span> <span style="color: #BA2121">"Posterior predictive stings per hour"</span>, xlab <span style="color: #666666">=</span> <span style="color: #BA2121">"Number of stings"</span>, ylab <span style="color: #666666">=</span> <span style="color: #BA2121">"Probability"</span>) </pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">round(strings_probability, <span style="color: #666666">2</span>) </pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## stings_by_hour</span> <span style="color: #408080; font-style: italic">## 0 1 2 3 4 5 6 7 8 9 10 </span> <span style="color: #408080; font-style: italic">## 0.27 0.30 0.22 0.12 0.05 0.02 0.01 0.00 0.00 0.00 0.00</span> </pre></div>

Ok, it seems like it is most probable that Torsten will receive one sting per hour, but we should not be surprised if it’s two or even three stings. I’d better put some shoes on him! The *problem* is that after a couple of days full of active barefoot play, my son Torsten’s feet look like this:

As you can see his feet are *not* swollen from all the bee stings they should receive according to the model. Actually, even after a week, he has not gotten a single bee sting! Which is good, I suppose, in a way, but, well, it means that my model is likely pretty crap. How can this be? Well,

- I should maybe have gotten more and better data. Perhaps the square I monitored for bumblebees didn’t yield data that was really representative of the bumblebee density in general.
- The assumption that bumblebees always sting when stepped upon might be wrong. Or maybe Torsten is so smart so that he actively avoids them…
- Maybe the model was too simplistic. I really should have made a hierarchical model incorporating data from multiple squares and several days of running around. To factor in the effect of the weather, the flower density, and the diet of Torsent also couldn’t hurt.

I guess I could spend some time improving the model. And I guess there is a lesson to be learned here about that it is hard to predict the predictive performance of a model. But all that has to wait. I am on vacation after all.

*The full data and code behind this blog post can be found here.*

**leave a comment**for the author, please follow the link and comment on their blog:

**Publishable Stuff**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.