Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Pokémon Go is an augmented reality game where people with
smartphones walk around and catch Pokémon. As in the classic games, players are
Pokémon “trainers” who have to travel around and collect creatures. Some types
are rarer than others, some have regional habitats or biomes, and so players
explore the game world to find and collect them. Pokémon Go
“augments reality” having these Pokémon spawn in the game world as players
travel around the real world. I like the game; I’ve reached level 30.

On February 16, 2017, a second “generation” of Pokémon were released into the
wild, adding dozens of new species to the game. The rarest among this new
generation
is Unown. As part of a cruel scheme to torture
completionists, the incredibly rare Unown comes in 26 varieties—one for each
letter of the alphabet. Which brings us to the statistical question behind this
blog post: How long will it take to encounter all 26 types of Unowns (taking
into account repeats)?
Somewhat more formally, how many random encounters
(i.e., draws while sampling with replacement) will it take until we have
encountered all 26 Unowns?

This general problem is called the coupon collector’s problem—hat tip
article for the problem includes a table which says that the expected number
of encounters for n = 26 is = 101. (The analytic solution is actually
100.2, but we round up because there are no fractional encounters.) So problem
solved?

Well, not quite. Suppose we had never heard of the coupon collector’s problem.
Also, suppose that we want to get a sense of the uncertainty around the
expected value. For example, how many encounters will it take for the
unfortunate trainers in the 95th percentile? How might we tackle this problem?

When analytic solutions are difficult or tedious, we can write simulations and
get very good approximations. That’s what we’re going to do in R.

We will write a function to simulate a single exhaustive set of Unown
encounters. The main workhorse is `sample(..., replace = TRUE)` to sample with
replacement. Using R’s built-in `LETTERS` constant, we can simulate a batch of
Unown encounters.

```<span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">#> [1] "F"
</span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">#> [1] "H" "U" "C" "G" "D"
</span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">#>  [1] "T" "K" "K" "D" "Z" "O" "Y" "K" "Q" "Q"
</span>```

The question now is how many samples does it take to get the 26 different
Unowns. An absolute, and frankly miraculous, lower bound on this number would be
26, so let’s draw 26 samples.

```<span class="n">set.seed</span><span class="p">(</span><span class="m">252</span><span class="p">)</span><span class="w">  </span><span class="c1"># For reproducble blogging
</span><span class="n">n_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">xs</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">xs</span><span class="p">))</span><span class="w">

</span><span class="c1"># Unowns in the batch
</span><span class="n">first_batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">first_batch</span><span class="w">
</span><span class="c1">#>  [1] "X" "S" "I" "T" "R" "J" "L" "N" "F" "I" "P" "Q" "K" "D" "Q" "Q" "K"
#> [18] "O" "S" "C" "F" "C" "P" "X" "V" "L"
</span><span class="w">
</span><span class="c1"># Number of unique ones
</span><span class="n">n_unique</span><span class="p">(</span><span class="n">first_batch</span><span class="p">)</span><span class="w">
</span><span class="c1">#> [1] 16
</span><span class="w">
</span><span class="c1"># Number of remaining Unowns we have not encountered yet
</span><span class="n">leftover</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">26</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">(</span><span class="n">first_batch</span><span class="p">)</span><span class="w">
</span><span class="n">leftover</span><span class="w">
</span><span class="c1">#> [1] 10
</span>```

We encountered 16 unique Unowns from the first batch of
samples. The best-case lower bound for the number of the encounters remaining is
now 10, so let’s take 10 more samples.

```<span class="n">second_batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leftover</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">second_batch</span><span class="w">
</span><span class="c1">#>  [1] "Y" "X" "S" "F" "P" "Z" "F" "F" "Y" "G"
</span><span class="w">
</span><span class="c1"># Combine the batches
</span><span class="n">both_batches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">first_batch</span><span class="p">,</span><span class="w"> </span><span class="n">second_batch</span><span class="p">)</span><span class="w">
</span><span class="n">n_unique</span><span class="p">(</span><span class="n">both_batches</span><span class="p">)</span><span class="w">
</span><span class="c1">#> [1] 19
</span><span class="w">
</span><span class="n">leftover</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">26</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">(</span><span class="n">both_batches</span><span class="p">)</span><span class="w">
</span><span class="n">leftover</span><span class="w">
</span><span class="c1">#> [1] 7
</span>```

We found 3 new Unowns in this
batch, and we have encountered 19 unique ones so far
from 36 total samples. That means the lower bound is now
7.

```<span class="n">third_batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leftover</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">third_batch</span><span class="w">
</span><span class="c1">#> [1] "U" "A" "G" "P" "K" "L" "E"
</span><span class="w">
</span><span class="n">all_batches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">both_batches</span><span class="p">,</span><span class="w"> </span><span class="n">third_batch</span><span class="p">)</span><span class="w">
</span><span class="n">n_unique</span><span class="p">(</span><span class="n">all_batches</span><span class="p">)</span><span class="w">
</span><span class="c1">#> [1] 22
</span><span class="w">
</span><span class="n">leftover</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">26</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">(</span><span class="n">all_batches</span><span class="p">)</span><span class="w">
</span><span class="n">leftover</span><span class="w">
</span><span class="c1">#> [1] 4
</span>```

We found 3 new Unowns in this—

Actually, this is getting tedious. We all know where this process is going: Take
a sample, see how many you have left to find, take another sample of that size,
etc. until you have 0 left to find. Pretty simple? Great! Now, I don’t have to
explain how the `while` loop inside the function works.

```<span class="n">simulate_unown</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Use a sequence of numbers instead of LETTERS
</span><span class="w">  </span><span class="n">n_unowns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">26</span><span class="w">
</span><span class="n">unown_set</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">n_unowns</span><span class="p">)</span><span class="w">

</span><span class="n">n_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">encountered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">character</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="c1"># Take 26 samples on first iteration, 26 - n_unique on next iteration, etc.
</span><span class="w">  </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">unown_set</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_unowns</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">encountered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">encountered</span><span class="p">,</span><span class="w"> </span><span class="n">batch</span><span class="p">)</span><span class="w">
</span><span class="n">n_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">encountered</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="nf">length</span><span class="p">(</span><span class="n">encountered</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span>```

Each call of the function simulates a process of encountering all 26 Unowns,
returning how many encounters were required to find them all.

```<span class="n">simulate_unown</span><span class="p">()</span><span class="w">
</span><span class="c1">#> [1] 111
</span><span class="n">simulate_unown</span><span class="p">()</span><span class="w">
</span><span class="c1">#> [1] 81
</span><span class="n">simulate_unown</span><span class="p">()</span><span class="w">
</span><span class="c1">#> [1] 116
</span>```

We use `replicate()` to call this function thousands of times.

```<span class="n">simulation</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">simulate_unown</span><span class="p">())</span><span class="w">
</span>```

We can get summary statistics and other quantiles for these simulations.

```<span class="n">summary</span><span class="p">(</span><span class="n">simulation</span><span class="p">)</span><span class="w">
</span><span class="c1">#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    41.0    78.0    95.0   100.3   116.0   373.0
</span><span class="n">quantile</span><span class="p">(</span><span class="n">simulation</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="m">.10</span><span class="p">,</span><span class="w"> </span><span class="m">.25</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">,</span><span class="w"> </span><span class="m">.90</span><span class="p">,</span><span class="w"> </span><span class="m">.95</span><span class="p">))</span><span class="w">
</span><span class="c1">#>  5% 10% 25% 50% 75% 90% 95%
#>  61  67  78  95 116 141 161
</span>```

The mean in our simulations 100.3 is very close to
the analytic solution of 100.2. The median 95 is less than
the mean, which is a bit of good news: More than half of players will hit 26
in less than the expected 100 encounters. The bad news is that there is a long
tail to these simulations, as the RNG gods have cursed one player by requiring
373 encounters.

We can visualize the distribution with a histogram.

```<span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">simulation</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">binwidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">,</span><span class="w"> </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">102.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Num. Unowns encountered until 26 unique Unowns encountered"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Num. samples in 10,000 simulations"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.995</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.995</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"A long-tail of unfortunate RNG for Unown completionists"</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="m">1</span><span class="w">
</span>```

I haven’t seen any of these Pokémon in the month since they were added to the
game. Assuming I see an Unown every month, I can expect to see them all in 100
encounters over the course of 8.3 years. As I said, a cruel
joke for completionists.

## What? SIMULATE_UNOWN is evolving!

One nice thing about using simulations to compute statistics is that we can
easily modify the simulation function to answer related questions. For example:

• The above simulation assumed that we can catch the Unowns 100% of the time.
What if we fail on 5% of the encounters?
• Two more Unowns were added to the series in the third Pokémon generation.
What is the expected number of encounters for 28 Unowns?
• Suppose we already have 20 Unowns. How many more encounters are required to
collect the remaining 6?

We can add some parameters to our function to address these questions. To
simulate catching, we sample a `TRUE` or `FALSE` value for each Unown sampled. The
`prob` argument for `sample()` lets us assign probabilities to elements in the
sample, so we can use `c(p_catch, 1 - p_catch)` as probabilities for sampling
`TRUE` or `FALSE`—that is, catching a Pokémon.

values for `encountered` and `n_unique` to give those values a head start
before the encounter/catch loop. Afterwards, we have to adjust our encounter

```<span class="c1"># Defaults to 26 unowns, 0 already caught, and 100% success rate
</span><span class="n">simulate_unown_catches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="n">n_already</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">p_catch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">unown_set</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">n_unowns</span><span class="p">)</span><span class="w">
</span><span class="n">catch_probs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">p_catch</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">p_catch</span><span class="p">)</span><span class="w">

</span><span class="c1"># Take into account any previously caught ones
</span><span class="w">  </span><span class="n">n_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n_already</span><span class="w">
</span><span class="n">already_encountered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">n_already</span><span class="p">)</span><span class="w">

</span><span class="n">encountered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">already_encountered</span><span class="w">
</span><span class="n">caught</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">already_caught</span><span class="w">

</span><span class="c1"># Encounter/catch loop
</span><span class="w">  </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">unown_set</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_unowns</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Simulate catching success for each Unown
</span><span class="w">    </span><span class="n">catches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_unowns</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n_unique</span><span class="p">,</span><span class="w">
</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">catch_probs</span><span class="p">)</span><span class="w">
</span><span class="n">caught_in_batch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">batch</span><span class="p">[</span><span class="n">catches</span><span class="p">]</span><span class="w">

</span><span class="n">encountered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">encountered</span><span class="p">,</span><span class="w"> </span><span class="n">batch</span><span class="p">)</span><span class="w">
</span><span class="n">caught</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">caught</span><span class="p">,</span><span class="w"> </span><span class="n">caught_in_batch</span><span class="p">)</span><span class="w">
</span><span class="n">n_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">caught</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="nf">length</span><span class="p">(</span><span class="n">encountered</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">already_encountered</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span>```

With the default settings, the function should reproduce the original behavior
and give us similar results.

```<span class="n">simulation2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">simulate_unown_catches</span><span class="p">())</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">simulation2</span><span class="p">)</span><span class="w">
</span><span class="c1">#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    38.0    78.0    94.0   100.2   115.0   369.0
</span>```

We should expect the average number of required encounters to catch all 26
Unowns to increase by 1.05 if there’s a 95% catch rate. Our simulations confirm
this intuition.

```<span class="n">simulation_95_rate</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">simulate_unown_catches</span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="n">p_catch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.95</span><span class="p">))</span><span class="w">

</span><span class="n">summary</span><span class="p">(</span><span class="n">simulation_95_rate</span><span class="p">)</span><span class="w">
</span><span class="c1">#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    41.0    82.0   100.0   106.2   124.0   338.0
</span>```

We can also simulate the case for 28 Unowns with a 100% encounter rate, and see
that the average number of required encounters increases by approximately 10.

```<span class="n">simulation_28_unowns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">simulate_unown_catches</span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">28</span><span class="p">,</span><span class="w"> </span><span class="n">p_catch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">

</span><span class="n">summary</span><span class="p">(</span><span class="n">simulation_28_unowns</span><span class="p">)</span><span class="w">
</span><span class="c1">#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    44.0    85.0   104.0   109.9   127.0   322.0
</span>```

Finally, we can simulate the home stretch of Unown completion where we
have 20 Unowns with 6 more to go.

```<span class="n">simulation_last_6</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">simulate_unown_catches</span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="n">n_already</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">p_catch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">

</span><span class="n">summary</span><span class="p">(</span><span class="n">simulation_last_6</span><span class="p">)</span><span class="w">
</span><span class="c1">#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    8.00   42.00   57.00   63.39   79.00  282.00
</span>```

This result is interesting: It says that we will spend the majority of our time
working on the last 6 Unowns.

## Simulate ‘em all

A natural next step is to run many simulations over a range of parameter
values. Here, we can study the home-stretch behavior by running a simulation
for each value of `n_already` from 0 to 25.

We will create a function to simulate 2000 home-stretch samples of required
encounters for a given number of starting Unowns. This function will return a
dataframe with one row per simulation sample. We use `Map()` to apply this
function for each value of `n_already` from 0 to 25.

```<span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">,</span><span class="w"> </span><span class="n">warn.conflicts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="n">simulate_2k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n_already</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n_sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2000</span><span class="w">
</span><span class="n">sim_results</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_sims</span><span class="p">,</span><span class="w">
</span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">simulate_unown_catches</span><span class="p">(</span><span class="n">n_unowns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="n">n_already</span><span class="p">,</span><span class="w"> </span><span class="n">p_catch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">

</span><span class="c1"># Package everything together in a dataframe
</span><span class="w">  </span><span class="n">data_frame</span><span class="p">(</span><span class="w">
</span><span class="n">n_already</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">n_already</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_sims</span><span class="p">),</span><span class="w">
</span><span class="n">simulation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">n_sims</span><span class="p">),</span><span class="w">
</span><span class="n">n_encounters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_results</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">results</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Map</span><span class="p">(</span><span class="n">simulate_2k</span><span class="p">,</span><span class="w"> </span><span class="n">n_already</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">25</span><span class="p">)</span><span class="w">
</span><span class="n">df_results</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="w">
</span><span class="n">df_results</span><span class="w">
</span><span class="c1">#> # A tibble: 52,000 × 3
#>        <int>      <int>        <int>
#> 1          0          1          157
#> 2          0          2          125
#> 3          0          3          129
#> 4          0          4          107
#> 5          0          5           69
#> 6          0          6           65
#> 7          0          7           70
#> 8          0          8          147
#> 9          0          9           69
#> 10         0         10           77
#> # ... with 51,990 more rows
</span>```

We can now plot the expected number of encounters for each starting value.
Variance from random simulation keeps the points from following a neat curve,
so let’s just appreciate the main trends in the results.

```<span class="n">ggplot</span><span class="p">(</span><span class="n">df_results</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_already</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_encounters</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">stat_summary</span><span class="p">(</span><span class="n">fun.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"point"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Num. Unown types already encountered"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Expected num. encounters to find all 26"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.995</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.995</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"The painful home stretch of Unown completion"</span><span class="p">)</span><span class="w">
</span>```

Because the analytic solution for finding 26 Unowns starting from 0 is 100, we
can also read the y-axis as a percentage. In other words, 60% of the work
(number of encounters out of 100) will be spent on the last 5 Unowns.

## Simulation as a way to learn statistics

An underlying theme for this post is best summarized by a line from a talk
called Statistics for Hackers:

“If you can write for loops, you can do
statistics.” — Statistics for Hackers

Simulation provides a way for “hackers” to leverage one skill (programming) to
learn another domain (statistics). I myself find that I have trouble learning
statistics from equations alone. I need code to play with a problem and develop

All of these points about the nice features of simulation must be painfully
obvious to professional statisticians. But strangely, I only used simulations
once or twice in my statistics courses, so it wasn’t part of my statistical tool
kit. Indeed, I had the usefulness of simulation impressed upon me last year by
Gelman and Hill’s textbook. The book advises the
calculations—for example, trying to estimate a 95% prediction interval for the
income difference between two groups when the underlying statistical model used
log-dollars. Just have the model generate thousands for predictions, run the
transformation, and find the interval on the transformed data. The book also
uses simulation as a sneaky backdoor into informal Bayesian statistics: Measure
uncertainty by simulating new data from perturbed model parameters. This idea
made enough sense to me to lure me into the chapters on formal Bayesian
inference and learn that statistical framework.