sample_n_of(): a useful helper function

[This article was first published on Higher Order Functions, and kindly contributed to R-bloggers.]

Here’s the problem: I have some data with nested time series. Lots of them. It’s
like there are many, many little datasets inside my data. There are too many
groups to plot all of the time series at once, so I just want to preview a
handful of them.

For a working example, suppose we want to visualize the top 50 American female
baby names over time. I start by adding up the total number of births for each
name, finding the overall top 50 most popular names, and then keeping just the
time series from those top names.

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

babynames <- babynames::babynames %>%
  filter(sex == "F")

top50 <- babynames %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  top_n(50, total)

# keep just rows in babynames that match a row in top50
top_names <- babynames %>%
  semi_join(top50, by = "name")
```

Hmm, so what does this look like?

```r
ggplot(top_names) +
  aes(x = year, y = n) +
  geom_line() +
  facet_wrap("name")
```

Figure: An illegible plot because too many facets are plotted.

Aaack, I can’t read anything! Can’t I just see a few of them?

I face this problem so frequently that I wrote a helper function to handle
it: sample_n_of(). This is not a very clever name, but it works. Below I call
the function from my personal R package and plot just the data from four
names.

```r
# For reproducible blogging
set.seed(20180524)

top_names %>%
  tjmisc::sample_n_of(4, name) %>%
  ggplot() +
    aes(x = year, y = n) +
    geom_line() +
    facet_wrap("name")
```

Figure: A plot with four faceted time series.

In this post, I walk through how this function works. It’s not very
complicated: It relies on some light tidy evaluation plus one obscure dplyr
function.

Working through the function

As usual, let’s start by sketching out the function we want to write:

```r
sample_n_of <- function(data, size, ...) {
  # quote the dots
  dots <- quos(...)

  # ...now make things happen...
}
```

where size is the number of groups to sample and ... are the column names
that define the groups. We use quos(...) to capture and quote those column
names. (As I wrote before, quotation is how we bottle up R code so we can
deploy it later.)

For interactive testing, suppose our dataset is the time series from the top 50
names and we want data from a sample of 5 names. In this case, the values for
the arguments would be:

```r
data <- top_names
size <- 5
dots <- quos(name)
```

A natural way to think about this problem is that we want to sample subgroups of
the dataframe. First, we create a grouped version of the dataframe using
group_by(). The function group_by() also takes a ... argument where the
dots are typically names of columns in the dataframe. We want to take the
names inside of our dots, unquote them, and plug them into where the ...
goes in group_by(). This is what the tidy evaluation world calls
splicing.

Think of splicing as doing this:

```r
# Demo function that counts the number of arguments in the dots
count_args <- function(...) length(quos(...))
example_dots <- quos(var1, var2, var2)

# Splicing turns the first form into the second one
count_args(!!! example_dots)
#> [1] 3
count_args(var1, var2, var2)
#> [1] 3
```

So, we create a grouped dataframe by splicing our dots into the group_by()
function.

```r
grouped <- data %>%
  group_by(!!! dots)
```
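
To see the splice at work outside of the function, here is a minimal sketch on a made-up data frame. The names toy, toy_dots, spliced, and typed are illustrative only and are not part of the original example. The point is just that splicing the quoted names should give the same grouping as typing the column names by hand.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: three ids measured over one or two waves
toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  wave = c(1, 2, 1, 2, 1),
  value = c(10, 11, 20, 21, 30),
  stringsAsFactors = FALSE
)

# Quote two column names, as quos(...) would do inside a function
toy_dots <- quos(id, wave)

# Group once by splicing the quoted names and once by typing them
spliced <- toy %>% group_by(!!! toy_dots)
typed <- toy %>% group_by(id, wave)

# Both versions end up grouped by the same variables
identical(group_vars(spliced), group_vars(typed))
#> [1] TRUE
```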

There is a helper function buried in dplyr called group_indices() which
returns the grouping index for each row in a grouped dataframe.

```r
grouped %>%
  tibble::add_column(group_index = group_indices(grouped))
#> # A tibble: 6,407 x 6
#> # Groups:   name [50]
#>     year sex   name          n    prop group_index
#>    <dbl> <chr> <chr>     <int>   <dbl>       <int>
#>  1  1880 F     Mary       7065 0.0724           33
#>  2  1880 F     Anna       2604 0.0267            4
#>  3  1880 F     Emma       2003 0.0205           19
#>  4  1880 F     Elizabeth  1939 0.0199           17
#>  5  1880 F     Margaret   1578 0.0162           32
#>  6  1880 F     Sarah      1288 0.0132           45
#>  7  1880 F     Laura      1012 0.0104           29
#>  8  1880 F     Catherine   688 0.00705          11
#>  9  1880 F     Helen       636 0.00652          21
#> 10  1880 F     Frances     605 0.00620          20
#> # ... with 6,397 more rows
```
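
A small sketch on a made-up data frame makes the numbering easier to see. The names toy and toy_index are hypothetical. The indices follow the sorted order of the group keys rather than the order in which the groups first appear, which matches the output above, where Anna gets 4 and Mary gets 33.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data with the groups deliberately out of order
toy <- data.frame(
  id = c("b", "a", "b", "c", "a"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# One group index per row: "a" is group 1, "b" is group 2, "c" is group 3
toy_index <- toy %>%
  group_by(id) %>%
  group_indices()

toy_index
#> [1] 2 1 2 3 1
```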

We can randomly sample five of the group indices and keep the rows for just
those groups.

```r
unique_groups <- unique(group_indices(grouped))
sampled_groups <- sample(unique_groups, size)
sampled_groups
#> [1]  4 25 43 20 21

subset_of_the_data <- data %>%
  filter(group_indices(grouped) %in% sampled_groups)
subset_of_the_data
#> # A tibble: 674 x 5
#>     year sex   name         n      prop
#>    <dbl> <chr> <chr>    <int>     <dbl>
#>  1  1880 F     Anna      2604 0.0267   
#>  2  1880 F     Helen      636 0.00652  
#>  3  1880 F     Frances    605 0.00620  
#>  4  1880 F     Samantha    21 0.000215 
#>  5  1881 F     Anna      2698 0.0273   
#>  6  1881 F     Helen      612 0.00619  
#>  7  1881 F     Frances    586 0.00593  
#>  8  1881 F     Samantha    12 0.000121 
#>  9  1881 F     Karen        6 0.0000607
#> 10  1882 F     Anna      3143 0.0272   
#> # ... with 664 more rows

# Confirm that only five names are in the dataset
subset_of_the_data %>%
  distinct(name)
#> # A tibble: 5 x 1
#>   name    
#>   <chr>   
#> 1 Anna    
#> 2 Helen   
#> 3 Frances 
#> 4 Samantha
#> 5 Karen
```

Putting these steps together, we get:

```r
sample_n_of <- function(data, size, ...) {
  dots <- quos(...)

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  sampled_groups <- sample(unique(group_ids), size)

  data %>%
    filter(group_ids %in% sampled_groups)
}
```

We can test that the function works as we might expect. Sampling 10 names
returns the data for 10 names.

```r
ten_names <- top_names %>%
  sample_n_of(10, name) %>%
  print()
#> # A tibble: 1,326 x 5
#>     year sex   name         n      prop
#>    <dbl> <chr> <chr>    <int>     <dbl>
#>  1  1880 F     Sarah     1288 0.0132   
#>  2  1880 F     Frances    605 0.00620  
#>  3  1880 F     Rachel     166 0.00170  
#>  4  1880 F     Samantha    21 0.000215 
#>  5  1880 F     Deborah     12 0.000123 
#>  6  1880 F     Shirley      8 0.0000820
#>  7  1880 F     Carol        7 0.0000717
#>  8  1880 F     Jessica      7 0.0000717
#>  9  1881 F     Sarah     1226 0.0124   
#> 10  1881 F     Frances    586 0.00593  
#> # ... with 1,316 more rows

ten_names %>%
  distinct(name)
#> # A tibble: 10 x 1
#>    name    
#>    <chr>   
#>  1 Sarah   
#>  2 Frances 
#>  3 Rachel  
#>  4 Samantha
#>  5 Deborah 
#>  6 Shirley 
#>  7 Carol   
#>  8 Jessica 
#>  9 Patricia
#> 10 Sharon
```

We can sample based on multiple columns too. Ten combinations of names and years
should return just ten rows.

```r
top_names %>%
  sample_n_of(10, name, year)
#> # A tibble: 10 x 5
#>     year sex   name          n      prop
#>    <dbl> <chr> <chr>     <int>     <dbl>
#>  1  1907 F     Jessica      17 0.0000504
#>  2  1932 F     Catherine  5446 0.00492  
#>  3  1951 F     Nicole       94 0.0000509
#>  4  1953 F     Janet     17761 0.00921  
#>  5  1970 F     Sharon     9174 0.00501  
#>  6  1983 F     Melissa   23473 0.0131   
#>  7  1989 F     Brenda     2270 0.00114  
#>  8  1989 F     Pamela     1334 0.000670 
#>  9  1994 F     Samantha  22817 0.0117   
#> 10  2014 F     Kimberly   2891 0.00148
```

Next steps

There are a few tweaks we could make to this function. For example, in my
package’s version, I warn the user when the requested sample size is larger
than the number of groups.

```r
too_many <- top_names %>%
  tjmisc::sample_n_of(100, name)
#> Warning: Sample size (100) is larger than number of groups (50). Using size
#> = 50.
```
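
One way such a check could be added to the function from this post is sketched below. This is not the tjmisc source: the name sample_n_of_capped and the toy data are hypothetical, and the warning text simply imitates the message shown above.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant of sample_n_of() that caps an oversized sample size
sample_n_of_capped <- function(data, size, ...) {
  dots <- quos(...)

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  unique_ids <- unique(group_ids)
  n_available <- length(unique_ids)

  # Warn and fall back to all of the groups when too many are requested
  if (size > n_available) {
    warning(
      "Sample size (", size, ") is larger than number of groups (",
      n_available, "). Using size = ", n_available, ".",
      call. = FALSE
    )
    size <- n_available
  }

  # Draw positions with sample.int() and look up the matching group ids
  sampled_groups <- unique_ids[sample.int(n_available, size)]

  data %>%
    filter(group_ids %in% sampled_groups)
}

# Made-up data: three groups of two rows each
toy <- data.frame(
  id = rep(c("a", "b", "c"), each = 2),
  value = 1:6,
  stringsAsFactors = FALSE
)

# Asking for 10 groups warns and then returns all three groups
capped <- suppressWarnings(sample_n_of_capped(toy, 10, id))
nrow(capped)
#> [1] 6
```

Capping the size keeps the function usable in a pipeline, whereas plain sample() would stop with an error when asked for more items than it has.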

My version also randomly samples n of the rows when there are no grouping
variables provided.

```r
top_names %>%
  tjmisc::sample_n_of(2)
#> # A tibble: 2 x 5
#>    year sex   name          n     prop
#>   <dbl> <chr> <chr>     <int>    <dbl>
#> 1  1934 F     Stephanie   128 0.000118
#> 2  2007 F     Mary       3674 0.00174
```
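
A row-sampling fallback like this could be sketched as follows. Again, this is an assumption about one possible approach, not the tjmisc implementation: sample_n_of_rows and the toy data are hypothetical names, and the fallback leans on dplyr's sample_n().

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant of sample_n_of() that samples rows when no
# grouping columns are given
sample_n_of_rows <- function(data, size, ...) {
  dots <- quos(...)

  # Empty dots: there are no groups to sample, so sample rows instead
  if (length(dots) == 0) {
    return(sample_n(data, size))
  }

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  sampled_groups <- sample(unique(group_ids), size)

  data %>%
    filter(group_ids %in% sampled_groups)
}

# Made-up data: three groups of two rows each
toy <- data.frame(
  id = rep(c("a", "b", "c"), each = 2),
  value = 1:6,
  stringsAsFactors = FALSE
)

# No grouping columns: two rows come back
two_rows <- sample_n_of_rows(toy, 2)
nrow(two_rows)
#> [1] 2

# One grouping column: both rows of one sampled id come back
one_group <- sample_n_of_rows(toy, 1, id)
nrow(one_group)
#> [1] 2
```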

One open question is how to handle data that’s already grouped. The function we
wrote above fails.

```r
top_names %>%
  group_by(name) %>%
  sample_n_of(2, year)
#> Error in filter_impl(.data, quo): Result must have length 136, not 6407
```

Is this a problem?

Here I think failure is okay because what do we think should happen? It’s not
obvious. Presumably, it should randomly choose 2 of the years for each name.
Should it be the same two years for every name? Then sampling from the
ungrouped data should be fine.

```r
top_names %>%
  sample_n_of(2, year)
#> # A tibble: 100 x 5
#>     year sex   name         n    prop
#>    <dbl> <chr> <chr>    <int>   <dbl>
#>  1  1970 F     Jennifer 46160 0.0252 
#>  2  1970 F     Lisa     38965 0.0213 
#>  3  1970 F     Kimberly 34141 0.0186 
#>  4  1970 F     Michelle 34053 0.0186 
#>  5  1970 F     Amy      25212 0.0138 
#>  6  1970 F     Angela   24926 0.0136 
#>  7  1970 F     Melissa  23742 0.0130 
#>  8  1970 F     Mary     19204 0.0105 
#>  9  1970 F     Karen    16701 0.00912
#> 10  1970 F     Laura    16497 0.00901
#> # ... with 90 more rows
```

Or, should those two years be randomly selected for each name? Then, we should
let do() handle that. do() takes some code that returns a dataframe, applies
it to each group, and returns the combined result.

```r
top_names %>%
  group_by(name) %>%
  do(sample_n_of(., 2, year))
#> # A tibble: 100 x 5
#> # Groups:   name [50]
#>     year sex   name       n      prop
#>    <dbl> <chr> <chr>  <int>     <dbl>
#>  1  1913 F     Amanda   346 0.000528 
#>  2  1953 F     Amanda   428 0.000222 
#>  3  1899 F     Amy      281 0.00114  
#>  4  1964 F     Amy     9579 0.00489  
#>  5  1916 F     Angela   715 0.000659 
#>  6  2005 F     Angela  2893 0.00143  
#>  7  1999 F     Anna    9092 0.00467  
#>  8  2011 F     Anna    5649 0.00292  
#>  9  1952 F     Ashley    24 0.0000126
#> 10  2006 F     Ashley 12340 0.00591  
#> # ... with 90 more rows
```

I think raising an error and forcing the user to clarify their code is better
than choosing one of these options and not doing what the user expects.
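
If we wanted that failure to be deliberate rather than an accident of filter(), one option is an explicit check at the top of the function. The sketch below is only an illustration: sample_n_of_strict, the error wording, and the toy data are all hypothetical, and the check relies on dplyr's is_grouped_df().

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant of sample_n_of() that refuses grouped input
sample_n_of_strict <- function(data, size, ...) {
  # Stop early with a readable message instead of guessing the intent
  if (is_grouped_df(data)) {
    stop(
      "`data` is already grouped. Call ungroup() first, or use do() ",
      "to sample within each group.",
      call. = FALSE
    )
  }

  dots <- quos(...)

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  sampled_groups <- sample(unique(group_ids), size)

  data %>%
    filter(group_ids %in% sampled_groups)
}

# Made-up data: three ids, each measured in two waves
toy <- data.frame(
  id = rep(c("a", "b", "c"), each = 2),
  wave = rep(1:2, times = 3),
  stringsAsFactors = FALSE
)

# Grouped input is rejected; capture the message to inspect it
grouped_result <- tryCatch(
  toy %>% group_by(id) %>% sample_n_of_strict(1, wave),
  error = function(e) conditionMessage(e)
)
grouped_result

# Ungrouped input works as before: one wave, three rows
ungrouped_result <- sample_n_of_strict(toy, 1, wave)
nrow(ungrouped_result)
#> [1] 3
```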
