Sow the seeds, know the seeds

[This article was first published on Maëlle, and kindly contributed to R-bloggers.]

When you run simulations in R, e.g. drawing samples from a distribution, it’s best to set a random seed with the function set.seed so that your results are reproducible. Its seed argument has no default value, so you have to choose a number yourself. I think I mostly use set.seed(1). Last week I received an R script from a colleague in which he used a weird number in set.seed (maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages, I used the GitHub API via the gh package to get a very rough answer (an answer seedling from the question seed?).
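As a quick reminder of what the seed buys you, here is a minimal sketch (the object names `first_draw` and `second_draw` are mine, purely for illustration): resetting the same seed before drawing gives the same numbers again.

```r
# Draw three normal values after setting the seed
set.seed(1)
first_draw <- rnorm(3)

# Reset the same seed and draw again: the generator restarts from the same state
set.seed(1)
second_draw <- rnorm(3)

# The two draws are exactly the same vector
identical(first_draw, second_draw)
```

Without the second set.seed(1) call, the second draw would continue the random stream and give different values.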

From the GitHub API search endpoint you can get up to 1,000 results for a query, which in the case of set.seed occurrences in R code isn’t the whole picture but hopefully a good sample. I wrote a function to process the output of a query to the API, taking advantage of the stringr package. I just want whatever is inside set.seed() in the text matches returned by the API.

```r
get_seeds_from_matches <- function(item){
  url <- item$html_url
  # keep only the text fragment of each match
  matches <- item$text_matches
  matches <- unlist(lapply(matches, "[[", "fragment"))
  # one element per line of code in the fragments
  matches <- stringr::str_split(matches, "\n", simplify = TRUE)
  # keep the set.seed(...) call, then strip everything around its argument
  matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
  matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
  seeds <- stringr::str_replace(matches, "\\).*", "")
  # lines without a set.seed() call produce NA
  seeds <- seeds[!is.na(seeds)]
  tibble::tibble(seed = seeds,
                 url = rep(url, length(seeds)))
}
```
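To see what the string handling in that function does, here is a small sketch applying the same stringr steps to a made-up fragment. The `fragment` text is invented for illustration and is not real API output; the sketch assumes stringr is installed.

```r
# Hypothetical text fragment, similar in shape to what the API returns
fragment <- "library(MASS)\nset.seed(42)\nx <- rnorm(10)\nset.seed(seed) # reset"

# Same steps as in the function: split into lines, extract the call, strip the wrapper
code_lines <- stringr::str_split(fragment, "\n", simplify = TRUE)
calls <- stringr::str_extract(code_lines, "set\\.seed\\(.*\\)")
calls <- stringr::str_replace(calls, "set\\.seed\\(", "")
toy_seeds <- stringr::str_replace(calls, "\\).*", "")

# Lines without a set.seed() call gave NA, drop them
toy_seeds <- as.vector(toy_seeds[!is.na(toy_seeds)])
toy_seeds
```

Note that the last replacement cuts at the first closing parenthesis, so a nested call such as set.seed(as.integer(Sys.time())) would be truncated. That is good enough for a rough look at seeds.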

After that I made the queries themselves, pausing every 30 pages because of rate limiting, and wrapping the call in try so that the loop stops as soon as I have reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionist mood.
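The stop-on-error idea can be sketched without touching the API. In the toy version below, `fetch_page` is a hypothetical stand-in for the real call, and the limit of 34 pages is my assumption (1,000 results at 30 results per page); the real endpoint may behave differently.

```r
# Hypothetical stand-in for the API call: pages past the limit raise an error
fetch_page <- function(page) {
  if (page > 34) stop("no more results available")
  sprintf("results of page %d", page)
}

pages <- list()
page <- 1
ok <- TRUE
while (ok) {
  # try() catches the error so the loop can end cleanly instead of crashing
  res <- try(fetch_page(page), silent = TRUE)
  ok <- !inherits(res, "try-error")
  if (ok) {
    pages[[page]] <- res
    page <- page + 1
  }
}

# Number of pages collected before the first error
length(pages)
```

The drawback of this pattern is that any error ends the loop, not only the one you expect at the end of the results.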

Note that the header "Accept" = 'application/vnd.github.v3.text-match+json' is very important: without it you wouldn’t get the text fragments in the results.

```r
library("gh")
seeds <- NULL

ok <- TRUE
page <- 1
while(ok){
  # try() turns the error raised once no more results are available into a stop signal
  matches <- try(gh("/search/code", q = "set.seed&language:r",
                    .token = Sys.getenv("GITHUB_PAT"),
                    .send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'),
                    page = page), silent = TRUE)
  ok <- !inherits(matches, "try-error")

  if(ok){
    seeds <- dplyr::bind_rows(seeds, dplyr::bind_rows(lapply(matches$items,
                                                             get_seeds_from_matches)))
  }

  page <- page + 1
  # wait 2 minutes every 30 pages
  if(page %% 30 == 1 && page > 1){
    Sys.sleep(120)
  }

}

save(seeds, file = "data/2017-04-12-seeds.RData")
```

```r
library("magrittr")
load("data/2017-04-12-seeds.RData")
head(seeds) %>%
  knitr::kable()
```

| seed | url |
|:-----|:----|
| 1 | https://github.com/berndbischl/ParamHelpers/blob/9d374430701d94639cc78db84f91a0c595927189/tests/testthat/helper_zzz.R |
| 1 | https://github.com/TypeFox/R-Examples/blob/d0917dbaf698cb8bc0789db0c3ab07453016eab9/ParamHelpers/tests/testthat/helper_zzz.R |
| 1 | https://github.com/cran/ParamHelpers/blob/92a49db23e69d32c8ae52585303df2875d740706/tests/testthat/helper_zzz.R |
| 4.0 | https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
| 4.0 | https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
| 4.0 | https://github.com/KellyBlack/R-Object-Oriented-Programming/blob/efbb0b81063baa30dd9d56d5d74b3f73b12b4926/chapter4/chapter_4_ex11.R |

I got 984 entries, not 1,000, so maybe I lost some seeds in the process or the results weren’t perfect. I also added the URL of each script to the results so that I could go and look at the code around surprising seeds.

Let’s have a look at the most frequent seeds in the sample.

```r
table(seeds$seed) %>%
  broom::tidy() %>%
  dplyr::arrange(dplyr::desc(Freq)) %>%
  head(n = 12) %>%
  knitr::kable()
```

| Var1 | Freq |
|:-----|-----:|
| seed | 312 |
| 1 | 134 |
| 123 | 60 |
| iseed | 48 |
| 10 | 47 |
| 13121098 | 28 |
| ss | 24 |
| 20 | 21 |
| 1234 | 18 |
| 42 | 18 |
| 123456 | 15 |
| 0 | 14 |
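The counting step itself needs nothing beyond base R. Here is a minimal sketch on a made-up vector (`toy_values` is invented for illustration, not the scraped data) that gives the same kind of Var1/Freq style summary.

```r
# Made-up seed values, as character strings like in the scraped data
toy_values <- c("1", "42", "1", "seed", "123", "seed", "seed")

# Count each value and sort from most to least frequent
counts <- sort(table(toy_values), decreasing = TRUE)

# Turn the table into a data frame with one row per value
top_values <- head(as.data.frame(counts, stringsAsFactors = FALSE), n = 2)
top_values
```

The seeds stay character strings on purpose, since entries like seed or iseed are variable names rather than numbers.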

So the most prevalent seed is a mystery: the top entry is a variable named seed, and I’m not motivated enough to scrape the code to find out whether it gets assigned a value beforehand, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?
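If you wanted to set such variable names aside and look only at literal numbers, a rough sketch could look like this. The values are copied from the tables above, and the regular expression is my own assumption about what counts as a numeric literal.

```r
# A few values seen in the results above
example_seeds <- c("seed", "1", "123", "iseed", "4.0", "ss", "13121098")

# Digits, optionally followed by a decimal part, count as a literal number
is_literal <- grepl("^[0-9]+(\\.[0-9]+)?$", example_seeds)

literal_seeds <- example_seeds[is_literal]
variable_seeds <- example_seeds[!is_literal]
literal_seeds
variable_seeds
```

This would not recover the value behind a variable, it only separates the two kinds of entries.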

I was surprised by two values. First, 13121098.

```r
dplyr::filter(seeds, seed == "13121098") %>%
  head(n = 10) %>%
  knitr::kable()
```

| seed | url |
|:-----|:----|
| 13121098 | https://github.com/DJRumble/Swirl-Course/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/swirldev/swirl_courses/blob/b3d432bfdf480c865af1c409ee0ee927c1fdbda0/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/1vbutkus/swirl/blob/310874100536e1e7c66861eced9ecb52939a3e0a/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/gotitsingh13/swirldev/blob/b7369b974ba76716fbcf6101bcbdc2db2f774d18/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/pauloramazza/swirl_courses/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/ildesoft/Swirl_Courses/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/hrdg/Regression_Models/blob/22f47ecf2ae62f553aa132d3d948cc6b4e1599cc/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/Rizwanabro/Swirl-Course/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/mkgiitr/Data-Analytics/blob/1d659db1e9137b1fe595a6ef3356887de431b1be/win-library/3.1/swirl/Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
| 13121098 | https://github.com/Jutair/R-programming-Coursera/blob/4faeed6ca780ee7f14b224c293cae77293146f37/Swirl/Rsubversion/branches/Writing_swirl_Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |

I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?

The other number that surprised me was 42, but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything”. I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?
