Sow the seeds, know the seeds

April 11, 2017
(This article was first published on Maëlle, and kindly contributed to R-bloggers)

When you do simulations, for instance in R, e.g. drawing samples from a distribution, it’s best to set a random seed via the function set.seed in order to have reproducible results. The seed argument of the function has no default value, so you have to pick one yourself. I think I mostly use set.seed(1). Last week I received an R script from a colleague in which he used a weird number in set.seed (maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages, I used the GitHub API via the gh package to get a very rough answer (an answer seedling from the question seed?).
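As a quick reminder of what the seed buys you, here is a minimal sketch in base R. The variable names are only for illustration.

```r
# Drawing twice after setting the same seed gives identical samples.
set.seed(1)
first_draw <- rnorm(5)

set.seed(1)
second_draw <- rnorm(5)

identical(first_draw, second_draw)

# Without resetting the seed, the generator moves on and the next draw differs.
third_draw <- rnorm(5)
identical(first_draw, third_draw)
```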

From the GitHub API search endpoint you can get up to 1,000 results for a query, which in the case of set.seed occurrences in R code isn’t the whole picture but hopefully a good sample. I wrote a function to process the output of a query to the API, taking advantage of the stringr package. I just want the thing inside set.seed() from the text matches returned by the API.

# Extract whatever sits inside set.seed(...) from one search result item.
get_seeds_from_matches <- function(item){
  url <- item$html_url
  matches <- item$text_matches
  # keep only the text fragment of each match
  matches <- unlist(lapply(matches, "[[", "fragment"))
  # one element per line of code
  matches <- stringr::str_split(matches, "\n", simplify = TRUE)
  matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
  matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
  # drop the closing parenthesis and anything after it
  seeds <- stringr::str_replace(matches, "\\).*", "")
  seeds <- seeds[!is.na(seeds)]
  tibble::tibble(seed = seeds,
                 url = rep(url, length(seeds)))
}
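To see what these string operations do, here is a small sketch run on a made-up code fragment. The fragment is hypothetical and not an actual API response; it only assumes that the stringr package is installed.

```r
# A hypothetical text fragment, similar in shape to what a text match might contain.
fragment <- "library(MASS)\nset.seed(123)\nx <- rnorm(10)\nset.seed(seed)"

# One element per line of code.
code_lines <- stringr::str_split(fragment, "\n")[[1]]

# Keep the set.seed(...) call when there is one, NA otherwise.
calls <- stringr::str_extract(code_lines, "set\\.seed\\(.*\\)")

# Strip the function name and the closing parenthesis.
inside <- stringr::str_replace(calls, "set\\.seed\\(", "")
inside <- stringr::str_replace(inside, "\\).*", "")

# Drop the lines that had no set.seed() call.
toy_seeds <- inside[!is.na(inside)]
toy_seeds
```

Note that a variable name such as seed comes out just like a number does, since the extraction never evaluates the code.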

After that I made the queries themselves, pausing every 30 pages because of the rate limiting, and adding a try around the call in order to stop as soon as I reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionist mood.

Note that the header "Accept" = 'application/vnd.github.v3.text-match+json' is very important, without it you wouldn’t get the text fragments in the results.

library("gh")
seeds <- NULL

ok <- TRUE
page <- 1
while(ok){
  # try() so that the error past the last available page ends the loop
  matches <- try(gh("/search/code", q = "set.seed&language:r",
                    .token = Sys.getenv("GITHUB_PAT"),
                    .send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'),
                    page = page), silent = TRUE)
  ok <- !inherits(matches, "try-error")

  if(ok){
    seeds <- dplyr::bind_rows(seeds, dplyr::bind_rows(lapply(matches$items,
                                                             get_seeds_from_matches)))
  }

  page <- page + 1
  # wait 2 minutes every 30 pages
  if(page %% 30 == 1 && page > 1){
    Sys.sleep(120)
  }
}
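The "loop until the call fails" pattern can be tried offline. Below is a minimal sketch, assuming a hypothetical fetch_page() helper that stands in for the API call and raises an error once the pages run out. Neither fetch_page() nor collect_pages() is part of gh; both names are made up for illustration.

```r
# Hypothetical stand-in for the API call: pages beyond `last_page` raise an error,
# mimicking an endpoint that refuses to return results past its cap.
fetch_page <- function(page, last_page = 5) {
  if (page > last_page) stop("no more results")
  data.frame(page = page, seed = as.character(page))
}

collect_pages <- function(pause_every = 30, pause_seconds = 0) {
  results <- list()
  page <- 1
  repeat {
    # Turn the error into NULL so that the first failure ends the loop.
    chunk <- tryCatch(fetch_page(page), error = function(e) NULL)
    if (is.null(chunk)) break
    results[[page]] <- chunk
    # Pause at regular intervals to stay under a rate limit.
    if (page %% pause_every == 0) Sys.sleep(pause_seconds)
    page <- page + 1
  }
  do.call(rbind, results)
}

all_pages <- collect_pages()
nrow(all_pages)
```

A limitation of this pattern, in the sketch as in the real loop, is that any error ends the collection, not only the one signalling the last page.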

save(seeds, file = "data/2017-04-12-seeds.RData")
library("magrittr")
load("data/2017-04-12-seeds.RData")
head(seeds) %>%
  knitr::kable()
|seed |url |
|:----|:---|
|1 |https://github.com/berndbischl/ParamHelpers/blob/9d374430701d94639cc78db84f91a0c595927189/tests/testthat/helper_zzz.R |
|1 |https://github.com/TypeFox/R-Examples/blob/d0917dbaf698cb8bc0789db0c3ab07453016eab9/ParamHelpers/tests/testthat/helper_zzz.R |
|1 |https://github.com/cran/ParamHelpers/blob/92a49db23e69d32c8ae52585303df2875d740706/tests/testthat/helper_zzz.R |
|4.0 |https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
|4.0 |https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
|4.0 |https://github.com/KellyBlack/R-Object-Oriented-Programming/blob/efbb0b81063baa30dd9d56d5d74b3f73b12b4926/chapter4/chapter_4_ex11.R |

I got 984 entries, not 1,000, so maybe I lost some seeds in the process or the results weren’t perfect. I also added the URL of each script to the results so that I could go and look at the code around surprising seeds.

Let’s have a look at the most frequent seeds in the sample.

table(seeds$seed) %>%
  broom::tidy() %>%
  dplyr::arrange(dplyr::desc(Freq)) %>%
  head(n = 12) %>%
  knitr::kable()
|Var1 | Freq|
|:----|----:|
|seed | 312|
|1 | 134|
|123 | 60|
|iseed | 48|
|10 | 47|
|13121098 | 28|
|ss | 24|
|20 | 21|
|1234 | 18|
|42 | 18|
|123456 | 15|
|0 | 14|

So the most prevalent seed is a mystery: it is a variable called seed, and I’m not motivated enough to go scrape the code to find out whether it gets assigned a value beforehand, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?
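One way to set such variable names aside and rank only the literal numbers is sketched below in base R. The toy_seeds vector is made up for illustration and is not the scraped sample.

```r
# Made-up extracted seeds, mixing variable names and literal numbers.
toy_seeds <- c("seed", "1", "123", "iseed", "1", "42", "seed", "4.0")

# TRUE for plain numbers such as "1" or "4.0", FALSE for names such as "seed".
is_literal <- grepl("^[0-9]+(\\.[0-9]+)?$", toy_seeds)

# Frequency table of the literal seeds only, most frequent first.
literal_counts <- sort(table(toy_seeds[is_literal]), decreasing = TRUE)
literal_counts
```

This only tells names and numbers apart. Finding the value actually assigned to a variable like seed would still require reading the surrounding code.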

I was surprised by two values. First, 13121098.

dplyr::filter(seeds, seed == "13121098") %>%
  head(n = 10) %>% 
  knitr::kable()
|seed |url |
|:----|:---|
|13121098 |https://github.com/DJRumble/Swirl-Course/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/swirldev/swirl_courses/blob/b3d432bfdf480c865af1c409ee0ee927c1fdbda0/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/1vbutkus/swirl/blob/310874100536e1e7c66861eced9ecb52939a3e0a/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/gotitsingh13/swirldev/blob/b7369b974ba76716fbcf6101bcbdc2db2f774d18/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/pauloramazza/swirl_courses/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/ildesoft/Swirl_Courses/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/hrdg/Regression_Models/blob/22f47ecf2ae62f553aa132d3d948cc6b4e1599cc/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/Rizwanabro/Swirl-Course/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/mkgiitr/Data-Analytics/blob/1d659db1e9137b1fe595a6ef3356887de431b1be/win-library/3.1/swirl/Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
|13121098 |https://github.com/Jutair/R-programming-Coursera/blob/4faeed6ca780ee7f14b224c293cae77293146f37/Swirl/Rsubversion/branches/Writing_swirl_Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |

I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?

The other number that surprised me was 42, but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything”. I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?
