Sow the seeds, know the seeds

[This article was first published on Maëlle, and kindly contributed to R-bloggers.]

When you run simulations, for instance in R, e.g. drawing samples from a distribution, it’s best to set a random seed via the function set.seed in order to have reproducible results. Its seed argument has no default value. I think I mostly use set.seed(1). Last week I received an R script from a colleague in which he used a weird number in set.seed (maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages, I used the GitHub API via the gh package to get a very rough answer (an answer seedling from the question seed?).
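As a minimal sketch of what reproducibility means here (the seeds 1 and 2 and the use of rnorm are only illustrative choices), resetting the same seed before drawing gives the same sample, while another seed gives a different one:

```r
# Draw three standard normal values after setting the seed
set.seed(1)
first_draw <- rnorm(3)

# Reset the same seed and draw again
set.seed(1)
second_draw <- rnorm(3)

# Same seed, same sample
identical(first_draw, second_draw)

# A different seed gives a different sample
set.seed(2)
third_draw <- rnorm(3)
identical(first_draw, third_draw)
```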

From the GitHub API search endpoint you can get up to 1,000 results for a query, which in the case of set.seed occurrences in R code isn’t the whole picture but is hopefully a good sample. I wrote a function to process the output of a query to the API, taking advantage of the stringr package. I just want the thing inside set.seed() from the text matches returned by the API.

get_seeds_from_matches <- function(item){
  url <- item$html_url
  # each text match carries a "fragment": the code snippet around the hit
  matches <- item$text_matches
  matches <- unlist(lapply(matches, "[[", "fragment"))
  # one element per line of code
  matches <- stringr::str_split(matches, "\n", simplify = TRUE)
  # keep only what is inside set.seed()
  matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
  matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
  seeds <- stringr::str_replace(matches, "\\).*", "")
  # lines without a set.seed() call give NA, drop them
  seeds <- seeds[!is.na(seeds)]
  tibble::tibble(seed = seeds,
                 url = rep(url, length(seeds)))
}
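To see what the extraction steps of get_seeds_from_matches do, here is a small sketch that applies the same stringr calls to a made-up code fragment. The fragment string is hypothetical, not real API output.

```r
library("stringr")

# Hypothetical fragment, standing in for one "fragment" field of the API response
fragment <- "set.seed(42)\nx <- rnorm(10)\nset.seed(seed)  # reuse"

# One element per line of code
code_lines <- str_split(fragment, "\n")[[1]]

# Keep the set.seed(...) call, NA for lines without one
calls <- str_extract(code_lines, "set\\.seed\\(.*\\)")

# Strip the function name, then everything from the first closing parenthesis on
inner <- str_replace(calls, "set\\.seed\\(", "")
toy_seeds <- str_replace(inner, "\\).*", "")

# Drop the lines that had no set.seed() call
toy_seeds <- toy_seeds[!is.na(toy_seeds)]
toy_seeds
```

Because the last step cuts at the first closing parenthesis, a nested call such as set.seed(as.integer(Sys.time())) would come out truncated as "as.integer(Sys.time(". That is acceptable for a rough count but worth keeping in mind when reading the results.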

After that I made the queries themselves, pausing every 30 pages because of rate limiting, and wrapping the call in try so that the loop stops as soon as a request fails, i.e. once I had reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionist mood.

Note that the header "Accept" = 'application/vnd.github.v3.text-match+json' is very important: without it you wouldn’t get the text fragments in the results.
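A minimal sketch of a single request with that header could look as follows. The helper name first_fragment is hypothetical, and the call needs network access plus a token in the GITHUB_PAT environment variable, so it is left commented out.

```r
# The media type that asks GitHub to include text-match metadata
text_match_header <- c("Accept" = "application/vnd.github.v3.text-match+json")

# Hypothetical helper: query the first page of results and return the
# code fragment of the first text match of the first item
first_fragment <- function(query = "set.seed&language:r") {
  res <- gh::gh("/search/code", q = query,
                .token = Sys.getenv("GITHUB_PAT"),
                .send_headers = text_match_header)
  res$items[[1]]$text_matches[[1]]$fragment
}

# Needs network access and a valid token:
# first_fragment()
```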

library("gh")
library("dplyr") # for bind_rows() and the %>% pipe

seeds <- NULL

ok <- TRUE
page <- 1

# request pages until a call fails, i.e. past the 1,000 results
while(ok){
  matches <- try(gh("/search/code", q = "set.seed&language:r",
                    .token = Sys.getenv("GITHUB_PAT"),
                    .send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'),
                    page = page), silent = TRUE)
  ok <- !inherits(matches, "try-error")
  if(ok){
    seeds <- bind_rows(seeds, bind_rows(lapply(matches$items,
                                               get_seeds_from_matches)))
  }

  page <- page + 1
  # wait 2 minutes every 30 pages
  if(page %% 30 == 1 && page > 1){
    Sys.sleep(120)
  }
}

save(seeds, file = "data/2017-04-12-seeds.RData.RData")
head(seeds)

[Output table with columns seed and url; rows not preserved.]

I got 984 entries rather than 1,000, so maybe I lost some seeds in the process or the results weren’t perfect. I also added the URL of each script to the results so that I could go and look at the code around surprising seeds.

Let’s have a look at the most frequent seeds in the sample.

table(seeds$seed) %>%
  broom::tidy() %>%
  dplyr::arrange(dplyr::desc(Freq)) %>%
  head(n = 12)

Var1        Freq
seed         312
1            134
123           60
iseed         48
10            47
13121098      28
ss            24
20            21
1234          18
42            18
123456        15
0             14
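The same kind of frequency table can also be built without the intermediate table object. This is only an alternative sketch using dplyr::count on a toy data frame; the toy values are made up and are not the real results.

```r
library("dplyr")

# Toy stand-in for the seeds table (values are made up for illustration)
toy_seeds <- tibble::tibble(
  seed = c("seed", "1", "42", "seed", "1", "seed")
)

# Count each seed and sort from most to least frequent;
# the result has the columns seed and n
seed_counts <- count(toy_seeds, seed, sort = TRUE)
seed_counts
```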

So the most prevalent seed is a mystery: it is a variable called seed, and I’m not motivated enough to go scrape the code to find out whether it gets assigned a value beforehand, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?
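One rough way to separate literal seeds from variable names is sketched here on a few values from the frequency table. The digits-only rule is an assumption, and it would miss literals written as 1L or 1e3.

```r
# A few seed strings taken from the frequency table
found <- c("seed", "1", "123", "iseed", "10", "13121098", "ss", "42", "0")

# A seed made only of digits is a literal integer; anything else is
# treated here as a variable name or an expression
is_literal <- grepl("^[0-9]+$", found)

literal_seeds <- found[is_literal]
other_seeds <- found[!is_literal]

literal_seeds
other_seeds
```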

I was surprised by two values. First, 13121098.

dplyr::filter(seeds, seed == "13121098") %>%
  head(n = 10)

[Output table with columns seed and url; rows not preserved.]

I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?

The other number that surprised me was 42, but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything”. I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?
