From a random generator to a sample function

Posted on May 14, 2013 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This week-end, I wrote a post since I had some trouble to generate a sample random sample with R, to reproduce one obtained by a co-author, with SAS (generated using Fishman and Moore (1982) used in function RANUNI). I was lucky since another contributor for that book, Christrophe Dutang, got the anwer to the last question I asked: is it possible to reproduce the random generator ? Yes, we can. And it is quite simple, if you use the appropriate library and the appropriate function,

> library(randtoolbox)
> a <- 397204094
> b <- 0
> m <- 2^(31)-1
> set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) 
> runif(10)
 [1] 0.7503961 0.3209120 0.1783896 0.9060334 0.3571171
 [6] 0.2211140 0.7864383 0.3980819 0.1246652 0.1876858

If you check in the previous post this is exactly what SAS gave us (and that I could not reproduce by myself). But that was only one part of my problems, since the goal was actually to reproduce indices for a training subsample for credit scoring issues.

I have to admit that I had never though about it before: how should we write a sample function? If values can be replaced, that is fine, we just have to split the unit interval correctly. Like

> set.seed(95)
> (U=runif(10))
 [1] 0.15171155 0.57584087 0.05309844 0.07044485 0.48887914 0.15276707
 [7] 0.37405684 0.30006096 0.96997126 0.30373498
> set.seed(95)
> (R=sample(0:99,size=10,replace=TRUE))
 [1] 15 57  5  7 48 15 37 30 96 30

Here, we just truncate from the values obtained from the random generator. And that is just fine. But how do we write a code to sample without replacements? I mean, how do you get that :

> (S=sample(0:99,size=10,replace=FALSE))
 [1] 15 57  5  6 46 14 35 27 89 92

My initial idea was to use the following technique. The first value is easy to get: we just split the unit interval into 100 subdivision (as before since for the first value, replacement or not, we don’t care) and see in which interval the random value is. And we remove that value from our sample. Then, we split the unit intervall into 99 subdivision, and see in which interval the random value is. It is the 10th? Fine, then our second value is the 10th from our sample (the first value has been removed). Then we split the unit interval in 98 subdivision, etc. The code I wrote to produce that algorithm was the following,

> mysample1=function(N,unif){
+  n=length(N)
+  size=length(unif)
+  V0=N[trunc(unif[1]*n)+1]
+  N=N[-which(N==V0)]
+  V=V0
+  for(i in 2:length(unif)){
+    V0=N[trunc(unif[i]*(n-i+1))+1]
+    N=N[-which(N==V0)]
+    V=c(V,V0)}
+ return(V)}

Unfortunetely, I could not reproduce the sample obtained with the R function,

> mysample1(0:99,unif=U)
 [1] 15 58  5  7 49 17 39 31 97 32
> S
 [1] 15 57  5  6 46 14 35 27 89 92

Since Christrophe is an expert on random generators, I did ask him, one more time. And he came up with the following code,

> mysample2=function(N,unif){
+   integerset=1:length(N)
+   result=rep(NA,length(unif))
+   for(i in 1:length(unif)){
+     intchosen=integerset[ceiling(U[i]*(length(N)-i+1))]
+     integerset[intchosen]=length(integerset)
+     integerset=integerset[-length(integerset)]
+     result[i]=intchosen}
+   return(N[result])}

which works just fine !

> mysample2(0:99,unif=U)
 [1] 15 57  5  6 46 14 35 27 89 92
> S
 [1] 15 57  5  6 46 14 35 27 89 92

So now, not only can we reproduce random numbers obtained with other software, we can also obtain the same samples indices, with or without replacement ! Thanks Christophe !

[May, 15th] Note that this is note the generator used in SAS, actually. In order to reproduce the sample function of SAS, the algorithm is much more simple, by clearly not that efficicient since we generate a random sample of size 100 (if we have 100 observations), and then, we keep the values associated to the indices of the 10 smallest (if we want a sample of size 10). The code could be

> mysample3=function(N,unif,size){
+ q=sort(unif)[size]
+ return(N[U<=q])}

> library(randtoolbox)
> a <- 397204094
> b <- 0
> m <- 2^(31)-1
> set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) #OK

> U=runif(100)
> mysample3(1:100,U,size=10)
 [1] 27 37 47 59 60 71 75 82 84 87

Thanks Jean-Philippe for the idea (which works).

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.