**Freakonometrics » R-english**, and kindly contributed to R-bloggers)

This week-end, I wrote a post since I had some trouble to generate a sample random sample with R, to reproduce one obtained by a co-author, with SAS (generated using Fishman and Moore (1982) used in function RANUNI). I was lucky since another contributor for that book, Christrophe Dutang, got the anwer to the last question I asked: is it possible to reproduce the random generator ? Yes, we can. And it is quite simple, if you use the appropriate library and the appropriate function,

> library(randtoolbox) > a <- 397204094 > b <- 0 > m <- 2^(31)-1 > set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) > runif(10) [1] 0.7503961 0.3209120 0.1783896 0.9060334 0.3571171 [6] 0.2211140 0.7864383 0.3980819 0.1246652 0.1876858

If you check in the previous post this is exactly what SAS gave us (and that I could not reproduce by myself). But that was only one part of my problems, since the goal was actually to reproduce indices for a training subsample for credit scoring issues.

I have to admit that I had never though about it before: how should we write a sample function? If values can be replaced, that is fine, we just have to split the unit interval correctly. Like

> set.seed(95) > (U=runif(10)) [1] 0.15171155 0.57584087 0.05309844 0.07044485 0.48887914 0.15276707 [7] 0.37405684 0.30006096 0.96997126 0.30373498 > set.seed(95) > (R=sample(0:99,size=10,replace=TRUE)) [1] 15 57 5 7 48 15 37 30 96 30

Here, we just truncate from the values obtained from the random generator. And that is just fine. But how do we write a code to sample *without replacements*? I mean, how do you get that :

> (S=sample(0:99,size=10,replace=FALSE)) [1] 15 57 5 6 46 14 35 27 89 92

My initial idea was to use the following technique. The first value is easy to get: we just split the unit interval into 100 subdivision (as before since for the first value, replacement or not, we don’t care) and see in which interval the random value is. And we remove that value from our sample. Then, we split the unit intervall into 99 subdivision, and see in which interval the random value is. It is the 10th? Fine, then our second value is the 10th from our sample (the first value has been removed). Then we split the unit interval in 98 subdivision, etc. The code I wrote to produce that algorithm was the following,

> mysample1=function(N,unif){ + n=length(N) + size=length(unif) + V0=N[trunc(unif[1]*n)+1] + N=N[-which(N==V0)] + V=V0 + for(i in 2:length(unif)){ + V0=N[trunc(unif[i]*(n-i+1))+1] + N=N[-which(N==V0)] + V=c(V,V0)} + return(V)}

Unfortunetely, I could not reproduce the sample obtained with the R function,

> mysample1(0:99,unif=U) [1] 15 58 5 7 49 17 39 31 97 32 > S [1] 15 57 5 6 46 14 35 27 89 92

Since Christrophe is an expert on random generators, I did ask him, one more time. And he came up with the following code,

> mysample2=function(N,unif){ + integerset=1:length(N) + result=rep(NA,length(unif)) + for(i in 1:length(unif)){ + intchosen=integerset[ceiling(U[i]*(length(N)-i+1))] + integerset[intchosen]=length(integerset) + integerset=integerset[-length(integerset)] + result[i]=intchosen} + return(N[result])}

which works just fine !

> mysample2(0:99,unif=U) [1] 15 57 5 6 46 14 35 27 89 92 > S [1] 15 57 5 6 46 14 35 27 89 92

So now, not only can we reproduce random numbers obtained with other software, we can also obtain the same samples indices, with or without replacement ! Thanks Christophe !

**[May, 15th]** Note that this is note the generator used in SAS, actually. In order to reproduce the sample function of SAS, the algorithm is much more simple, by clearly not that efficicient since we generate a random sample of size 100 (if we *have* 100 observations), and then, we keep the values associated to the indices of the 10 smallest (if we want a sample of size 10). The code could be

> mysample3=function(N,unif,size){ + q=sort(unif)[size] + return(N[U<=q])} > library(randtoolbox) > a <- 397204094 > b <- 0 > m <- 2^(31)-1 > set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) #OK > U=runif(100) > mysample3(1:100,U,size=10) [1] 27 37 47 59 60 71 75 82 84 87

Thanks Jean-Philippe for the idea (which works).

**leave a comment**for the author, please follow the link and comment on his blog:

**Freakonometrics » R-english**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...