Reproducibility and randomness

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With Stéphane Tufféry, we were working this week on a chapter of a book, entitled Statistical Learning in Actuarial Science. The chapter should be based on R functions, and we wanted to reproduce some outputs he previously obtained with SAS. The good thing is that even complex functions (logistic regression, regression trees, etc) produce the same kind of outputs. But we found a problem that we could not fix: generating identical training subsets of observations… Out of 1,000 lines, we subsample about 600 lines. The problem is that we could not generate the same sets of indexes with R, and SAS. Even using similar random generators… (execpt if we want to extract 1 or 2 lines, no more).

Let us try to explain what’s going on (based on code produced by Stéphane). According to Eubank (2010), there are (at least) two generators of random numbers in SAS,

For instance, for the RAND function, if we generate a Gaussian sample – with Mersenne-Twister algorithm – the code should be

%LET SEED =6;
%LET NREP=10;
DATA TESTRANDOM ;
  CALL STREAMINIT(&SEED);
  DO REP = 1 TO &NREP;
   X = RAND ('NORMAL'); 
   OUTPUT;
  END;
RUN;
PROC PRINT DATA = TESTRANDOM ;
RUN ;

And we get here

Obs. REP X
1 1 2.10680
2 2 -0.25604
3 3 0.28692
4 4 -0.22806
5 5 1.34569
6 6 0.16341
7 7 -0.27788
8 8 0.02133
9 9 1.24050
10 10 1.01054

 

If we want a Uniform sample, it should be

%LET SEED =6;
%LET NREP=10;
DATA TESTRANDM ;
CALL STREAMINIT(&SEED);
DO REP = 1 TO &NREP;
  X = RAND ('UNIFORM'); 
  OUTPUT;
END;
RUN;
PROC PRINT DATA = TESTRANDOM ;
RUN ;
Obs. REP X
1 1 0.66097
2 2 0.48044
3 3 0.87849
4 4 0.19916
5 5 0.04838
6 6 0.19966
7 7 0.81353
8 8 0.53807
9 9 0.01105
10 10 0.53753

On good thing (so far) about the latest, is that Mersenne-Twister has been coded in R, in the RNGkind function

> RNGkind("Mersenne")
> set.seed(6)
> runif(10)
 [1] 0.64357277 0.91590888 0.09523258 0.29537280
 [5] 0.76993170 0.25589353 0.51789573 0.67784993
 [9] 0.14722782 0.70052604

But the output is different, even if we’re supposed to start, here, with the same seed. Now, if we want to make sure about what is done here, let us write our own codes of the Fishman and Moore (1982) algorithm (in order to reproduce the SAS output). The R version of

> a = 397204094      # RANUNI multiplier
> seed = 123         # seed
> n = 10             # sample size
> m = (2^31) - 1     # period
> for (i in (1:n-1)) {
+ seed = (a*seed)%%m
+ u = seed / m
+ print(u)
+ }
[1] 0.7503961
[1] 0.3209121
[1] 0.3453204
[1] 0.2683455
[1] 0.241798
[1] 0.9888181
[1] 0.3943279
[1] 0.9710172
[1] 0.001632214
[1] 0.921537

Let us now run a similar code with SAS,

%LET SEED =123;
%LET NREP=10;
DATA FRANUNI (KEEP = x) ;
seed = &SEED ;
DO REP = 1 TO &NREP;
  CALL RANUNI(seed, x);
  OUTPUT;
END;
RUN;
PROC PRINT DATA = FRANUNI ;
RUN ;

and we get the following output

Obs. x
1 0.75040
2 0.32091
3 0.17839
4 0.90603
5 0.35712
6 0.22111
7 0.78644
8 0.39808
9 0.12467
10 0.18769

It looks like here, indeed, we start with the same seed, since the first two numbers generated are similar. But then, it looks like we really have random numbers… If we change the seed, the first two numbers are similar, but that’s all.

We might be missing something trivial here, but we did not see it. So if anyone has a clue about reproducibility issues when generating random samples, with R and SAS, we are interested !

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1.  Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.

More PostsWebsite

Follow Me:
TwitterLinkedInGoogle Plus

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)