Simulating the Birthday Problem with data derived probabilities
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You've probably heard of the Birthday Paradox: it only takes a small gathering of people before it's quite likely that two of them share the same birthday. You can solve the problem analytically or with simulation, but usually in either case simplifying assumptions are made (no-one born on February 29, for example). Joe Rickert uses Revolution R Enterprise 6 to investigate whether these assumptions make a difference, and finds evidence for a “Summer Baby Boom” along the way. – Ed.
William Feller’s “An Introduction to Probability Theory and Its Applications” (Wiley 1950) is a beautiful book. Written with remarkable clarity, it is still very much worth reading. On page 33 (3rd ed. 1968) Feller gives a formula for calculating the probability that at least two people in a random sample of size r share a common birthday. Assuming 365 days in a year and a uniform probability of being born on any given day, Feller uses a simple combinatorial argument to derive his “first approximation” solution and notes that when r = 23 there is a better than 50% chance of a match. He also provides a formula for a numerical approximation suggesting that for “large” r the calculation should be done with logarithms. Apparently, in Feller’s time calculating the probability of at least one matching birthday for a given sample size was a bit of a pain.
Nevertheless, it is astounding that of all the problems and examples in Feller’s two-volume work the “birthday problem”, as we now call it, should have reached urban legend fame as an example of how non-intuitive probability can be. It’s on YouTube, NPR’s “Math Guy” has discussed it, and Trumbo et al. published a very elegant R based presentation in the 2004 Proceedings of the American Statistical Association, Session 345. In a recent blog post, Laura McLay poses the question: Have you simulated the Birthday Problem with and unequal birthday distribution?”. And so it is, having been given four years (1985 to 1988) of day of the week birth data (which my colleague, Sue Ranney, imported from the CDC for a talk she gave at Use-R last year) I find myself compelled to do the simulation. The small portion of the CDC file that I have looks like this:
Number of observations: 15252733 Number of variables: 5 Number of blocks: 32 Variable information: Var 1: DOB_DAY, Day of Month Type: integer, Low/High: (1, 31) Var 2: DOB_MM, Birth Month Type: integer, Low/High: (1, 12) Var 3: DOB_YY, Birth Year Type: integer, Low/High: (1985, 1988) Var 4: WEEKDAY, Day of Week Child Born 7 factor levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday Var 5: SEX, Sex of Infant 2 factor levels: Male Female
15M records makes this an ideal job for Revolution’s RevoScaleR package. (The code for all of the data preparation and probability calculations are in the script Heat Map.R.) The first step was to transform this information into convenient form for tabulating births by day of year and estimation probabilities. The data transformations including the calculation to assign births to a 366 day year were accomplished using the transforms option of the rxDataStep function. The logic computing day of year (DOY) is
DOY = DateVar – startOfYear + 1,
DOY = ifelse(DOB_YY != 1988 & DOB_MM > 2, DOY + 1, DOY)
where the variables are as given above and startOfYear contains the date of the first of the year for each observation
Tabulating number of births by day of year was done with the function rxCube that produces a cross tabulation for factor variables in long form. On my 4-core Dell laptop with 8GB of RAM it only takes about .3 seconds to tabulate 15 million records. Dividing by the total number of births gives an estimate of the probability of birth for each day. The mean probability (shown below) is very close to 1/366 as would be expected:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0006424 0.0026360 0.0027260 0.0027320 0.0028410 0.0031970
Figure 1 plots a regression line on the data that shows that the probabilities trend slightly upward as the year progresses. The heat map in Figure 2 shows the same features of the NPR heat map mentioned in Laura’s post: low probabilities in the beginning of January, high probabilities at the end of December and a booming baby season in the summer. Figure 3, a plot of the differences between male and female birth probabilities shows some interesting structure. The differences are small numbers, but it does appear as if the year starts out favoring girls, gives boys the edge in the summer and then goes girls again.
Figure 1
A brute force simulation of the birthday problem now becomes a simple exercise of using R’s sample function to simulate from the estimated probability distribution. The script Birthday Problem contains the code to do this using the RevoScaleR function rxExec to run the simulation in parallel.
rxExec(pbirthday2, rxElemArg(myArgs), taskChunkSize=20)
(bpirthday2 is a function that performs the simulation and rxElemArg passes the required arguments to pbirthday2.) rxExec allows you to run arbitrary R functions in a distributed fashion using all of available computational resources. In my case, the following line of code established a local compute context instructing rxExec to distribute the computations across the 4 available cores of my laptop.
rxSetComputeContext(RxLocalParallel())
For someone who happened to have a cluster available, running the same code on the cluster would only require changing the compute context. The following code snippets respectively show how to set the compute context to be a Microsoft HPC cluster or a Linux cluster.
Code snippet for Microsoft HPC cluster
myCluster <- RxHpcServer(headNode="cluster-head2",
revoPath=”C:\\Revolution\\ . . .”,
shareDir=”\\AllShare\\myName”,
dataPath=”C:\\data”)
rxSetComputeContext(myCluster)
Code snippet for Linux cluster
myCluster <- RxLsfCluster(shareDir="/usr/share/lsf/shares/myName",
dataPath=”/usr/local/data”)
rxSetComputeContext(myCluster)
Figure 4 shows the cumulative distribution of the probability of a match built by simulating sample sizes of 2 to 100 people. Note that the color coding indicates in which node (core) that part of the calculation would be done.
Figure 4
Using 10,000 random draws for each point it took about 30 seconds to run this simulation on my laptop. The probability of at least one match for a sample size of 23 people comes out to be .5087, not far from the value: 0. 507297234 that Trumbo got using R and Feller’s approximate analytic solution. This is hardly the last word on the Birthday Problem. (It would be nice to look at the number of births as a time series over more years of data.) Nevertheless, Feller’s first approximation solution is still looking pretty good.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.