[This article was first published on R – Exegetic Analytics, and kindly contributed to R-bloggers].

In a previous post I showed that the data from www.baseball-reference.com support Malcolm Gladwell’s contention that more professional baseball players are born in August than any other month. Although this might be explained by the 31 July cutoff for admission to baseball leagues, it was suggested that it could also be linked to a larger proportion of babies being born in August.

To explore this idea I gathered data from http://www.cdc.gov/ on births in the USA between 1994 and 2014. These data, along with the baseball data, have been published as an R package, which can be installed from GitHub using

```
> devtools::install_github("DataWookie/lifespan")
> library(lifespan)
```

Let’s explore the hypothesis regarding non-uniform birth months.

```
> library(dplyr)
> group_by(births, month) %>% summarise(count = sum(count))
Source: local data frame [12 x 2]

    month   count
   (fctr)   (int)
1     Jan 6906798
2     Feb 6448725
3     Mar 7080880
4     Apr 6788266
5     May 7112239
6     Jun 7059986
7     Jul 7461489
8     Aug 7552007
9     Sep 7365904
10    Oct 7220646
11    Nov 6813037
12    Dec 7079453
```

The monthly totals are clearly non-uniform. A chi-squared test against equal probabilities for every month gives a vanishingly small p-value:

```
> chisq.test(.Last.value$count)

	Chi-squared test for given probabilities

data:  .Last.value$count
X-squared = 149000, df = 11, p-value <2e-16
```
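One caveat: called without a `p` argument, `chisq.test()` assumes equal probabilities for all twelve months, yet months differ in length. The following is a minimal sketch of a length-adjusted test, with the monthly totals copied from the output above. The `days` vector and the leap-year count are assumptions added for illustration and are not part of the lifespan package.

```r
# Monthly birth totals for 1994-2014, copied from the summary table above.
counts <- c(Jan = 6906798, Feb = 6448725, Mar = 7080880, Apr = 6788266,
            May = 7112239, Jun = 7059986, Jul = 7461489, Aug = 7552007,
            Sep = 7365904, Oct = 7220646, Nov = 6813037, Dec = 7079453)

# Number of calendar days in each month, summed over the 21 years covered.
years <- 1994:2014
leap  <- sum((years %% 4 == 0 & years %% 100 != 0) | years %% 400 == 0)
days  <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31) * length(years)
days[2] <- days[2] + leap   # add one extra February day per leap year

# Null hypothesis: births are spread evenly over days, so each month's
# expected share is proportional to its length.
result <- chisq.test(counts, p = days / sum(days))
print(result)

# Average births per day, which is comparable across months of different length.
births_per_day <- counts / days
print(round(sort(births_per_day, decreasing = TRUE)))
```

Comparing `births_per_day` with the raw totals shows how much of the month-to-month variation is due simply to month length, and the ranking of months need not be the same on both measures.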

We can dig into that a little deeper and see the total number of births between 1994 and 2014 broken down by month. The aggregate for August is certainly higher than any other month, but only marginally larger than that for July.
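A bar chart makes this comparison easier to see. The sketch below uses base R graphics and hard-codes the monthly totals from the table above instead of relying on any plotting helper from the package. The title and axis label are illustrative choices.

```r
# Monthly birth totals for 1994-2014, copied from the summary table above.
counts <- c(Jan = 6906798, Feb = 6448725, Mar = 7080880, Apr = 6788266,
            May = 7112239, Jun = 7059986, Jul = 7461489, Aug = 7552007,
            Sep = 7365904, Oct = 7220646, Nov = 6813037, Dec = 7079453)

# Bar chart of the totals, in millions, with month labels drawn vertically.
barplot(counts / 1e6,
        las = 2,
        ylab = "Births (millions)",
        main = "US births by month, 1994-2014")

# Ratio of the August total to the July total, to quantify "marginally larger".
aug_vs_jul <- counts["Aug"] / counts["Jul"]
print(aug_vs_jul)
```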

Delving still deeper we find that the monthly counts exhibit significant variation from year to year and that August has some appreciable outliers.

Specifically, August 2006 and August 2007 appear to have been bumper birth months. Interesting!

```
> group_by(births, year, month) %>% summarise(count = sum(count)) %>% ungroup() %>%
+   arrange(desc(count))
Source: local data frame [252 x 3]

    year  month  count
   (int) (fctr)  (int)
1   2007    Aug 391117
2   2006    Aug 388481
3   2007    Jul 380356
4   2008    Jul 376105
5   2006    Sep 375389
6   2008    Aug 374028
7   2007    Oct 370069
8   2005    Aug 370045
9   2009    Jul 369117
10  2008    Sep 368660
..   ...    ...    ...
```

Of course, a peak in overall births in August does not imply a direct causal link to the peak in professional baseball players’ births. But its contribution cannot be ignored.
