# Baby Boomers

December 22, 2015
By

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Campbell

As another new baby card is passed around the office, and the latest cute baby pictures are emailed out, a discussion is underway. Could it be true? Something in the water? An elixir of fertility that should be bottled and sold to desperate couples for enormous profit? Is Mango having some crazy baby boom?!

Mango has been having something of a population explosion. In the first seven years of Mango there were only a couple of babies. Since moving to Methuen Park it seems like there’s been one every few months. So is there really something magical happening here?

Of course not. Now demonstrating this kind of data is an absolute HR nightmare, so please forgive me if I’m somewhat coy with revealing the raw datasets. However, I’d like to share a small analysis I did as an example of how to set up a dataset for survival analysis with multiple events. To investigate the hypothesis that there’d been a change in the rate of baby production, I had two datasets. One, `babies`, has the Date of birth of the child, as well as some covariates. Splitting births into events in the first 3432 days (9.4 years) and most recent 1268 days (3.5 years) shows that 40% the number of babies were born before moving to our current home, Methuen Park, compared to when at the new offices. In fact, the rate jumps from 0.4 babies/year to 2.9 babies/year!

```names(babies)
# [1] "Name"   "Date"   "Sex"    "Parent"
library(dplyr)
day1 <- as.Date("2002-10-01")
summarize(
.data = group_by(x = babies, Methuen = Date - day1 > 3432),
Number = length(Date))
# Source: local data frame [2 x 2]
#
#   Methuen Number
#     (lgl)  (int)
# 1   FALSE      4
# 2    TRUE     10
```

I helped Francis Smart develop his stick man script into a small package, stick, that you can install from GitHub. This allows me to demonstrate graphically how absolutely remarkable this step change in birth rate truly is.

There is a `plotStick` function in the package, but since I want to demonstrate which time period is at Methuen without obscuring the sticks, I added them to a blank plot with `pointsStick`. Each stick represents one additional birth. The sticks wear dresses if they are female, and are coloured to show unique parents. The mood of the face is random, to represent the changeable mood of a newborn. Sleeping stick is not yet implemented.

R Code for Stick Plot
```library(devtools)
# install stick package
install_github("EconometricsBySimulation/R-Graphics/Stick-Figures/stick")
library(stick)

# blank plot
plot(x = babies\$Date, y = seq_len(nrow(babies)),
main = "Cumulative Baby Numbers for Mango Solutions Employees",
xlab = "Date", ylab = "Total Babies", type = "n",
ylim = c(0, 15))

# highlight time at Methuen Park
methuen <- rainbow(1, s = 0.5, start = 0.2)
polygon(x = day1 + c(3432, 4700, 4700, 3432),
y = c(-2, -2, 17, 17),
border = methuen, col = methuen)
box()

pointsStick(x = babies\$Date, y = seq_len(nrow(babies)),
# female babies wear a dress I guess
gender = babies\$Sex,
# colour by parent
col = rainbow(length(unique(babies\$Parent)),
s = 0.9)[factor(babies\$Parent)],
# baby mood is unpredictable at best
face = sample(x = c("happy", "neutral", "surprised",
size = nrow(babies),
replace = TRUE, prob = (5:1) / 15),
# babies rarely, if ever keep hats on
hat = FALSE)

text(x = as.Date(c("2006-03-30", "2012-05-29")),
y = c(14, 14),
labels = c("Before Methuen", "At Methuen"),
pos = 4)
```

Plotting the number of babies born, it certainly looks like there’s a correlation between Methuen Park and babies. However, examining employee head count during the two periods trivially shows that baby birth number correlates with number of employees. Baby births and employee numbers are 40% the level at Methuen Park.

```names(employeeno)
# [1] "Date"  "Employees"

summarize(
.data = group_by(x = employeeno, Methuen = Date - day1 > 3432),
Number = median(Employees))
# Source: local data frame [2 x 2]
#
#   Methuen Number
#     (lgl)  (int)
# 1   FALSE     22
# 2    TRUE     52
```

However, for the sake of fun, we can see whether there is a difference in birth rates between these two periods. One way of measuring the occurrence of events is survival analysis. A lot of tools for survival analysis are available in the survival package. Following the common modelling idiom in R, the model is defined as a formula. Events go on the left hand side of the formula, and are coded as a Surv object (a matrix of time and event status). For data where only one event can be observed for a subject, only one event time is observed. But people who have had a baby are still at risk of having another baby!

For multiple events in the Surv object, the data need to be shaped so that the time at risk of each individual is marked as start time, end time and status: event observed or subject censored (no event observed). Subjects can occur at multiple records.

```library(data.table)
tab2 <- data.table(tab)
setkey(tab2, Employee)
tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ]
#         Employee Start Stop Status
#1:  Andy Nicholls  3031 4408      1
#2:  Andy Nicholls  4408 4595      0
#3: Chris Campbell  3361 3893      1
#4: Chris Campbell  3893 4479      1
#5: Chris Campbell  4479 4595      0
#6: Chris Musselle  4124 4595      0
```

In this experiment, subjects are at risk during two time periods. To divide this dataset into two groups based on site change at 3432 days since the start of Mango, we can use the `foverlaps` function from package data.table.

R Code for Splitting into Time Intervals
```# split date
dmethuen <- data.table(Start = 3432, Stop = 3432,
Status = 0, key = c("Start", "Stop"))
setkey(tab2, Start, Stop)

# records where split occurs flagged in Start/Stop columns
tab2 <- foverlaps(x = tab2, y = dmethuen, type = "any")
setkey(tab2, Employee)
tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ]
#    Start Stop Status       Employee i.Start i.Stop i.Status
# 1:  3432 3432      0  Andy Nicholls    3031   4408        1
# 2:    NA   NA     NA  Andy Nicholls    4408   4595        0
# 3:  3432 3432      0 Chris Campbell    3361   3893        1
# 4:    NA   NA     NA Chris Campbell    3893   4479        1
# 5:    NA   NA     NA Chris Campbell    4479   4595        0
# 6:    NA   NA     NA Chris Musselle    4124   4595        0

# bind new columns as new rows, using new Status
tab2 <- rbindlist(
list(
tab2[!is.na(Start),
list(Employee, Start = i.Start,
Stop, Status = Status)],
tab2[!is.na(Start),
list(Employee, Start = Start,
Stop = i.Stop, Status = i.Status)],
tab2[is.na(Start),
list(Employee, Start = i.Start,
Stop = i.Stop, Status = i.Status)]))

# add flag column for modelling
tab2[, Location := factor(x = Start >= 3432,
levels = c(FALSE, TRUE), labels = c("Greenways", "Methuen"))]
setkey(tab2, Employee)
tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ]
#         Employee Start Stop Status  Location
#1:  Andy Nicholls  3031 3432      0 Greenways
#2:  Andy Nicholls  3432 4408      1   Methuen
#3:  Andy Nicholls  4408 4595      0   Methuen
#4: Chris Campbell  3361 3432      0 Greenways
#5: Chris Campbell  3432 3893      1   Methuen
#6: Chris Campbell  3893 4479      1   Methuen
#7: Chris Campbell  4479 4595      0   Methuen
#8: Chris Musselle  4124 4595      0   Methuen
```

This allowed me to use the Methuen or not-Methuen as a possible covariate. I created a Cox Proportional Hazards fit of event status with Location as an explanatory variable. There was no improvement in the likelihood of the model by adding a Location effect. The rates of baby creation for a person at risk is constant, whether at Methuen Park or the old offices. As we suspected, the only difference, which we can reasonably infer to be causative, is the number of person-days at risk in the two groups.

```library(survival)
newBabyLocs <- coxph(
formula = Surv(Start, Stop, Status) ~ Location,
data = tab)
anova(newBabyLocs)
# Analysis of Deviance Table
#  Cox model: response is Surv(Start, Stop, Status)
# Terms added sequentially (first to last)
#
#           loglik Chisq Df Pr(>|Chi|)
# NULL     -48.341
# Location -48.341     0  1          1
```

This approach suggests that Location doesn’t influence risk of babies. Absence of evidence does not constitute evidence of absence, so we can’t completely rule out an effect. However, supporting evidence strongly suggests that there is a causal relationship between moving to larger offices at Methuen Park and Mango population size. The increased number of babies born simply reflects the increased number of subjects at risk. So if there is something in the water, the effect size must be very small.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...