Date of death, birthday and Elvis Presley

Posted on June 18, 2012 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers].

Ten days ago, a study published in the Annals of Epidemiology (http://www.annalsofepidemiology.org/) claimed in its title that “Death has a preference for birthdays”. The paper concludes that, in general, birthdays do not evoke a postponement mechanism but appear to end in a lethal way more frequently than expected (“anniversary reaction”). This is not new: several earlier articles made the same point, e.g. Angermeyer et al. (1987).

I found the idea interesting since in demography, there is a large literature trying to extrapolate death rates from discrete to continuous time. These extrapolations are usually extremely smooth, and none of them accounts for that excess mortality precisely on the birthday. The difficulty is that datasets with individual observations are rarely available online, so it is hard to say much.

But yesterday, @coulmont sent me a tweet mentioning a website. I do not know whether this is legal (even if some explanations are given), but the data used here come courtesy of http://ssdmf.info/. It is the so-called Social Security Death Master File, which contains individual information about deaths in the US, as well as geographic information (as described on http://www.ssa.gov/), for people who had a social security number.

With R, it is possible to work on those files (even though they are huge, with tens of millions of observations). For instance, we can check who is inside.

> elvis=scan("ssdm2",skip=22371720,n=1,what="character",sep=",")
> elvis
[1] " 409522002PRESLEY         ELVIS     0800197701081935  "
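As a minimal sketch of how such a record can be read, the two dates can be pulled out of the string printed above. The sketch assumes that the dates form the only run of 16 consecutive digits in the record (8 digits for the death, then 8 for the birth, each as MMDDYYYY); the names `rec`, `digits`, `death_raw` and `birth_raw` are only illustrative.

```r
# Record string copied from the output above
rec <- " 409522002PRESLEY         ELVIS     0800197701081935  "

# Assumption: the two dates are the only run of 16 consecutive digits
digits <- regmatches(rec, regexpr("[0-9]{16}", rec))

death_raw <- substr(digits, 1, 8)    # date of death, MMDDYYYY
birth_raw <- substr(digits, 9, 16)   # date of birth, MMDDYYYY

birth <- as.Date(birth_raw, "%m%d%Y")
death <- as.Date(death_raw, "%m%d%Y")  # day "00" is not a valid day, so this is NA

birth
death
```

A day of death coded as "00" cannot be converted to a date, which is why such records have to be set aside before counting days.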

If you believe that Elvis is dead, you might agree that this database can be accurate (or at least, not too bad). We can also see how to read a record: the last 8 digits give the date of birth (January 8, 1935) and the 8 digits before give the date of death. Elvis died on August 16, 1977, but the record only says 08/00/1977, so the dataset clearly has some problems (here, the day of death is missing). We therefore remove all observations without a known day of death. Then, since the point is to focus on days and months, we assume that everyone died in 2000 (any year would do). We compute the number of days between the date of death and the birthday in 2000 (which falls either before or after the death) and the birthday in 1999 (which always falls before), and keep the smallest non-negative value, which is the number of days since the last birthday,

# base is the character vector of raw records, one per person
dates=substr(base,66,81)
death=as.Date(substr(dates,1,8),"%m%d%Y")
birth=as.Date(substr(dates,9,16),"%m%d%Y")
# proportion of records with an unusable death or birth date
mean(is.na(death)|is.na(birth))
mdeath=substr(dates,1,2)
ddeath=substr(dates,3,4)
mbirth=substr(dates,9,10)
dbirth=substr(dates,11,12)
# keep the records with a known day of death
indice=which(ddeath!="00")
# birthday in 1999 (always before the death) and in 2000
birth1=as.Date(paste(mbirth[indice],
dbirth[indice],"1999",sep=""),"%m%d%Y")
birth2=as.Date(paste(mbirth[indice],
dbirth[indice],"2000",sep=""),"%m%d%Y")
# every death is placed in 2000
death=as.Date(paste(mdeath[indice],ddeath[indice],
"2000",sep=""),"%m%d%Y")
diffday=cbind(as.numeric(death-birth1),
as.numeric(death-birth2))
# smallest non-negative difference = days since the last birthday
# (NA when a date is invalid, e.g. February 29 in 1999)
DIFF=apply(diffday,1,function(x) {min(x[x>=0])})
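A small self-contained sketch of the same idea on three made-up month/day pairs (hypothetical values, not taken from the file): the death is placed in 2000, and the number of days elapsed since the most recent birthday is obtained by comparing with the birthday in 2000 and, when that one falls after the death, with the birthday in 1999. The names `death_md`, `birth_md` and `days_since` are only illustrative.

```r
# Hypothetical month-day pairs (MMDD), made up for illustration
death_md <- c("0816", "0108", "0301")
birth_md <- c("0108", "0108", "0302")

death  <- as.Date(paste0(death_md, "2000"), "%m%d%Y")  # all deaths placed in 2000
b_same <- as.Date(paste0(birth_md, "2000"), "%m%d%Y")  # birthday in the same year
b_prev <- as.Date(paste0(birth_md, "1999"), "%m%d%Y")  # birthday one year earlier

d_same <- as.numeric(death - b_same)
d_prev <- as.numeric(death - b_prev)

# Use the same-year birthday when it is not after the death,
# otherwise fall back on the birthday of the year before
days_since <- ifelse(d_same >= 0, d_same, d_prev)
days_since
# 221   0 365
```

The second person died on the birthday itself (0 days), while the third died one day before it, hence 365 days after the previous one (2000 being a leap year).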

What we have here is the number of days following the previous birthday. If we look at the distribution of that number of days, we obtain

counts=table(DIFF)
plot(as.numeric(names(counts)),
as.numeric(counts))
> counts["0"]/(mean(counts[100:200]))
       0 
1.121261 

Thus, the excess of deaths on the birthday is around 12%, which is rather close to the figure obtained from the Swiss mortality statistics 1969–2008 (in Ajdacic-Gross et al. (2012)). Note that here, we only play with a small subset of the entire dataset.
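As a rough sanity check on that kind of ratio, here is a sketch on simulated data with no birthday effect at all (day offsets drawn uniformly on 0 to 365, so these are not real observations). It also selects the baseline days by their label rather than by their position in the table, which is an assumption-free way to make sure a missing day cannot shift the window; `DIFF_sim`, `counts_sim`, `baseline` and `ratio` are hypothetical names.

```r
set.seed(1)

# Simulated day offsets, uniform on 0..365: no excess on day 0 by construction
DIFF_sim <- sample(0:365, 366000, replace = TRUE)
counts_sim <- table(DIFF_sim)

# Baseline: average count over days 100 to 200, selected by label
baseline <- mean(counts_sim[as.character(100:200)])

# Ratio of the count on the birthday to the baseline
ratio <- as.numeric(counts_sim["0"]) / baseline
ratio
```

Without any effect, the ratio should stay close to 1, which gives a feeling for how far from 1 a value such as 1.12 is for a given sample size.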

That database is probably extremely interesting, except that it suffers from a huge selection bias, since only dead people are in it. So it might be useless if we wish to compare the life expectancy of people named Bill with that of people named Georges (which is what I initially wanted to investigate). But we’ll see what else we can do with it (since Ewen has been able to write some code to go through that huge dataset).
