The odds of a cluster of airplane accidents
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recently, there have been a lot of airplane accidents.
- July, 17th 2014, Hrabove, Ukraine, Malaysia Airlines, Boeing 777, fatalities 298 (/298)
- July, 23rd 2014, Magong, Taiwan, TransAsia Airways, ATR 72-500, fatalities 47 (/58)
- July, 24th 2014, Aguelhok, Mali, Air Algerie, Mc Donnell Douglas MD-83, fatalities 116 (/116)
It is simple to find a lot of datasets about airplane crashes. For instance on http://ntsb.gov/aviationquery. The dataset is nice, with a lot of information,
> planes=read.table( + "cbad3ca6-6b8f-4c98-9ee0-601faAviationData.txt", + sep="|",header=TRUE)
for instance the exact location of the crashes,
> library(maps) > map("world", interior = FALSE) > points(planes$Longitude,planes$Latitude, + pch=19,cex=planes$Total.Fatal.Injuries/50, + col="red")
But I could not find recent major accidents on that (supposed to be updated) dataset.
> X=as.Date(as.character(planes$Event.Date), + "%m/%d/%Y") > Y=planes$Total.Fatal.Injuries > plot(X,Y,type="h")
It looks like it is an american dataset, so maybe domestic flights in some countries are not in this dataset?
Let us try another one, such as https://opendata.socrata.com/
> planes=read.table( + "Crashes_and_Fatalities_Since_1908-b.csv", + sep=",",header=TRUE) > X=as.Date(as.character(planes$Date),"%m/%d/%Y") > Y=planes$Fatalities > plot(X,Y,type="h")
It is not an updated dataset, but for the purpose of studying the process of airplane accident, it should be just fine. In the recent years, it seems like a stationary process. If we focus on the past 35 years, we have
> subplanes=planes[X>as.Date("01/01/1980", + "%m/%d/%Y"),] > X=as.Date(as.character(subplanes$Date), + "%m/%d/%Y") > Y=subplanes$Fatalities > plot(X,Y,type="h")
Let us study the process of accident arrival. I will keep only airplane accidents with – say – 10 fatalities, or more (it is purely arbitrary, but we need some threshold here).
> X=X[!is.na(Y)] > Y=Y[!is.na(Y)] > T=X[Y>10]
In actuarial science, the standard process we use to model accidents is the Poisson process. Meaning that the time between two (consecutive) accidents is a random variable with an exponential distribution (further, those random variables are independent). If we look at the cumulative distribution function, we get
> W=as.numeric(diff(T)) > mean(W) [1] 12.66942 > plot(ecdf(W)) > u=0:100 > lines(u,pexp(u,1/mean(W)),col="red",lwd=2)
and the histogram is
> hist(W,breaks=0:100,probability=TRUE) > lines(u,dexp(u,1/mean(W)),col="red",lwd=2)
The red line is the exponential distribution, with the same mean. We can observe that we have too many zero’s here (i.e. accidents on the same day). If we remove them, it is slightly better,
> Tu=unique(T) > W=as.numeric(diff(Tu)) > mean(W) [1] 13.2974 > hist(W,breaks=0:100,probability=TRUE) > lines(u,dexp(u,1/mean(W)),col="red",lwd=2)
(but to be honest, I have no real motivation to remove those zeros… it is possible to have two airplane crashes on the same day – at different locations, i.e. not necessarily collisions). An alternative is to consider only ‘major’ accidents, with more than 40 casualties.
> T=X[Y>40] > W=diff(T)
Since we have exponentially distributed inter-arrival duration, and the assumption of independence between duration seems valid
> acf(W)
we will consider here that an homogeneous Poisson process is a valid model. Above 40 fatalities, the average time between two major accidents is a bit more than a month,
> W=as.numeric(diff(T)) > mean(W) [1] 36.22973
so the probability to have (at least) 3 accidents within 8 days (if we use the proportionality property of the Poisson process, i.e. the number of catastrophes between times and is a Poisson variable ) is
> 1-ppois(2,8/mean(W)) [1] 0.001521955
i.e. a bit more than 1.5 over 1000. Over 8 years, if we consider some blocks of 8 days, what could be the probability that such a cluster of crashed does not occur? The probability not to have 3 crashes, over 365 blocks of 8 days is
> p=1-ppois(2,8/mean(W)) > (1-p)^365 [1] 0.5735347
meaning that over 8 years, there are more than 40% chance to have such a cluster. I think we might call that a coincidence…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.