
Ethan Siegel wrote a post entitled The Math of the Fastest Human Alive five years ago, using regressions. An alternative is to use extreme value models (a few years ago, I wrote a post on the maximum length of a tennis match using extreme value theory). In 2009, John Einmahl and Sander Smeets wrote a great article entitled ultimate 100m world records through extreme-value theory. The article is extremely interesting, and the aim of this post is to illustrate what was done in it, with some nice graphs (with some code written by Mohamed Amine, a student in my MAT8595 course) to visualize the inference. As in ultimate 100m world records through extreme-value theory, consider the fastest personal best times set in a given period (each athlete appears only once in the dataset). As in the paper, consider observations from 01-01-1991 until 19-06-2008. One could update the dataset to take into account Usain Bolt's 9.58 record; I'd actually love to see how the estimates change (one can also look at @tomroud's great post, published in 2009, on convexity issues (in French) about Usain Bolt's records). So, here is the data,

```> m1=read.csv("http://freakonometrics.free.fr/tm.csv",sep=";",header=TRUE)
> n=nrow(m1)
> m2=m1[,1]
> M=sort(m2,decreasing=TRUE)
> plot((1:n)/(n+1),M, ylab='Time (sec.)', xlab='Athletes',
+ main="Quantile function, 100m men records",type='l')```

The problem is that extreme value models usually deal with the maximum of a sample, not the minimum. So the idea is not to study the tail behavior of the distribution of the times, but that of the speeds (a 100 m run in M seconds corresponds to a speed of 360/M km/h),

```> X=360/M
> plot((1:n)/(n+1),X, ylab='Speed (km/h)', xlab='Athletes',
+ main="Quantile function, 100m men records",type='l')```

Now, the idea is that the associated distribution is in the max-domain of attraction of the Weibull distribution (and therefore, the support should have a finite upper bound). So we need to estimate the tail index, using an appropriate estimator such as the moment estimator introduced in Dekkers, Einmahl and de Haan (1989),

$$\hat\gamma = M^{(1)} + 1 - \frac{1}{2}\left(1-\frac{(M^{(1)})^2}{M^{(2)}}\right)^{-1}$$

with

$$M^{(j)} = \frac{1}{k}\sum_{i=0}^{k-1}\left(\log X_{n-i}-\log X_{n-k}\right)^j,\qquad j=1,2$$

where $X_{n-k}\leq\cdots\leq X_{n}$ denote the largest order statistics of the speeds.

```> gamma=rep(NA,n-2)
> a=rep(NA,n-2)
> b=rep(NA,n-2)
> for(k in 2:(n-1)){   # mind the parentheses: 2:n-1 would be 1:(n-1)
+   mn=matrix(NA,k,2)
+     for(i in 0:(k-1)){   # log-spacings of the top k order statistics
+       mn[i+1,]=cbind(log(X[n-i])-log(X[n-k]),(log(X[n-i])-log(X[n-k]))^2)
+     }
+   mn=cbind(mean(mn[,1]),mean(mn[,2]))
+   gamma[k-1]=mn[,1]+1-(1-mn[,1]^2/mn[,2])^(-1)/2  # moment estimator
+   a[k-1]=(1-min(0,-.2))*X[n-k]*mn[,1]             # scale estimate
+   b[k-1]=X[n-k]                                   # threshold (location)
+ }```

We can then plot this vector of estimates against k,

```> gammaini1=mean(gamma[50:80])
> gammaini2=mean(gamma[110:200])
> plot(gamma, xlab='k',xlim=c(29,n-2),
+ ylim=c(-.5,0),type='l',col='black',xaxt = "n")
> usr=par('usr')
> rect(usr[1],gammaini1,usr[2],gammaini2,border=NA,col="lavender")
> rect(50,gamma[50],80,gamma[80],border='red',col="light blue",lty='dotted')
> rect(110,gamma[110],200,gamma[200],border='red',col="light pink",lty='dotted')
> lines(gamma,col='black', lwd=2)
> segments(x0=c(50,80,110,200,80,50,110,200), y0=c(gamma[50],gamma[80],
+ gamma[110],gamma[200],gamma[80],gamma[50],gamma[110],gamma[200]),
+ x1=c(50,80,110,200,900,900,900,900),y1=c(1,1,1,1,gamma[80],
+ gamma[50],gamma[110],gamma[200]),lty='dotted', col='red')
> abline(h=c(gammaini1,gammaini2,-.2),lty=
+ c('dotdash','dotdash','dashed'),lwd=c(1,1,2), col=c('blue','blue','green'))
> axis(3, col="red",col.axis = "red",at=c(50,80,110,200), tick=FALSE,las=1,line=-.8,font.axis=9)
> axis(4, col="red",col.axis = "red",at=c(round(gamma[50],3),
+ round(gamma[80],3),round(gamma[110],3),round(gamma[200],3)),
+ tick=FALSE,las=3,line=-.8,font.axis=9)
> axis(2, col="red",col.axis = "blue",at=c(round(gammaini1,3),
+ round(gammaini2,3)), tick=FALSE,las=3,line=-.8,font.axis=9)
> axis(1, col="black",col.axis = "black",at=seq(0,773,by=40),
+ tick=TRUE,las=1,line=0,font.axis=1)
> axis(2, col="red",col.axis = "green",at=-.2, tick=FALSE,
+ las=3,line=1.1,font.axis=9)
> mtext("The moment estimator versus k for 100-m men",
+ 3, line=2, col="black", cex=1.2)
> mtext(expression("N.B: The blue and green lines show, respectively,
+ the initial and final estimates of "(gamma)), 1, line=2, col="black", cex=.9)```

Because of all the ties, the graph is extremely erratic. A first idea to make it smoother is to add a little noise to the times, just enough to break the ties,

```> m1=read.csv("http://freakonometrics.free.fr/tm.csv",sep=";",header=TRUE)
> n=nrow(m1)
> m2=m1[,1]+rnorm(n,sd=.005)   # tiny Gaussian noise, just to break the ties```

(and then rerun the previous code). An alternative is to use the linear interpolation suggested in the original paper,

```> M=rep(NA,n)
> for (i in 1:n){
+   idx=which(m1[,1]==m1[i,1])   # all athletes sharing this time
+   k=0
+   for(j in idx){   # spread the tied times uniformly over a .01 interval
+     k=k+1
+     M[j]=m1[i,1]-.01/2+.01*(2*k-1)/(2*length(idx))
+   }
+ }
> M=sort(M,decreasing=TRUE)
> X=360/M```

Nice, isn't it?

Here, we can clearly see that the tail index is negative, meaning that the distribution of the speeds is in the max-domain of attraction of the Weibull distribution, and thus, the support of the distribution has an upper bound.
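To see why a negative moment estimate points to a finite endpoint, here is a small self-contained simulation (not from the original paper; the toy model, sample size and choice of k are mine). Speeds drawn from a distribution with a known upper bound should produce a clearly negative estimate,

```r
# Toy check (hypothetical example): "speeds" bounded above by 38 km/h,
# drawn as 38 - Exp(2); the true tail index here is gamma = -1
set.seed(1)
Xs = sort(38 - rexp(1000, rate = 2))   # sorted simulated speeds
ns = length(Xs); k = 100               # arbitrary choice of k
L  = log(Xs[ns-(0:(k-1))]) - log(Xs[ns-k])   # log-spacings, as above
M1 = mean(L); M2 = mean(L^2)
gamma_hat = M1 + 1 - (1 - M1^2/M2)^(-1)/2    # moment estimator
gamma_hat   # expect a clearly negative value
```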

But if we want to get a proper estimate of the tail index, which k should we consider? As suggested again in the original paper, consider the idea of Beirlant, Dierckx & Guillou (2005) to derive an optimal k,

```> gammaini=seq(gammaini1,gammaini2,by=.01)
> s=length(gammaini)
> c=matrix(NA,1,s)
> amse=matrix(NA,s,1)
> amsea=matrix(NA,n-2,1)
> for(k in 2:(n-1)){
+   uh=matrix(NA,k,1)
+   p1=matrix(NA,k-1,s)
+   p2=matrix(NA,k-1,s)
+   for(j in 1:k){
+     mn=matrix(NA,j,1)
+     for(i in 0:(j-1)){
+       mn[i+1,]=log(X[n-i])-log(X[n-j])
+     }
+     uh[j]=X[n-j]*mean(mn)
+   }
+   for(i in 1:s){
+     for(j in 1:(k-1)){   # mind the parentheses: 1:k-1 would start at 0
+       z=(j+1) * log(uh[j] / uh[j+1])
+       ztilde=z-gammaini[i]
+       p1[j,i]=ztilde*(j/k)^(-gammaini[i])
+       p2[j,i]=(j/k)^(-2*gammaini[i])
+     }
+     c[,i]=sum(p1[,i])/sum(p2[,i])
+     avar=((1-gammaini[i])^2*(1-2*gammaini[i])*
+ (1-gammaini[i]+6*gammaini[i]^2)) / (k*(1-3*gammaini[i])*(1-4*gammaini[i]))
+     abiais=((1-2*gammaini[i])*c[,i]) / (gammaini[i]*(1-gammaini[i]))
+     amse[i]=avar+abiais^2
+   }
+   amsea[k-1]=mean(amse)
+ }```

To visualize this asymptotic mean square error, consider the following code

```> plot(amsea, xlab='k', ylab='Average AMSE',
+ xlim=c(29,n-2),ylim=c(0,1),type='l',xaxt = "n")
> usr=par('usr')
> rect(45,usr[3],70,amsea[45],border='red',col="light blue",lty='dotted')
> rect(101,usr[3],175,amsea[101],border='red',col="light pink",lty='dotted')
> lines(amsea,col='black', lwd=2)
> segments(x0=c(45,70,101,175), y0=c(amsea[45],
+ amsea[70],amsea[101],amsea[175]), x1=c(45,70,101,
+ 175), y1=c(2,2,2,2),lty='dotted', col='red')
> axis(3, col="red",col.axis = "red",at=c(45,70,101,175),
+ tick=FALSE,las=1,line=-.8,font.axis=9)
> axis(1, col="black",col.axis = "black",at=seq(0,771,
+ by=40), tick=TRUE,las=1,line=0,font.axis=1)
> mtext("Average AMSE of the moment estimator for 100-m men",
+ 3, line=2, col="black", cex=1.2)```

So finally, let us get back to the initial problem of estimating the endpoint of the speed distribution. We know, from extreme value theory, that when $\gamma<0$ the endpoint $x^\star$ can be estimated by

$$\hat x^\star = X_{n-k} - \frac{\hat a_{n,k}}{\hat\gamma}$$

where $\hat a_{n,k}$ is the scale estimate computed above.

The code to visualize it is simply

```> gammaf1=mean(gamma[45:70])
> gammaf2=mean(gamma[101:175])
> gammaf=(gammaf1+gammaf2)/2
> gammafA=-0.2
> x=matrix(NA,n-2,1)
> x=b-a/gammafA
> endpoint_s=mean(x[which(x<38 & x>37.6)])
> endpoint_t=100*3.6/endpoint_s
> omega=((1-gammafA)^2*(1-3*gammafA+4*gammafA^2)) /
+ (gammafA^4*(1-2*gammafA)*(1-3*gammafA)*(1-4*gammafA))
> ic_s=matrix(NA,n-2,2)
> for(k in 2:(n-1)){   # mind the parentheses: 2:n-1 would be 1:(n-1)
+   ic_s[k-1,]=cbind(x[k-1]-1.96*a[k-1]^2*omega/k,x[k-1]+1.96*a[k-1]^2*omega/k)
+ }
> ic_t=100*3.6/ic_s
> ic_t_m=mean(ic_t[,2])
> plot(x, xlab='k', ylab='Endpoint',xlim=c(29,n-1),ylim=
+ c(35.5,39.5),type='l',col='black',xaxt = "n", lwd=2)
> usr=par('usr')
> rect(0 ,37.6,usr,38,border=NA,col="lavender")
> lines(x,col='black', lwd=2)
> grid(col = "gray", lty = "dotted", lwd = .1, nx=40)
> abline(h=c(37.6,38,endpoint_s),lty=c('dotted',
+ 'dotted','dotdash'),xlim=c(600,n),lwd =c(1,1,2), col=c('blue','blue','green'))
> lines(ic_s[,1],col='red', lty='dotted', lwd=1)
> lines(ic_s[,2],col='red', lty='dotted', lwd=1)
> axis(2, col="red",col.axis = "blue",at=c(37.6,38),
+ tick=FALSE,las=3,line=-.8,font.axis=9)
> axis(1, col="black",col.axis = "black",at=seq(0,771,by=40),
+ tick=TRUE,las=1,line=0,font.axis=1)
> axis(2, col="red",col.axis = "green",at=
+ round(endpoint_s,2), tick=FALSE,las=3,line=1.1,font.axis=9)
> mtext(expression("100-m men estimated endpoint
+ (speed) in km/h with fixed "(gamma)), 3, line=1, col="black", cex=1.2)
> rect(390,39,800,39.5,border='black',col="dark gray",lty='solid')
> legend(390,39.5, legend = c("Confidence intervals limits",
+ "Final choice of the endpoint",'The estimated ultimate world records',
+ 'The estimated word records (endpoint)'),pch = 15,pt.cex = 1.5,
+ col = c("red","green",'blue','black'), bty = "n", cex = .8,
+ text.col = c("red","green",'blue','black'), horiz = FALSE,
+ x.intersp=.8,y.intersp=.8)```

And because no one really cares about the speed, let us look at the estimated endpoint for times (obtained on data prior to Usain Bolt's amazing 9.58). Actually, we're not that bad, are we?