Playing with quantiles, part 1

March 8, 2011

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

A standard idea in extreme value theory (see e.g. here, in French unfortunately) is that to estimate the 99.5% quantile (say), we just need to estimate a quantile of level 95% for observations exceeding the 90% quantile: an observation exceeds the 99.5% quantile with probability 0.5% = 10% × 5%.

In extreme value theory, we assume that the 90% quantile (of the initial distribution) can be obtained easily, e.g. as the empirical quantile. Then, to the observations exceeding it, we fit a Pareto distribution (a Generalized Pareto one, to be precise) and read off the parametric quantile of level 95%. I.e.

http://freakonometrics.blog.free.fr/public/perso2/quant01.gif

which can be written

http://freakonometrics.blog.free.fr/public/perso2/quant02.gif

So, an estimation of the cumulative distribution function is

http://freakonometrics.blog.free.fr/public/perso2/quant03.gif

and if we invert it, we get the popular expression for high level
quantiles,

http://freakonometrics.blog.free.fr/public/perso2/quant04b.gif
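The four formulas above are images in the original post. In standard peaks-over-threshold notation they would read roughly as follows. The symbols here are assumptions and may differ from those in the figures: threshold $u$, Generalized Pareto parameters $\sigma$ and $\xi$, sample size $n$, and number of exceedances $N_u$.

```latex
\begin{align*}
\mathbb{P}(X > x) &= \mathbb{P}(X > u)\cdot\mathbb{P}(X > x \mid X > u), \qquad x > u,\\
\mathbb{P}(X > x) &\approx \mathbb{P}(X > u)\cdot\left(1 + \xi\,\frac{x-u}{\sigma}\right)^{-1/\xi},\\
\widehat{F}(x) &= 1 - \frac{N_u}{n}\left(1 + \widehat{\xi}\,\frac{x-u}{\widehat{\sigma}}\right)^{-1/\widehat{\xi}},\\
\widehat{q}_p &= u + \frac{\widehat{\sigma}}{\widehat{\xi}}\left[\left(\frac{n}{N_u}\,(1-p)\right)^{-\widehat{\xi}} - 1\right].
\end{align*}
```

With $N_u/n = 10\%$ and $p = 99.5\%$, the term $\frac{n}{N_u}(1-p)$ equals $5\%$, i.e. the quantile of level 95% of the fitted Generalized Pareto distribution.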

Hence, we do not really care about observations in the core of the
distribution.
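As an illustration, here is a minimal sketch of that estimation in base R. It is not the code of the original post. The simulated Pareto sample, the helper gpd_nll and the use of optim for the maximum likelihood fit are all assumptions made for the example.

```r
set.seed(1)
n <- 10000
x <- runif(n)^(-1/2)                    # Pareto sample: P(X > x) = x^(-2) for x >= 1
p <- 0.995                              # level of the quantile we want
u <- quantile(x, 0.90, names = FALSE)   # threshold: empirical 90% quantile
z <- x[x > u] - u                       # excesses over the threshold
Nu <- length(z)

# Negative log-likelihood of the Generalized Pareto distribution
# (hypothetical helper; sigma is parametrised on the log scale to stay positive)
gpd_nll <- function(par, z) {
  sigma <- exp(par[1])
  xi <- par[2]
  w <- 1 + xi * z / sigma
  if (any(w <= 0)) return(1e10)         # outside the support: penalise
  if (abs(xi) < 1e-8) return(length(z) * log(sigma) + sum(z) / sigma)
  length(z) * log(sigma) + (1 / xi + 1) * sum(log(w))
}

fit <- optim(c(log(mean(z)), 0.1), gpd_nll, z = z)
sigma <- exp(fit$par[1])
xi <- fit$par[2]

# High-level quantile: threshold plus the GPD quantile of level 1 - (n/Nu)(1-p)
q_hat <- u + sigma / xi * (((n / Nu) * (1 - p))^(-xi) - 1)
q_true <- (1 - p)^(-1/2)                # known here because the sample is simulated
c(estimate = q_hat, true = q_true, empirical = quantile(x, p, names = FALSE))
```

Only the exceedances z and the count Nu enter the fit, so observations in the core of the distribution play no role beyond fixing the threshold.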

I was wondering whether this idea can be transposed to quantile regression. I would like to get a quantile regression of level 90% (say) of http://freakonometrics.blog.free.fr/public/perso2/qqq06.gif
given http://freakonometrics.blog.free.fr/public/perso2/qqqo5.gif,
based on observations http://freakonometrics.blog.free.fr/public/perso2/qqq04.gif,
when all observations such that http://freakonometrics.blog.free.fr/public/perso2/qqq07.gif for
some http://freakonometrics.blog.free.fr/public/perso2/qqq08.gif are
missing. More precisely,
I have the following sample (here half of the observations are missing),

Assume we know that the observations I have are those below the http://freakonometrics.blog.free.fr/public/perso2/qqq06.gif
quantile of level 25%, or above the http://freakonometrics.blog.free.fr/public/perso2/qqq06.gif
quantile of level 75%.
Since half of the sample is missing, the 5% and 95% quantile regressions on the full sample correspond to the 10% and 90% quantile regressions on the truncated sample, and the
code is simply,

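library(mnormt)    # rmnorm(): multivariate Gaussian simulation
library(quantreg)  # rq(): quantile regression
library(splines)   # bs(): B-spline basis, for the spline variant
set.seed(1)
mu=c(0,0)
r=0                # correlation between X and Y
Sigma <- matrix(c(1,r,r,1), 2, 2)
Z=rmnorm(2500,mu,Sigma)
X=Z[,1]
Y=Z[,2]

base=data.frame(X,Y)
plot(X,Y,col="blue",cex=.7)
# I flags the "missing" observations: Y between its 25% and 75% quantiles
I=(Y>qnorm(.25))&(Y<qnorm(.75))
baseI=base[!I,]
points(X[I],Y[I],col="light blue",cex=.7)
abline(h=qnorm(.25),lty=2,col="blue")
abline(h=qnorm(.75),lty=2,col="blue")
u=seq(-5,5,by=.02)
# lower tail, full sample: level 5% (dotted line)
reg=rq(Y~X,data=base,tau=.05)
lines(u,predict(reg,newdata=data.frame(X=u)),lty=2)
# lower tail, truncated sample: level 5% becomes 10% (plain line)
reg=rq(Y~X,data=baseI,tau=.05*2)
lines(u,predict(reg,newdata=data.frame(X=u)))
# upper tail: level 95% on the full sample becomes 90% on the truncated one
reg=rq(Y~X,data=base,tau=1-.05)
lines(u,predict(reg,newdata=data.frame(X=u)),lty=2)
reg=rq(Y~X,data=baseI,tau=1-.05*2)
lines(u,predict(reg,newdata=data.frame(X=u)))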
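The factor 2 applied to the level comes from the share of the sample that is kept (one half). A small base R check of that rescaling on unconditional quantiles, as a sketch with illustrative variable names that are not in the original post:

```r
set.seed(1)
Y <- rnorm(2500)
# Keep only observations outside the band between the 25% and 75% quantiles
keep <- !((Y > qnorm(.25)) & (Y < qnorm(.75)))
Ytrunc <- Y[keep]

tau <- 0.05
# Levels 5% and 95% on the full sample
q_full <- quantile(Y, c(tau, 1 - tau), names = FALSE)
# Rescaled levels 10% and 90% on the truncated sample
q_trunc <- quantile(Ytrunc, c(2 * tau, 1 - 2 * tau), names = FALSE)

mean(keep)                        # share of the sample kept, close to one half
rbind(full = q_full, truncated = q_trunc)
```

The two rows should nearly coincide, since removing the middle of the sample leaves the order of the extreme observations unchanged.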

The graph is the following

Dotted black lines are the theoretical lines (fitted as if I had all
observations), and plain lines are fitted on the truncated sample (where half of the sample is
missing). Instead of a standard linear quantile regression, it is also
possible to try a spline regression,
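The spline version is not shown in this text. A minimal sketch, assuming a default cubic B-spline basis from bs() and independent Gaussian samples drawn with rnorm rather than rmnorm, could look like this:

```r
library(quantreg)  # rq(): quantile regression
library(splines)   # bs(): B-spline basis

set.seed(1)
n <- 2500
X <- rnorm(n)
Y <- rnorm(n)                      # independent of X, as in the case r = 0
base <- data.frame(X, Y)
I <- (Y > qnorm(.25)) & (Y < qnorm(.75))
baseI <- base[!I, ]                # truncated sample

# Prediction grid restricted to the X-range of the truncated sample,
# so that neither spline basis is evaluated beyond its boundary knots
u <- seq(min(baseI$X), max(baseI$X), length.out = 200)

reg_full  <- rq(Y ~ bs(X), data = base,  tau = .05)
reg_trunc <- rq(Y ~ bs(X), data = baseI, tau = .05 * 2)
fit_full  <- predict(reg_full,  newdata = data.frame(X = u))
fit_trunc <- predict(reg_trunc, newdata = data.frame(X = u))

plot(X, Y, col = "blue", cex = .7)
lines(u, fit_full, lty = 2)        # full sample, level 5%
lines(u, fit_trunc)                # truncated sample, level 10%
```

The choice of basis (degree, number of knots) is left at the bs() defaults here and would need tuning on real data.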

So obviously, if I miss something in the middle, that's no big deal: dotted and plain lines are here extremely close.
But what if http://freakonometrics.blog.free.fr/public/perso2/qqqo5.gif and
http://freakonometrics.blog.free.fr/public/perso2/qqq06.gif
were correlated? Consider a Gaussian random vector http://freakonometrics.blog.free.fr/public/perso2/qqq09.gif with
correlation http://freakonometrics.blog.free.fr/public/perso2/qqq10.gif
(here 0.6).

It looks like we overestimate the slope for high quantiles, but not for
lower quantiles. So if observations are correlated, we have to be
cautious with that technique.
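One way to see why correlation matters (a sketch, not taken from the original post): doubling the level assumes that half of the observations are missing whatever the value of X. For a standard bivariate Gaussian vector with correlation r, Y given X = x is Gaussian with mean r x and variance 1 - r^2, so the share hidden by the band can be computed directly. The function name share_missing is hypothetical.

```r
# Share of observations with a < Y < b, conditional on X = x,
# for a standard bivariate Gaussian vector with correlation r
share_missing <- function(x, r, a = qnorm(.25), b = qnorm(.75)) {
  s <- sqrt(1 - r^2)               # conditional standard deviation of Y given X
  pnorm((b - r * x) / s) - pnorm((a - r * x) / s)
}

x <- c(-2, -1, 0, 1, 2)
round(share_missing(x, r = 0), 3)    # constant in x: always one half
round(share_missing(x, r = 0.6), 3)  # depends on x once r is not zero
```

When r is not zero the missing share changes with x, so a single rescaling of the level by 2 no longer matches the conditional quantile at every x.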
But why could that be interesting? Well, because I wanted to run a
quantile regression on marathon results. But I could not get the
overall dataset (since I had to import observations manually, and I have to
admit that it was a bit tedious). So I extracted the finish times of the first
10% of athletes, and of the last 10%. And I was
wondering whether that was enough to look at the 5% and 95% quantiles, based
on the age of the runner… To be continued.
