Strange behavior of correlation estimation


The Gaussian vector is extremely interesting since it remains Gaussian when conditioning. More precisely, if $\boldsymbol{X}=(\boldsymbol{X}_1,\boldsymbol{X}_2)$ is a Gaussian random vector, then the conditional distribution of $\boldsymbol{X}_1\mid\boldsymbol{X}_2=\boldsymbol{x}_2$ is also Gaussian. Further, it is possible to derive its covariance matrix explicitly, and, interestingly, that matrix does not depend on the conditioning value. Here, it is

$\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$

where the covariance matrix of the random vector, written in block form, is

$\Sigma=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix}$

i.e. it is the Schur complement of block $\Sigma_{22}$ in $\Sigma$. For instance, consider a centered Gaussian vector in dimension 3, with covariance matrix

$\Sigma=\begin{pmatrix}1&0.6&0.7\\ 0.6&1&0.8\\ 0.7&0.8&1\end{pmatrix}$

It is possible to derive explicitly the correlation of $(X_1,X_2)$ conditional on $X_3$,

> library(mnormt)
> SIGMA=matrix(c(1,.6,.7,.6,1,.8,.7,.8,1),3,3)
> SIGMA11=SIGMA[1:2,1:2]
> SIGMA21=SIGMA[3,1:2]
> SIGMA12=SIGMA[1:2,3]
> SIGMA22=SIGMA[3,3]
> S=SIGMA11-SIGMA12%*%solve(SIGMA22)%*%SIGMA21
> S[1,2]/sqrt(S[1,1]*S[2,2])
[1] 0.09335201
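Indeed, plugging the blocks of $\Sigma$ into the Schur complement above (here $\Sigma_{22}=1$, so its inverse drops out),

$\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}=\begin{pmatrix}1&0.6\\ 0.6&1\end{pmatrix}-\begin{pmatrix}0.7\\ 0.8\end{pmatrix}\begin{pmatrix}0.7&0.8\end{pmatrix}=\begin{pmatrix}0.51&0.04\\ 0.04&0.36\end{pmatrix}$

and the conditional correlation is $0.04/\sqrt{0.51\times 0.36}\approx 0.0934$, which is the value obtained numerically.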

This concept is discussed in a paper on conditional correlation by Kunihiro Baba, Ritei Shibata and Masaaki Sibuya. But what about statistical inference? In order to see what happens, let us run some simulations, and condition on $X_3$ lying between two consecutive deciles,

> k=10
> X=rmnorm(n=10000000,varcov=SIGMA)
> QZ=qnorm((0:k)/k)
> ZQ=cut(X[,3],QZ)
> LZ=levels(ZQ)
> correl=rep(NA,k)
> for(i in 1:k){
+ 	correl[i]=cor(X[ZQ==LZ[i],1:2])[1,2]
+ }
> barplot(correl,names.arg=LZ,col="light blue",ylim=range(correl))

or on some percentiles, re-running the same code with

> k=100

In the middle it looks fine (we can look at it more carefully and fit a spline curve), but for very large values of $X_3$ we observe strange patterns (the graph below is based on 100 million simulations),
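To look at the middle part more carefully, here is a minimal sketch of the spline idea (using the correl vector computed in the loop above, with k=100; the variable mid and the use of smooth.spline are just one possible choice, not necessarily what was used for the graphs):

> mid=qnorm(((1:k)-.5)/k)          # mid-probability quantile of each conditioning interval
> plot(mid,correl,xlab="quantile of X3",ylab="conditional correlation")
> lines(smooth.spline(mid,correl),col="red",lwd=2)   # fitted spline curve
> abline(h=0.09335201,lty=2)       # theoretical value derived above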

Note that for all estimates (each bar), there is exactly the same number of observations, since when conditioning we keep only observations such that $X_3$ is between two consecutive percentiles (hence, we always keep 1% of the data). If anyone has any idea about what happens, I’d be glad to get explanations… At first, I thought it would be a speed of convergence issue… But if we look at the impact of the number of simulations used in the fifth decile,

(up to 1 million simulations, i.e. 100,000 conditional observations used to estimate the correlation). If we look at the third decile, it converges,

for the second one, it might be a bit slower,

but for the first decile… either it is extremely slow, or it has a strong bias. But I don’t get why…
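For the record, a minimal sketch of such a convergence check for the first decile could look as follows, re-estimating the correlation on growing prefixes of a single sample (the grid of sample sizes NS is an arbitrary choice, and this is only a sketch, not the exact code behind the graphs):

> X=rmnorm(n=1000000,varcov=SIGMA)
> NS=seq(50000,1000000,by=50000)   # increasing numbers of simulations
> CV=rep(NA,length(NS))
> for(j in 1:length(NS)){
+ 	XS=X[1:NS[j],]
+ 	I=XS[,3]<=qnorm(.1)        # observations with X3 in the first decile
+ 	CV[j]=cor(XS[I,1],XS[I,2])
+ }
> plot(NS,CV,type="b",xlab="number of simulations",ylab="conditional correlation, first decile")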
