# Non-PD Matrices in R, Cont.

December 3, 2011
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let me preface this post by saying I am getting frustrated with R.  The syntax is not intuitive and the performance for matrix operations is slow.  Using a free Matlab clone, I can get over 6 Gflops on things that R is doing at less than 2.  After this post, I will focus on the statistical functions of R and will do any matrix operations in Octave, or maybe NumPy.

First, let’s look at a situation where you find your self with a non-positive semi-definite (PSD) matrix.  Remember that the default chol() requires a positive definite (PD) matrix and we created a function that can handle a PSD matrix.  Imagine if you have 3 time series with different sampling intervals — in our example we will have one that samples every day, another that samples M-W-F-Sa-Su, and one that samples T-R (Thursday) – Sa.  Further we only have a limited set of data – 3 weeks in our sample.  To create a correlation matrix, we would be forced to either only consider Saturdays (where all 3 series intersect), or to compute the correlations pairwise.  With only 3 intersection points, we would probably choose the pairwise method to attempt to get a better handle on the true underlying process.

We will use SAS IML to construct the series and PROC CORR from SAS to calculate the pairwise correlations.

proc iml;
/*True covariance of the system*/
C1 = {1 .5 0,
.5 1 .25,
0 .25 1};

r = root(C1)`;n=21;
out = J(n,3,.);

/*Generate 21 days worth of data*/

do i=1 to n;
hold = J(1,3);
do j=1 to 3;
hold[j] = rannor(2);
end;

hold = r * hold`;
out[i,] = hold`;

end;

create out from out;
append from out;
close out;
quit;

/*Filter out the unobserved values*/
data out;
format date date9.;
set out;
retain date “04DEC2011″d;
if _n_ ^=1 then   date = date + 1;

if ^(weekday(date) in (2, 4, 6, 7, 1)) then   col2 = .;

if ^(weekday(date) in (3, 5, 7)) then   col3 = .;
run;

title “Observed Data”;
proc print data=out noobs;
run;

title “Pairwise Correlation, observed”;
proc corr data=out(drop=date)
out=corr_out(where=(_type_ = “CORR”))
;run;

data corr_out;
set corr_out(drop=_type_ _name_);
run;

title “Correlation Eigenvalues”;
proc iml;
use corr_out;
close corr_out; e = eigval(corr); print e;
quit;

Here is the output:

 Observed Data
 date COL1 COL2 COL3 04DEC2011 1.31183 0.08083 . 05DEC2011 2.04198 2.26740 . 06DEC2011 0.55341 . -0.86957 07DEC2011 -0.29515 0.40556 . 08DEC2011 -0.11603 . 0.24319 09DEC2011 1.24763 0.92012 . 10DEC2011 0.65947 -0.30911 -0.28241 11DEC2011 0.69131 0.72022 . 12DEC2011 1.83849 0.31602 . 13DEC2011 -0.35605 . -0.16218 14DEC2011 -0.96497 -2.40601 . 15DEC2011 -0.42313 . -0.16670 16DEC2011 0.36461 -0.44278 . 17DEC2011 -1.97964 -0.47143 1.70729 18DEC2011 0.56940 0.68279 . 19DEC2011 -0.72632 -1.30090 . 20DEC2011 0.51960 . -0.15246 21DEC2011 -0.73427 -0.13240 . 22DEC2011 0.34385 . -0.79128 23DEC2011 0.00112 -1.70289 . 24DEC2011 -0.38676 -0.74608 -0.16191

 Pairwise Correlation, observed

The CORR Procedure
 3 Variables: COL1 COL2 COL3

 Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum COL1 21 0.19811 0.96281 4.16039 -1.97964 2.04198 COL2 15 -0.14124 1.14684 -2.11865 -2.40601 2.26740 COL3 9 -0.07067 0.74955 -0.63602 -0.86957 1.70729

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
COL1
COL2
COL3
COL1
 1.00000 21
 0.66712 0.0066 15
 -0.87999 0.0017 9
COL2
 0.66712 0.0066 15
 1.00000 15
 0.09317 0.9406 3
COL3
 -0.87999 0.0017 9
 0.09317 0.9406 3
 1.00000 9

 Correlation Eigenvalues

e
2.0606421
1.0896653
-0.150307

The Matrix package from R has a function called nearPD().  This function has some nice features such as requiring the diagonal to remain constant (useful for a covariance matrix) and a flag to make sure the output is still a correlation matrix.  Not sure what the difference between those two would be if I used a correlation matrix as the input…  The output also include the new eigenvalues, and the Frobenius Norm (a measure of closeness for matrices).

Rebonato and Jackel (1999) define 2 methods in their paper for finding a PSD matrix.  The first is rather involved and the second, the Spectral Decomposition, is very compact.  The Spectral Decomposition is what we will emulate.  Take a second to go read that section (pp 7 – 9).

Here it is in R:

nearPSD = function(c){
n = dim(c)[1]
e = eigen(c,sym=TRUE)
val = e\$values * (e\$values > 0)
vec = e\$vectors
T = sqrt(diag( as.vector(1/(vec^2 %*% val)),n,n))
B = T %*% vec %*% diag(as.vector(sqrt(val)),n,n)
out = B %*% t(B)
return(out)
}

Using the example from the paper, we see this duplicates their results:

> c
[,1] [,2] [,3]
[1,]  1.0  0.9  0.7
[2,]  0.9  1.0  0.3
[3,]  0.7  0.3  1.0

> nearPSD(c)
[,1]      [,2]      [,3]
[1,] 1.0000000 0.8940244 0.6963191
[2,] 0.8940244 1.0000000 0.3009690
[3,] 0.6963191 0.3009690 1.0000000

The nearPD() function gives us something very close to the above:

> nearPD(c,corr=TRUE)
\$mat
3 x 3 Matrix of class “dpoMatrix”
[,1]      [,2]      [,3]
[1,] 1.0000000 0.8945753 0.6966207
[2,] 0.8945753 1.0000000 0.3025436
[3,] 0.6966207 0.3025436 1.0000000
\$eigenvalues
[1] 2.291883e+00 7.081174e-01 2.291883e-08
\$corr
[1] TRUE
\$normF
[1] 0.009727988
\$iterations
[1] 13
\$rel.tol
[1] 7.667644e-08
\$converged
[1] TRUE
attr(,”class”)
[1] “nearPD”

So what is more efficient, called nearPD() and using the built in chol() function, or calling our nearPSD() and calling the my_chol() function from before?

n = 1000
c = matrix(.9,n,n)
for( i in 1:n){
c[i,i] = 1
}
c[1,2] = 0
c[2,1] = 0
c[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,]  1.0  0.0  0.9  0.9  0.9
[2,]  0.0  1.0  0.9  0.9  0.9
[3,]  0.9  0.9  1.0  0.9  0.9
[4,]  0.9  0.9  0.9  1.0  0.9
[5,]  0.9  0.9  0.9  0.9  1.0

e = eigen(c,sym=TRUE)
min(e\$values)
[1] -0.7982018

s = Sys.time()
pd = nearPD(c,corr=TRUE)
pd = as.matrix(pd\$mat)
rpd = chol(pd)
e = Sys.time()
e – s
Time difference of 45.659 secs

pd[1:5,1:5]
[,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.7934317 0.8984058 0.8984058 0.8984058
[2,] 0.7934317 1.0000000 0.8984058 0.8984058 0.8984058
[3,] 0.8984058 0.8984058 1.0000000 0.9000032 0.9000032
[4,] 0.8984058 0.8984058 0.9000032 1.0000000 0.9000032
[5,] 0.8984058 0.8984058 0.9000032 0.9000032 1.0000000

s = Sys.time()
psd = nearPSD(c)
rpsd = my_chol(psd,50)
e = Sys.time()
e – s
Time difference of 7.031 secs

psd[1:5,1:5]
[,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.2848481 0.7604250 0.7604250 0.7604250
[2,] 0.2848481 1.0000000 0.7604250 0.7604250 0.7604250
[3,] 0.7604250 0.7604250 1.0000000 0.9000002 0.9000002
[4,] 0.7604250 0.7604250 0.9000002 1.0000000 0.9000002
[5,] 0.7604250 0.7604250 0.9000002 0.9000002 1.0000000

norm(c – pd ,type=”F”)
[1] 1.126598

norm(c – psd,type=”F”)
[1] 8.827865

So the nearPSD method is much much faster. This is due to the fact that the nearPD method is iterative.  However the “difference” between the output matrix is larger. Qualitatively, the nearPD drastically increases the correlation between 1 and 2 — from 0 to .79. The nearPSD keeps that correlation lower, .285, at the expense of the correlation between 1 and 2 and the rest of the values.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.