# Multivariate probit regression using (direct) maximum likelihood estimators

May 11, 2011
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Consider a random pair
of binary responses, i.e.
with
taking values 1 or 2. Assume that probability
can be function of some covariates .

• The Gaussian vector latent structure

A standard model is based a latent Gaussian structure, i.e. there
exists some random vector
such that
if
is lower than a given threshold, and 1 otherwise.
As in standard probit models, assume that

where we can assume that
is a Gaussian
random vector
. This assumption can be used to derive the
likelihood of a sample .

`> logV=function(parameter){+ CORRELATION=parameter[1]+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))+ sigma=matrix(c(1,CORRELATION,CORRELATION,1),2,2)+     a11=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)+ for(i in 2:nrow(z)){a11=c(a11,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}+     a10=pnorm(z[1,1],sd=sqrt(sigma[1,1]))-pmnorm(z[1,],varcov=sigma)+ for(i in+ 2:nrow(z)){a10=c(a10,pnorm(z[i,1],sd=sqrt(sigma[1,1]))-pmnorm(z[i,],varcov=sigma))}+     a01=pnorm(z[1,2],sd=sqrt(sigma[2,2]))-pmnorm(z[1,],varcov=sigma)+ for(i in+ 2:nrow(z)){a01=c(a01,pnorm(z[i,2],sd=sqrt(sigma[2,2]))-pmnorm(z[i,],varcov=sigma))}+     a00=1-a10-a01-a11+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) ++      ((Y[,1]==0)&(Y[,2]==1))*log(a01) ++      ((Y[,1]==1)&(Y[,2]==0))*log(a10) ++      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )+ }> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")\$par`

(the code is a bit long since I had trouble working properly with
matrices – or more precisely to vectorize my functions – so I used
loops… I am sure it is possible to write a better code).
It is possible to generate samples (based on that specific model) to
check that we can actually derive proper maximum likelihood estimators,

`> library(mnormt)> set.seed(1)> n=1000> r=0.5> X1=runif(n)> X2=rnorm(n)> Y1S=1+5*X1> Y2S=8-5*X1> RES=rmnorm(n,mean=c(0,0),varcov=matrix(c(1,r,r,1),2,2))> YS=cbind(Y1S,Y2S)+RES> Y1=(YS[,1]>quantile(YS[,1],.5))*1> Y2=(YS[,2]>quantile(YS[,2],.5))*1> base=data.frame(i,Y1,Y2,X1,X2,YS)> head(base)  i Y1 Y2        X1          X2      Y1S      Y2S1 1  0  0 0.2655087  0.07730312 3.177587 5.5338842 2  0  0 0.3721239 -0.29686864 1.935307 5.0895243 3  1  0 0.5728534 -1.18324224 4.757848 5.1725844 4  1  0 0.9082078  0.01129269 4.600029 3.8782255 5  0  1 0.2016819  0.99160104 2.547362 6.7437146 6  1  0 0.8983897  1.59396745 5.309974 4.421523`

(the two columns on the right are latent observations, that cannot
be used since theoretically they are unobservable). Note that it is a simple regression, one of the
component is here only to bring some noise. First of all, let us look
at marginal probit regression

`>  reg1=glm(Y1~X1+X2,data=base,family=binomial)>  reg2=glm(Y2~X1+X2,data=base,family=binomial)> summary(reg1) Call:glm(formula = Y1 ~ X1 + X2, family = binomial, data = base) Deviance Residuals:Min        1Q    Median        3Q       Max-2.90570  -0.50126  -0.00266   0.49162   2.78256 Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) -4.291725   0.267149 -16.065   <2e-16 X1           8.656836   0.510153  16.969   <2e-16 ***X2           0.007375   0.090530   0.081    0.935---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1(Dispersion parameter for binomial family taken to be 1)Null deviance: 1386.29  on 999  degrees of freedomResidual deviance:  726.48  on 997  degrees of freedomAIC: 732.48Number of Fisher Scoring iterations: 5> summary(reg2)Call:glm(formula = Y2 ~ X1 + X2, family = binomial, data = base)Deviance Residuals:Min        1Q    Median        3Q       Max-2.74682  -0.51814  -0.00001   0.57969   2.58565Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept)  3.91709    0.24399  16.054   <2e-16 ***X1          -7.89703    0.46277 -17.065   <2e-16 ***X2           0.18360    0.08758   2.096    0.036 *---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1(Dispersion parameter for binomial family taken to be 1)Null deviance: 1386.29  on 999  degrees of freedomResidual deviance:  777.61  on 997  degrees of freedomAIC: 783.61Number of Fisher Scoring iterations: 5`

Here, the optimization yields,

`> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")\$par> OPT[1][1] 0.5261382> matrix(OPT[2:7],2,3)          [,1]      [,2]       [,3][1,] -2.451721  4.908633 0.01600769[2,]  2.241962 -4.539946 0.10614807`

Note that the coefficients we have obtained are almost identical to
the ones obtained with R standard procedure,

`>  library(Zelig)>  REG= zelig(list(mu1=Y1~X1+X2,+             mu2=Y2~X1+X2,+     rho=~1),+     model="bprobit",data=base)>  summary(REG) Call:zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,    rho = ~1), model = "bprobit", data = base) Pearson Residuals:                 Min        1Q     Median      3Q     Maxprobit(mu1) -10.5442 -0.377243  0.0041803 0.36709 8.60398probit(mu2)  -7.8547 -0.376888  0.0083715 0.42923 5.88264rhobit(rho) -13.8322 -0.091502 -0.0080544 0.37218 0.85101 Coefficients:                  Value Std. Error   t value(Intercept):1 -2.451699   0.135369 -18.11116(Intercept):2  2.241964   0.125072  17.92536(Intercept):3  1.169461   0.189771   6.16249X1:1           4.908617   0.252683  19.42602X1:2          -4.539951   0.233632 -19.43203X2:1           0.015992   0.050443   0.31703X2:2           0.106154   0.049092   2.16235 Number of linear predictors:  3 Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho) Dispersion Parameter for binom2.rho family:   1 Residual Deviance: 1460.355 on 2993 degrees of freedom Log-likelihood: -730.1774 on 2993 degrees of freedom Number of Iterations: 3> matrix(coefficients(REG)[c(1:2,4:7)],2,3)          [,1]      [,2]       [,3][1,] -2.451699  4.908617 0.01599183[2,]  2.241964 -4.539951 0.10615443`

The correlation here is also the same

```> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)[1] 0.5260951

That procedure works well an can be extended to ordinal responses
(not only binary ones, or to three dimensional problems,

logV=function(beta){BETA=matrix(beta[4:(3+ncol(Y)*ncol(X))],ncol(Y),ncol(X))z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]),X%*%(BETA[3,]))r12=beta[1]r23=beta[2]r31=beta[3]s1=s2=s3=1sigma=matrix(c(s1^2,r12*s1*s2,r31*s1*s3,               r12*s1*s2,s2^2,r23*s2*s3,               r31*s1*s3,r23*s2*s3,s3^2),3,3)sigma1=matrix(c(s2^2,r23*s2*s3,                r23*s2*s3,s3^2),2,2)sigma2=matrix(c(s1^2,r31*s1*s3,                r31*s1*s3,s3^2),2,2)sigma3=matrix(c(s1^2,r12*s1*s2,                r12*s1*s2,s2^2),2,2)    a111=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)for(i in 2:nrow(z)){a111=c(a111,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}    a011=pmnorm(z[1,2:3],varcov=sigma1)-pmnorm(z[1,],varcov=sigma)for(i in 2:nrow(z)){a011=c(a011,pmnorm(z[i,2:3],varcov=sigma1)-pmnorm(z[i,],varcov=sigma))}    a101=pmnorm(z[1,c(1,3)],varcov=sigma2)-pmnorm(z[1,],varcov=sigma)for(i in 2:nrow(z)){a101=c(a101,pmnorm(z[i,c(1,3)],varcov=sigma2)-pmnorm(z[i,],varcov=sigma))}    a110=pmnorm(z[1,1:2],varcov=sigma3)-pmnorm(z[1,],varcov=sigma)for(i in 2:nrow(z)){a110=c(a110,pmnorm(z[i,1:2],varcov=sigma3)-pmnorm(z[i,],varcov=sigma))}    a100=pnorm(z[1,1],sd=s1)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)for(i in 2:nrow(z)){a100=c(a100,pnorm(z[i,1],sd=s1)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}    a010=pnorm(z[1,2],sd=s2)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(2,3)],varcov=sigma1)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)for(i in 2:nrow(z)){a010=c(a010,pnorm(z[i,2],sd=s2)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(2,3)],varcov=sigma1)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}    a001=pnorm(z[1,3],sd=s3)-pmnorm(z[1,c(2,3)],varcov=sigma1)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)for(i in 2:nrow(z)){a001=c(a001,pnorm(z[i,3],sd=s3)-pmnorm(z[i,c(2,3)],varcov=sigma1)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}    a000=1-a111-a011-a101-a110-a001-a010-a100 a111[a111<=0]=1e-50a110[a110<=0]=1e-50a101[a101<=0]=1e-50a011[a011<=0]=1e-50a100[a100<=0]=1e-50a010[a010<=0]=1e-50a001[a001<=0]=1e-50a000[a000<=0]=1e-50 -sum(((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==0))*log(a111) +     ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==0))*log(a011) +     ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==0))*log(a101) +     ((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==1))*log(a110) +     ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==0))*log(a001) +     ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==1))*log(a010) +     ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==1))*log(a100) +     ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==1))*log(a000) )}

A strong assumption in that
bivariate model is that residuals have a Gaussian structure. It is
possible to change that assumption

marginally: for instance if we use a logistic cumulative
distribution function, then we will have a bivariate logit regression
in terms of dependence structure: it is possible to consider
another copula than the gaussian one,
e.g. Gumbel’s copula (also called the bivariate
logistic copula), or Clayton’s

Here, the following code can be used to extend the model to non
Gaussian structures,

> F=function(x,r){pmnorm(x,rep(0,length(x)),+                 varcov=matrix(c(1,r,r,1),2,2))}> Fx=function(x1){F(c(x1,1e40),0)}> Fy=function(x2){Fx(x2)}> > logVgen=function(parameter){+ CORRELATION=parameter[1]+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))+     a11=F(z[1,],r=CORRELATION)+ for(i in 2:nrow(z)){a11=c(a11,F(z[i,],r=CORRELATION))}+     a10=Fx(z[1,1])-F(z[1,],r=CORRELATION)+ for(i in 2:nrow(z)){a10=c(a10,Fx(z[i,1])-F(z[i,],r=CORRELATION))}+     a01=Fy(z[1,2])-F(z[1,],r=CORRELATION)+ for(i in 2:nrow(z)){a01=c(a01,Fy(z[i,2])-F(z[i,],r=CORRELATION))}+     a00=1-a10-a01-a11+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) ++      ((Y[,1]==0)&(Y[,2]==1))*log(a01) ++      ((Y[,1]==1)&(Y[,2]==0))*log(a10) ++      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )+ }>> beta0=c(0,1,1,1,1,1,1)> (OPT=optim(fn=logVgen,par=beta0,method="BFGS")\$par)[1]  0.52613820 -2.45172059  2.24196154  4.90863292 -4.53994592  0.01600769[7]  0.10614807There were 23 warnings (use warnings() to see them)

E.g.

> library(copula)> F=function(x,r){pcopula(pnorm(x),               claytonCopula(2, r))}> Fx=function(x1){F(c(x1,1e40),0)}> Fy=function(x2){Fx(x2)}

An application to school tests

Consider the following dataset,

with here maths versus writing, with girls in red and boys in blue, where variables here are
female :    0: male    1: female  race :    1: hispanic    2: asian    3: african-amer    4: white  ses :    1: low    2: middle    3: high  schtyp : type of school    1: public    2: private  prog : type of program    1: general    2: academic    3: vocation  read : reading score  write : writing score  math : math score  science : science score  socst : social studies score
We can try to understand correlation between math and writing skills.
Covariates can be the sex of the child, and his reading skills. The question will
then be: are good students in maths and writing simply students that

Here the code is simply

> W=hsb2\$write>=50> M=hsb2\$math>=50> base=data.frame(Y1=W,Y2=M,+             X1=hsb2\$female,X2=hsb2\$read)>> library(Zelig)> REG= zelig(list(mu1=Y1~X1+X2,+             mu2=Y2~X1+X2,+     rho=~1),+     model="bprobit",data=base)> summary(REG) Call:zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,    rho = ~1), model = "bprobit", data = base) Pearson Residuals:                Min        1Q  Median      3Q    Maxprobit(mu1) -4.7518 -0.502594 0.15038 0.53038 1.8592probit(mu2) -3.4243 -0.653537 0.23673 0.67011 2.6072rhobit(rho) -4.9821  0.010481 0.13500 0.40776 2.9171 Coefficients:                  Value Std. Error  t value(Intercept):1 -5.484711   0.787101 -6.96825(Intercept):2 -4.061384   0.633781 -6.40818(Intercept):3  1.332187   0.322175  4.13497X1:1           1.125924   0.233550  4.82092X1:2           0.167258   0.202498  0.82598X2:1           0.103997   0.014662  7.09286X2:2           0.082739   0.012026  6.88017 Number of linear predictors:  3 Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho) Dispersion Parameter for binom2.rho family:   1 Residual Deviance: 364.51 on 593 degrees of freedom Log-likelihood: -182.255 on 593 degrees of freedom Number of Iterations: 3> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)[1] 0.5824045

with a remaining correlation among residuals of 0.58. So with only the
sex of the student, and his or her reading skill, we cannot explain the
correlation between maths and writing skills. With our previous code,
we have here

> beta0=c((exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1),+      summary(REG)@coef3[c(1:2,4:7),1])> beta0              (Intercept):1 (Intercept):2          X1:1          X1:20.58240446   -5.48471133   -4.06138412    1.12592427    0.16725842X2:1          X2:20.10399668    0.08273879> (OPT=optim(fn=logV,par=beta0,method="BFGS")\$par)(Intercept):1 (Intercept):2          X1:1          X1:20.5824045    -5.4847113    -4.0613841     1.1259243     0.1672584X2:1          X2:20.1039967     0.0827388

i.e. we obtain (almost) exactly the same estimators. But here I have
used as starting values for the optimization procedure the estimators
given by R. If we change them, hopefully we have a robust maximum
likelihood estimator,

> (OPT=optim(fn=logV,par=beta0/2,method="BFGS")\$par)              (Intercept):1 (Intercept):2          X1:1          X1:2   0.58233360   -5.49428984   -4.06839571    1.12696594    0.16760347         X2:1          X2:2   0.10417767    0.08287409There were 12 warnings (use warnings() to see them)

So once again, it is possible to optimize numerically a likelihood function, and it works.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' };

(function(d, t) {
var s = d.createElement(t); s.type = 'text/javascript'; s.async = true;
var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r);
}(document, 'script'));

Related
ShareTweet

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or  here if you don't.