Multivariate probit regression using (direct) maximum likelihood estimators

Consider a random pair $(Y_1,Y_2)$ of binary responses, i.e. $Y_j\in\{0,1\}$ with $j$ taking values 1 or 2. Assume that the probability $\mathbb{P}(Y_j=1)$ can be a function of some covariates $\boldsymbol{X}$.
  • The Gaussian vector latent structure
A standard model is based on a latent Gaussian structure, i.e. there exists some latent random vector $(Z_1,Z_2)$ such that $Y_j=0$ if $Z_j$ is lower than a given threshold, and $Y_j=1$ otherwise.
As in standard probit models, assume that
$$Z_j=\boldsymbol{X}^\top\boldsymbol{\beta}_j+\varepsilon_j,\qquad j=1,2,$$
where we can assume that $(\varepsilon_1,\varepsilon_2)$ is a centered Gaussian random vector, with unit variances and correlation $r$. This assumption can be used to derive the likelihood of a sample $(Y_{1,i},Y_{2,i},\boldsymbol{X}_i)_{i=1,\dots,n}$.
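More precisely, writing $\Phi$ for the standard Gaussian cdf and $\Phi_2(\cdot,\cdot;r)$ for the cdf of the bivariate standard Gaussian distribution with correlation $r$, the four cell probabilities for a single observation (with the threshold set to 0) are
$$p_{11}=\mathbb{P}(Y_1=1,Y_2=1)=\Phi_2(\boldsymbol{X}^\top\boldsymbol{\beta}_1,\boldsymbol{X}^\top\boldsymbol{\beta}_2;r),$$
$$p_{10}=\Phi(\boldsymbol{X}^\top\boldsymbol{\beta}_1)-p_{11},\qquad p_{01}=\Phi(\boldsymbol{X}^\top\boldsymbol{\beta}_2)-p_{11},\qquad p_{00}=1-p_{11}-p_{10}-p_{01},$$
and the log-likelihood of the sample is the sum, over the observations, of the log of the cell probability corresponding to the observed pair $(y_{1,i},y_{2,i})$. The function below computes exactly that (returning the negative log-likelihood, so that it can be minimized with optim),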
> logV=function(parameter){
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+ sigma=matrix(c(1,CORRELATION,CORRELATION,1),2,2)
+     a11=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
+ for(i in 2:nrow(z)){a11=c(a11,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
+     a10=pnorm(z[1,1],sd=sqrt(sigma[1,1]))-pmnorm(z[1,],varcov=sigma)
+ for(i in
+ 2:nrow(z)){a10=c(a10,pnorm(z[i,1],sd=sqrt(sigma[1,1]))-pmnorm(z[i,],varcov=sigma))}
+     a01=pnorm(z[1,2],sd=sqrt(sigma[2,2]))-pmnorm(z[1,],varcov=sigma)
+ for(i in
+ 2:nrow(z)){a01=c(a01,pnorm(z[i,2],sd=sqrt(sigma[2,2]))-pmnorm(z[i,],varcov=sigma))}
+     a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+      ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+      ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par
(the code is a bit long since I had trouble working properly with matrices – or, more precisely, vectorizing my functions – so I used loops… I am sure it is possible to write better code; a vectorized sketch is given below).
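For instance, the loops over the observations can be replaced by a single apply over the rows of z. The following is only a sketch (not the code used for the results below), assuming the same global objects Y and X and the mnormt package,
> library(mnormt)
> logV2=function(parameter){
+ r=parameter[1]
+ BETA=matrix(parameter[-1],ncol(Y),ncol(X))
+ z=X%*%t(BETA)    # n x 2 matrix of linear predictors
+ sigma=matrix(c(1,r,r,1),2,2)
+ p11=apply(z,1,function(u) pmnorm(u,rep(0,2),varcov=sigma))
+ p10=pnorm(z[,1])-p11
+ p01=pnorm(z[,2])-p11
+ p00=1-p11-p10-p01
+ -sum(log(ifelse(Y[,1]==1,ifelse(Y[,2]==1,p11,p10),
+                 ifelse(Y[,2]==1,p01,p00))))
+ }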
It is possible to generate samples (based on that specific model) to check that we can actually derive proper maximum likelihood estimators,
> library(mnormt)
> set.seed(1)
> n=1000
> r=0.5
> X1=runif(n)
> X2=rnorm(n)
> Y1S=1+5*X1
> Y2S=8-5*X1
> RES=rmnorm(n,mean=c(0,0),varcov=matrix(c(1,r,r,1),2,2))
> YS=cbind(Y1S,Y2S)+RES
> Y1=(YS[,1]>quantile(YS[,1],.5))*1
> Y2=(YS[,2]>quantile(YS[,2],.5))*1
> base=data.frame(i=1:n,Y1,Y2,X1,X2,YS)
> head(base)
  i Y1 Y2        X1          X2      Y1S      Y2S
1 1  0  0 0.2655087  0.07730312 3.177587 5.533884
2 2  0  0 0.3721239 -0.29686864 1.935307 5.089524
3 3  1  0 0.5728534 -1.18324224 4.757848 5.172584
4 4  1  0 0.9082078  0.01129269 4.600029 3.878225
5 5  0  1 0.2016819  0.99160104 2.547362 6.743714
6 6  1  0 0.8983897  1.59396745 5.309974 4.421523

(the two columns on the right are the latent observations, which cannot be used since they are theoretically unobservable). Note that this is a simple regression problem: one of the covariates (X2) is here only to add some noise. First of all, let us look at the marginal regressions, fitted with glm (note that family=binomial uses the logit link by default; for marginal probits one would specify link="probit")

>  reg1=glm(Y1~X1+X2,data=base,family=binomial)
>  reg2=glm(Y2~X1+X2,data=base,family=binomial)
> summary(reg1)
 
Call:
glm(formula = Y1 ~ X1 + X2, family = binomial, data = base)
 
Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.90570  -0.50126  -0.00266   0.49162   2.78256
 
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.291725   0.267149 -16.065   <2e-16 ***
X1           8.656836   0.510153  16.969   <2e-16 ***
X2           0.007375   0.090530   0.081    0.935
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.29  on 999  degrees of freedom
Residual deviance:  726.48  on 997  degrees of freedom
AIC: 732.48

Number of Fisher Scoring iterations: 5
> summary(reg2)
Call:
glm(formula = Y2 ~ X1 + X2, family = binomial, data = base)
Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.74682  -0.51814  -0.00001   0.57969   2.58565
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.91709    0.24399  16.054   <2e-16 ***
X1          -7.89703    0.46277 -17.065   <2e-16 ***
X2           0.18360    0.08758   2.096    0.036 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1386.29  on 999  degrees of freedom
Residual deviance:  777.61  on 997  degrees of freedom
AIC: 783.61
Number of Fisher Scoring iterations: 5
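Before running the joint optimization, the matrices Y and X used inside logV have to be built from the simulated data (this step is implicit above); for instance,
> Y=cbind(Y1,Y2)
> X=cbind(1,X1,X2)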

Here, the optimization yields,

> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par
> OPT[1]
[1] 0.5261382
> matrix(OPT[2:7],2,3)
          [,1]      [,2]       [,3]
[1,] -2.451721  4.908633 0.01600769
[2,]  2.241962 -4.539946 0.10614807

Note that the coefficients we have obtained are almost identical to the ones obtained with the standard R procedure (here, Zelig's bivariate probit, which relies on the binom2.rho family),

>  library(Zelig)
>  REG= zelig(list(mu1=Y1~X1+X2,
+             mu2=Y2~X1+X2,
+     rho=~1),
+     model="bprobit",data=base)
>  summary(REG)
 
Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
    rho = ~1), model = "bprobit", data = base)
 
Pearson Residuals:
                 Min        1Q     Median      3Q     Max
probit(mu1) -10.5442 -0.377243  0.0041803 0.36709 8.60398
probit(mu2)  -7.8547 -0.376888  0.0083715 0.42923 5.88264
rhobit(rho) -13.8322 -0.091502 -0.0080544 0.37218 0.85101
 
Coefficients:
                  Value Std. Error   t value
(Intercept):1 -2.451699   0.135369 -18.11116
(Intercept):2  2.241964   0.125072  17.92536
(Intercept):3  1.169461   0.189771   6.16249
X1:1           4.908617   0.252683  19.42602
X1:2          -4.539951   0.233632 -19.43203
X2:1           0.015992   0.050443   0.31703
X2:2           0.106154   0.049092   2.16235
 
Number of linear predictors:  3
 
Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)
 
Dispersion Parameter for binom2.rho family:   1
 
Residual Deviance: 1460.355 on 2993 degrees of freedom
 
Log-likelihood: -730.1774 on 2993 degrees of freedom
 
Number of Iterations: 3

> matrix(coefficients(REG)[c(1:2,4:7)],2,3)
          [,1]      [,2]       [,3]
[1,] -2.451699  4.908617 0.01599183
[2,]  2.241964 -4.539951 0.10615443
The estimated correlation is also the same; the third coefficient, (Intercept):3, is on the rhobit scale, so it has to be transformed back,
> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)
[1] 0.5260951
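Indeed, the rhobit link used for the correlation parameter maps $(-1,1)$ to the real line, and the line above simply applies its inverse (with $\theta$ the coefficient on the rhobit scale):
$$\text{rhobit}(\rho)=\log\frac{1+\rho}{1-\rho},\qquad\rho=\frac{e^{\theta}-1}{e^{\theta}+1}.$$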

That procedure works well and can be extended to ordinal responses (not only binary ones), or to three-dimensional problems,

logV=function(beta){
BETA=matrix(beta[4:(3+ncol(Y)*ncol(X))],ncol(Y),ncol(X))
z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]),X%*%(BETA[3,]))
r12=beta[1]
r23=beta[2]
r31=beta[3]
s1=s2=s3=1
sigma=matrix(c(s1^2,r12*s1*s2,r31*s1*s3,
               r12*s1*s2,s2^2,r23*s2*s3,
               r31*s1*s3,r23*s2*s3,s3^2),3,3)
sigma1=matrix(c(s2^2,r23*s2*s3,
                r23*s2*s3,s3^2),2,2)
sigma2=matrix(c(s1^2,r31*s1*s3,
                r31*s1*s3,s3^2),2,2)
sigma3=matrix(c(s1^2,r12*s1*s2,
                r12*s1*s2,s2^2),2,2)
    a111=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a111=c(a111,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
    a011=pmnorm(z[1,2:3],varcov=sigma1)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a011=c(a011,pmnorm(z[i,2:3],varcov=sigma1)-pmnorm(z[i,],varcov=sigma))}
    a101=pmnorm(z[1,c(1,3)],varcov=sigma2)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a101=c(a101,pmnorm(z[i,c(1,3)],varcov=sigma2)-pmnorm(z[i,],varcov=sigma))}
    a110=pmnorm(z[1,1:2],varcov=sigma3)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a110=c(a110,pmnorm(z[i,1:2],varcov=sigma3)-pmnorm(z[i,],varcov=sigma))}
    a100=pnorm(z[1,1],sd=s1)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a100=c(a100,pnorm(z[i,1],sd=s1)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
    a010=pnorm(z[1,2],sd=s2)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(2,3)],varcov=sigma1)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a010=c(a010,pnorm(z[i,2],sd=s2)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(2,3)],varcov=sigma1)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
    a001=pnorm(z[1,3],sd=s3)-pmnorm(z[1,c(2,3)],varcov=sigma1)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a001=c(a001,pnorm(z[i,3],sd=s3)-pmnorm(z[i,c(2,3)],varcov=sigma1)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
    a000=1-a111-a011-a101-a110-a001-a010-a100
 
a111[a111<=0]=1e-50
a110[a110<=0]=1e-50
a101[a101<=0]=1e-50
a011[a011<=0]=1e-50
a100[a100<=0]=1e-50
a010[a010<=0]=1e-50
a001[a001<=0]=1e-50
a000[a000<=0]=1e-50
 
-sum(((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==0))*log(a111) +
     ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==0))*log(a011) +
     ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==0))*log(a101) +
     ((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==1))*log(a110) +
     ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==0))*log(a001) +
     ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==1))*log(a010) +
     ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==1))*log(a100) +
     ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==1))*log(a000) )
}
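This trivariate version can then be optimized exactly like the bivariate one. A minimal sketch of the call, assuming Y now has three binary columns and X still contains an intercept column (so there are 3 correlations and 3*ncol(X) regression coefficients),

OPT=optim(fn=logV,par=c(0,0,0,rep(1,3*ncol(X))),method="BFGS")$par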

A strong assumption in that bivariate model is that the residuals have a Gaussian structure. It is possible to relax that assumption

  • marginally: for instance, if we use a logistic cumulative distribution function, then we will have a bivariate logit regression
  • in terms of dependence structure: it is possible to consider a copula other than the Gaussian one, e.g. Gumbel's copula (also called the bivariate logistic copula), or Clayton's
Here, the following code can be used to extend the model to non-Gaussian structures,
> F=function(x,r){pmnorm(x,rep(0,length(x)),
+                 varcov=matrix(c(1,r,r,1),2,2))}
> Fx=function(x1){F(c(x1,1e40),0)}
> Fy=function(x2){Fx(x2)}
> 
> logVgen=function(parameter){
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+     a11=F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a11=c(a11,F(z[i,],r=CORRELATION))}
+     a10=Fx(z[1,1])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a10=c(a10,Fx(z[i,1])-F(z[i,],r=CORRELATION))}
+     a01=Fy(z[1,2])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a01=c(a01,Fy(z[i,2])-F(z[i,],r=CORRELATION))}
+     a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+      ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+      ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
>
> beta0=c(0,1,1,1,1,1,1)
> (OPT=optim(fn=logVgen,par=beta0,method="BFGS")$par)
[1]  0.52613820 -2.45172059  2.24196154  4.90863292 -4.53994592  0.01600769
[7]  0.10614807
There were 23 warnings (use warnings() to see them)
E.g. with Clayton's copula (using the copula package),
> library(copula)
> F=function(x,r){pCopula(pnorm(x),
+               claytonCopula(r,dim=2))}
> Fx=function(x1){pnorm(x1)}   # margins remain standard Gaussian
> Fy=function(x2){pnorm(x2)}
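And for the bivariate logit regression mentioned above, one can keep logVgen as it is and simply plug in logistic margins; a minimal (untested) sketch, here with a Gaussian copula for the dependence structure,
> library(copula)
> F=function(x,r){pCopula(plogis(x),
+               normalCopula(r,dim=2))}
> Fx=function(x1){plogis(x1)}
> Fy=function(x2){plogis(x2)}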
  • An application to school tests

Consider the following dataset,

hsb2=read.table("http://freakonometrics.free.fr/hsb2.csv",
        header=TRUE, sep=",")
math_male=hsb2$math[hsb2$female==0]
write_male=hsb2$write[hsb2$female==0]
math_female=hsb2$math[hsb2$female==1]
write_female=hsb2$write[hsb2$female==1]
plot(math_female, write_female, type="p",
     pch=19,col="red",xlab="maths",ylab="writing",cex=.8)
points(math_male, write_male, cex=1.2, col="blue")

Here are the math scores plotted against the writing scores, with girls in red and boys in blue. The variables in the dataset are
  female :
    0: male
    1: female
  race :
    1: hispanic
    2: asian
    3: african-amer
    4: white
  ses :
    1: low
    2: middle
    3: high
  schtyp : type of school
    1: public
    2: private
  prog : type of program
    1: general
    2: academic
    3: vocation
  read : reading score
  write : writing score
  math : math score
  science : science score
  socst : social studies score
We can try to understand the correlation between math and writing skills. Covariates can be the sex of the child, and his or her reading skills. The question is then: are students who are good in math and in writing simply students who can read well?

Here the code is simply
> W=hsb2$write>=50
> M=hsb2$math>=50
> base=data.frame(Y1=W,Y2=M,
+             X1=hsb2$female,X2=hsb2$read)
>
> library(Zelig)
> REG= zelig(list(mu1=Y1~X1+X2,
+             mu2=Y2~X1+X2,
+     rho=~1),
+     model="bprobit",data=base)
> summary(REG)
 
Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
    rho = ~1), model = "bprobit", data = base)
 
Pearson Residuals:
                Min        1Q  Median      3Q    Max
probit(mu1) -4.7518 -0.502594 0.15038 0.53038 1.8592
probit(mu2) -3.4243 -0.653537 0.23673 0.67011 2.6072
rhobit(rho) -4.9821  0.010481 0.13500 0.40776 2.9171
 
Coefficients:
                  Value Std. Error  t value
(Intercept):1 -5.484711   0.787101 -6.96825
(Intercept):2 -4.061384   0.633781 -6.40818
(Intercept):3  1.332187   0.322175  4.13497
X1:1           1.125924   0.233550  4.82092
X1:2           0.167258   0.202498  0.82598
X2:1           0.103997   0.014662  7.09286
X2:2           0.082739   0.012026  6.88017
 
Number of linear predictors:  3
 
Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)
 
Dispersion Parameter for binom2.rho family:   1
 
Residual Deviance: 364.51 on 593 degrees of freedom
 
Log-likelihood: -182.255 on 593 degrees of freedom
 
Number of Iterations: 3
> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)
[1] 0.5824045
with a remaining correlation among residuals of 0.58. So the sex of the student and his or her reading skill alone cannot explain the correlation between math and writing skills. With our previous code, we have here
> beta0=c((exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1),
+      summary(REG)@coef3[c(1:2,4:7),1])
> beta0
              (Intercept):1 (Intercept):2          X1:1          X1:2
0.58240446   -5.48471133   -4.06138412    1.12592427    0.16725842
X2:1          X2:2
0.10399668    0.08273879
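(as before, logV reads the matrices Y and X from the workspace, so for this dataset they have to be rebuilt first, e.g.
> Y=cbind(base$Y1,base$Y2)*1
> X=cbind(1,base$X1,base$X2)
before calling optim)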
> (OPT=optim(fn=logV,par=beta0,method="BFGS")$par)
(Intercept):1 (Intercept):2          X1:1          X1:2
0.5824045    -5.4847113    -4.0613841     1.1259243     0.1672584
X2:1          X2:2
0.1039967     0.0827388

i.e. we obtain (almost) exactly the same estimators. But here I used the estimators given by R as starting values for the optimization procedure. If we change them, hopefully the maximum likelihood estimator is robust to that choice,

> (OPT=optim(fn=logV,par=beta0/2,method="BFGS")$par)
              (Intercept):1 (Intercept):2          X1:1          X1:2
   0.58233360   -5.49428984   -4.06839571    1.12696594    0.16760347
         X2:1          X2:2
   0.10417767    0.08287409
There were 12 warnings (use warnings() to see them)
So, once again, it is possible to optimize a likelihood function numerically, and it works.
