# Binomial regression model

November 18, 2013
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, $Y_i\sim\mathcal{B}(p(\boldsymbol{X_i}))$. This year (actually also the year before), I discuss extensions to multinomial regressions$Y_i\sim\mathcal{M}(\boldsymbol{p}(\boldsymbol{X_i}))$where $\boldsymbol{p}(\cdot)$ is a function on some simplex. The multinomial logistic model was mention here. The idea is to consider, for instance with three possible classes

$\boldsymbol{p}(\boldsymbol{X_i})=(p_A(\boldsymbol{X_i}),p_B(\boldsymbol{X_i}),p_C(\boldsymbol{X_i}))\in\mathcal{S}_2$

the following model

$p_A(\boldsymbol{X_i})=\frac{\exp[\boldsymbol{X}_i'\boldsymbol{\alpha}]}{1+\exp[\boldsymbol{X}_i'\boldsymbol{\alpha}]+\exp[\boldsymbol{X}_i'\boldsymbol{\beta}]}$

$p_B(\boldsymbol{X_i})=\frac{\exp[\boldsymbol{X}_i'\boldsymbol{\beta}]}{1+\exp[\boldsymbol{X}_i'\boldsymbol{\alpha}]+\exp[\boldsymbol{X}_i'\boldsymbol{\beta}]}$

and

$p_C(\boldsymbol{X_i})=\frac{1}{1+\exp[\boldsymbol{X}_i'\boldsymbol{\alpha}]+\exp[\boldsymbol{X}_i'\boldsymbol{\beta}]}$

Now, what about a real Binomial model, $Y_i\sim\mathcal{B}(N_i,p(\boldsymbol{X}_i))$, where $N_i$‘s are known. How should we run such a regression model ? Consider the following dataset

```> set.seed(1)
> n=100
> N=1+rpois(n,5)
> X1=runif(n)
> X2=rexp(n)
> s=X2-X1-2
> p=exp(s)/(1+exp(s))
> vY=NULL
> for(i in 1:n){
+ Y=rbinom(1,prob=p[i],size=N[i])
+ vY=c(vY,Y)
+ }
> db=data.frame(Y=vY,N=N,X1,X2)
Y N        X1         X2
1 0 5 0.6547239 0.76318001
2 1 5 0.3531973 1.57271671
3 3 6 0.2702601 1.83564098
4 1 9 0.9926841 0.03715227```

My first idea was to say that it should be simple since $Y\sim\mathcal{B}(N,p)$ if (and only if)

$Y=\sum_{i=1}^N Z_i$
where $Z_i$ are i.i.d. random variables $\mathcal{B}(p)$. So, a natural idea is to generate the dataset containing the $Z_i$‘s

```> vY=vX1=vX2=vN=NULL;
> for(i in 1:n){
+ vY=c(vY,c(rep(0,db\$N[i]-db\$Y[i]),rep(1,db\$Y[i])))
+ vX1=c(vX1,rep(db\$X1[i],db\$N[i]))
+ vX2=c(vX2,rep(db\$X2[i],db\$N[i]))
+ }
> largedb=data.frame(Z=vY,X1=vX1,X2=vX2)
Z        X1       X2
1  0 0.6547239 0.763180
2  0 0.6547239 0.763180
3  0 0.6547239 0.763180
4  0 0.6547239 0.763180
5  0 0.6547239 0.763180
6  0 0.3531973 1.572717
7  0 0.3531973 1.572717
8  0 0.3531973 1.572717
9  0 0.3531973 1.572717
10 1 0.3531973 1.572717
11 0 0.2702601 1.835641
12 0 0.2702601 1.835641
13 0 0.2702601 1.835641
14 1 0.2702601 1.835641
15 1 0.2702601 1.835641
16 1 0.2702601 1.835641```

Then, we run a standard Bernoulli regression on those $Z_i$‘s

`­> reg1=glm(Z~X1+X2,family=binomial,data=largedb)`

But actually, if look around, on the internet, you can see (e.g. in Alan Agresti’s R_web.pdf chapter) that it is possible to run – directly – a binomial regression, using the following syntax,

`> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)`

I was a bit scared because a few weeks ago, I tried two techniques to run a regression on contingency tables, and the output were different (on the standard deviation actually, not the estimation). Here we have the same thing

```> coefficients(summary(reg1))
Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875231 -7.146399 8.908380e-13
X1          -0.9159275  0.3970303 -2.306946 2.105781e-02
X2           1.0564059  0.1360305  7.765952 8.103448e-15
> coefficients(summary(reg2))
Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875234 -7.146392 8.908817e-13
X1          -0.9159275  0.3970313 -2.306941 2.105813e-02
X2           1.0564059  0.1360310  7.765923 8.105285e-15```

almost if we take into account the fact that numerical algorithm might be ran with different starting point. But the going thing is that – theoretically – the output should be exactly the same. Simply because here, we solve the same first order conditions ! The likelihood in the first case was

$\prod_j p(X_{j:i})^{Z_j} [1-p(X_{j:i})]^{1-Z_j}$

which will be simplified as

$\prod_i p(X_i)^{Y_i} [1-p(X_i)]^{N_i-Y_i}$

which is what we have in the second case. The use of the weight function will insure that the variance here are equal.