Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, . This year (actually also the year before), I discuss extensions to multinomial regressionswhere  is a function on some simplex. The multinomial logistic model was mention here. The idea is to consider, for instance with three possible classes

the following model

and

Now, what about a real Binomial model, , where ‘s are known. How should we run such a regression model ? Consider the following dataset

```> set.seed(1)
> n=100
> N=1+rpois(n,5)
> X1=runif(n)
> X2=rexp(n)
> s=X2-X1-2
> p=exp(s)/(1+exp(s))
> vY=NULL
> for(i in 1:n){
+ Y=rbinom(1,prob=p[i],size=N[i])
+ vY=c(vY,Y)
+ }
> db=data.frame(Y=vY,N=N,X1,X2)
Y N        X1         X2
1 0 5 0.6547239 0.76318001
2 1 5 0.3531973 1.57271671
3 3 6 0.2702601 1.83564098
4 1 9 0.9926841 0.03715227```

My first idea was to say that it should be simple since  if (and only if)

where  are i.i.d. random variables . So, a natural idea is to generate the dataset containing the ‘s

```> vY=vX1=vX2=vN=NULL;
> for(i in 1:n){
+ vY=c(vY,c(rep(0,db\$N[i]-db\$Y[i]),rep(1,db\$Y[i])))
+ vX1=c(vX1,rep(db\$X1[i],db\$N[i]))
+ vX2=c(vX2,rep(db\$X2[i],db\$N[i]))
+ }
> largedb=data.frame(Z=vY,X1=vX1,X2=vX2)
Z        X1       X2
1  0 0.6547239 0.763180
2  0 0.6547239 0.763180
3  0 0.6547239 0.763180
4  0 0.6547239 0.763180
5  0 0.6547239 0.763180
6  0 0.3531973 1.572717
7  0 0.3531973 1.572717
8  0 0.3531973 1.572717
9  0 0.3531973 1.572717
10 1 0.3531973 1.572717
11 0 0.2702601 1.835641
12 0 0.2702601 1.835641
13 0 0.2702601 1.835641
14 1 0.2702601 1.835641
15 1 0.2702601 1.835641
16 1 0.2702601 1.835641```

Then, we run a standard Bernoulli regression on those ‘s

`­> reg1=glm(Z~X1+X2,family=binomial,data=largedb)`

But actually, if look around, on the internet, you can see (e.g. in Alan Agresti’s R_web.pdf chapter) that it is possible to run – directly – a binomial regression, using the following syntax,

`> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)`

I was a bit scared because a few weeks ago, I tried two techniques to run a regression on contingency tables, and the output were different (on the standard deviation actually, not the estimation). Here we have the same thing

```> coefficients(summary(reg1))
Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875231 -7.146399 8.908380e-13
X1          -0.9159275  0.3970303 -2.306946 2.105781e-02
X2           1.0564059  0.1360305  7.765952 8.103448e-15
> coefficients(summary(reg2))
Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875234 -7.146392 8.908817e-13
X1          -0.9159275  0.3970313 -2.306941 2.105813e-02
X2           1.0564059  0.1360310  7.765923 8.105285e-15```

almost if we take into account the fact that numerical algorithm might be ran with different starting point. But the going thing is that – theoretically – the output should be exactly the same. Simply because here, we solve the same first order conditions ! The likelihood in the first case was

which will be simplified as

which is what we have in the second case. The use of the weight function will insure that the variance here are equal.