Logistic regression and categorical covariates

Posted on September 26, 2013 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A short post to get back – for my nonlife insurance course – on the interpretation of the output of a regression when there is a categorical covariate. Consider the following dataset

> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> tail(db)
     Y       X1       X2 X3
995  1 4.801836 20.82947  A
996  1 9.867854 24.39920  C
997  1 5.390730 21.25119  D
998  1 6.556160 20.79811  D
999  1 4.710276 21.15373  A
1000 1 6.631786 19.38083  A

Let us run a logistic regression on that dataset

> reg = glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.45885    1.04646  -4.261 2.04e-05 ***
X1           0.51664    0.11178   4.622 3.80e-06 ***
X2           0.21008    0.07247   2.899 0.003745 ** 
X3B          1.74496    0.49952   3.493 0.000477 ***
X3C         -0.03470    0.35691  -0.097 0.922543    
X3D          0.08004    0.34916   0.229 0.818672    
X3E          2.21966    0.56475   3.930 8.48e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 552.64  on 999  degrees of freedom
Residual deviance: 397.69  on 993  degrees of freedom
AIC: 411.69

Number of Fisher Scoring iterations: 7

Here, the reference is modality $A$ . Which means that for someone with characteristics $(X_1,X_2,X_3=A)$ , we predict the following probability

$p=H(\widehat\beta_0+\widehat\beta_1 X_1+\widehat\beta_2 X_2)$

where $H(\cdot)$ denotes the cumulative distribution function of the logistic distribution

$H(x)=\frac{e^x}{1+e^x}$

For someone with characteristics $(X_1,X_2,X_3=B)$ , we predict the following probability

$p=H(\widehat\beta_0+\widehat\beta_1 X_1+\widehat\beta_2 X_2+\widehat\beta_3^{\ (B)})$

For someone with characteristics $(X_1,X_2,X_3=C)$ , we predict the following probability

$p=H(\widehat\beta_0+\widehat\beta_1 X_1+\widehat\beta_2 X_2+\widehat\beta_3^{\ (C)})$

(etc.) Here, if we accept $H_0:\beta_3^{\ (C)}=0$ (against $H_1:\beta_3^{\ (C)}\neq0$ ), it means that modality $C$ cannot be considerd as different from $A$ .

A natural idea can be to change the reference modality, and to look at the $p$ -values. If we consider the following loop, we get

> M = matrix(NA,5,5)
> rownames(M)=colnames(M)=LETTERS[1:5]
> for(k in 1:5){
+ db$X3 = relevel(X3,LETTERS[k])
+ reg = glm(Y~X1+X2+X3,family=binomial,data=db)
+ M[levels(db$X3)[-1],k] = summary(reg)$coefficients[4:7,4]
+ } 
> M
             A            B            C            D            E
A           NA 0.0004771853 9.225428e-01 0.8186723647 8.482647e-05
B 4.771853e-04           NA 4.841204e-04 0.0009474491 4.743636e-01
C 9.225428e-01 0.0004841204           NA 0.7506242347 9.194193e-05
D 8.186724e-01 0.0009474491 7.506242e-01           NA 1.730589e-04
E 8.482647e-05 0.4743636442 9.194193e-05 0.0001730589           NA

and if we simply want to know if the $p$ -value exceeds – or not – 5%, we get the following,

> M.TF = M>.05
> M.TF
      A     B     C     D     E
A    NA FALSE  TRUE  TRUE FALSE
B FALSE    NA FALSE FALSE  TRUE
C  TRUE FALSE    NA  TRUE FALSE
D  TRUE FALSE  TRUE    NA FALSE
E FALSE  TRUE FALSE FALSE    NA

The first column is obtained when $A$ is the reference, and then, we see which parameter should be considered as null. The interpretation is the following:

$C$ and $D$ are not different from $A$
$E$ is not different from $B$
$A$ and $D$ are not different from $C$
$A$ and $C$ are not different from $D$
$B$ is not different from $E$

Note that we only have, here, some kind of intuition. So, let us run a more formal test. Let us consider the following regression (we remove the intercept to get a model easier to understand)

> library(car)
> db$X3=relevel(X3,"A")
> reg=glm(Y~0+X1+X2+X3,family=binomial,data=db)
> summary(reg)

Coefficients:
    Estimate Std. Error z value Pr(>|z|)    
X1   0.51664    0.11178   4.622 3.80e-06 ***
X2   0.21008    0.07247   2.899  0.00374 ** 
X3A -4.45885    1.04646  -4.261 2.04e-05 ***
X3E -2.23919    1.06666  -2.099  0.03580 *  
X3D -4.37881    1.04887  -4.175 2.98e-05 ***
X3C -4.49355    1.06266  -4.229 2.35e-05 ***
X3B -2.71389    1.07274  -2.530  0.01141 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  397.69  on  993  degrees of freedom
AIC: 411.69

Number of Fisher Scoring iterations: 7

It is possible to use Fisher test to test if some coefficients are equal, or not (more generally if some linear constraints are satisfied)

> linearHypothesis(reg,c("X3A=X3C","X3A=X3D","X3B=X3E"))
Linear hypothesis test

Hypothesis:
X3A - X3C = 0
X3A - X3D = 0
- X3E  + X3B = 0

Model 1: restricted model
Model 2: Y ~ 0 + X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996                     
2    993  3 0.6191      0.892

Here, we clearly accept the assumption that the first three factors are equal, as well as the last two. What is the next step? Well, if we believe that there are mainly two categories, $\{A,C,D\}$ and $\{B,E\}$ , let us create that factor,

> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
> summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.39439    1.02791  -4.275 1.91e-05 ***
X1           0.51378    0.11138   4.613 3.97e-06 ***
X2           0.20807    0.07234   2.876  0.00402 ** 
X3bisBE      1.94905    0.36852   5.289 1.23e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 552.64  on 999  degrees of freedom
Residual deviance: 398.31  on 996  degrees of freedom
AIC: 406.31

Number of Fisher Scoring iterations: 7

Here, all the categories are significant. So we do have a proper model.

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Logistic regression and categorical covariates

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)