% mutate_if(is.character, as.factor) elections_long ## # A tibble: 10 × 4 ## party gender candidates gender_bin ## ## 1 Tories women 57 1 ## 2 Tories men 577 0 ## 3 Labour women 126 1 ## 4 Labour men 508 0 ## 5 LibDem women 136 1 ## 6 LibDem men 496 0 ## 7 Green women 60 1 ## 8 Green men 192 0 ## 9 Other women 135 1 ## 10 Other men 546 0 To specify a Generalized Linear Model that considers Gender (i.e., 1: female, 0: male) as the response and the Political Party as the predictor, we fit the proposed model in R. fit |z|) ## (Intercept) -1.1632 0.1479 -7.864 3.71e-15 *** ## partyLabour -0.2310 0.1783 -1.296 0.195 ## partyLibDem -0.1308 0.1768 -0.740 0.459 ## partyOther -0.2342 0.1764 -1.328 0.184 ## partyTories -1.1516 0.2028 -5.678 1.37e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 2683.2 on 9 degrees of freedom ## Residual deviance: 2628.7 on 5 degrees of freedom ## AIC: 2638.7 ## ## Number of Fisher Scoring iterations: 5 Odds Ratios We can obtain the odds ratio of a women candidate moving from Tories to Liberal Democrats. This corresponds to \[ \frac{\frac{\pi(x = \text{Tories})}{1 - \pi(x = \text{Tories})}\frac{\pi(x = \text{LibDem})}{1 - \pi(x = \text{LibDem})}} \] From the model, we have \(\text{logit}(\pi) = \beta_0 + \beta_1 x\), this is the same as \[ \log \left[ \frac{\pi}{1 - \pi} \right] = \beta_0 + \beta_1 x \implies \frac{\pi}{1 - \pi} = \exp[\beta_0 + \beta_1 x]. \] In R, we obtain the odds ratio as a substraction of the estimated coefficients No. 5 and No. 3 for this case. This is, \(\exp[\beta_5 - \beta3]\). exp(coef(fit)[5] - coef(fit)[3]) ## partyTories ## 0.3602814 Which means that the chances of having a women candidate drop around 65% by moving from Tories to Liberal Democrats. The opposite exercise would tell us that the chances increase around 10% by moving from Liberal Democrats to Tories. exp(coef(fit)[3] - coef(fit)[5]) ## partyLibDem ## 2.775608 Hypothesis Testing Consider the following hypothesis: \(H_0: \beta = 0\) \(H_0: \beta_{LABOUR} = \beta_{LIBDEM}\) \(H_0: \beta_{LABOUR} = \beta_{GREEN}\) To test these hypothesis we can estimate the constrats, followed by their exponentials and the respective confidence intervals. The function to use in this case corresponds to the General Linear Hypotheses. For \(H_0: \beta = 0\) we have library(multcomp) summary(glht(fit, mcp(party = "Tukey")), test = Chisqtest()) ## ## General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Linear Hypotheses: ## Estimate ## Labour - Green == 0 -0.231049 ## LibDem - Green == 0 -0.130770 ## Other - Green == 0 -0.234193 ## Tories - Green == 0 -1.151640 ## LibDem - Labour == 0 0.100278 ## Other - Labour == 0 -0.003145 ## Tories - Labour == 0 -0.920591 ## Other - LibDem == 0 -0.103423 ## Tories - LibDem == 0 -1.020870 ## Tories - Other == 0 -0.917447 ## ## Global Test: ## Chisq DF Pr(>Chisq) ## 1 45.75 4 2.773e-09 The global test returns \(p_{CALCULATED} < p_{CRITICAL}\) (\(p_{CRITICAL} = 0.05\)), therefore we reject this hypothesis. For the other hypothesis we have summary(glht(fit, mcp(party = "Tukey"))) ## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), ## data = elections_long, weights = candidates) ## ## Linear Hypotheses: ## Estimate Std. Error z value Pr(>|z|) ## Labour - Green == 0 -0.231049 0.178269 -1.296 0.689 ## LibDem - Green == 0 -0.130770 0.176760 -0.740 0.946 ## Other - Green == 0 -0.234193 0.176391 -1.328 0.669 ## Tories - Green == 0 -1.151640 0.202841 -5.678 " />

Generalized Linear Models, Part I: The Logistic Model

[This article was first published on Pachá, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Context

Let’s say we are interested in predicting the gender of a candidate for the British General Elections in 1992 by using the Political Parties as a predictor. We have the next data:

library(dplyr)
library(tidyr)

elections <- tibble(
  party = c("Tories", "Labour", "LibDem", "Green", "Other"),
  women = c(57,126,136,60,135),
  men = c(577,508,496,192,546)
)

elections
## # A tibble: 5 × 3
##   party  women   men
##   <chr>  <dbl> <dbl>
## 1 Tories    57   577
## 2 Labour   126   508
## 3 LibDem   136   496
## 4 Green     60   192
## 5 Other    135   546

Being the dependent variable a categorical one, we need to propose a Logistic Model.

Let \(Y_i \mid \pi_i \sim Bin(n, \pi_i)\). If \(n=1\), \(Y_i\) indicates that a candidate is a woman or man.

In this case the Generalized Linear Model matches the probability of success (i.e., the probability of the candidate being a woman if we define that \(Y_i=1\) in that case and zero otherwise).

A good reference for all the mathematical details is McCullag and Nelder, 1983.

Model Specification

Before proceeding, we need to reshape the data.

elections_long <- elections %>% 
  pivot_longer(-party, names_to = "gender", values_to = "candidates") %>% 
  mutate(
    gender_bin = case_when(
      gender == "women" ~ 1L,
      TRUE ~ 0L
    )
  ) %>% 
  mutate_if(is.character, as.factor)

elections_long
## # A tibble: 10 × 4
##    party  gender candidates gender_bin
##    <fct>  <fct>       <dbl>      <int>
##  1 Tories women          57          1
##  2 Tories men           577          0
##  3 Labour women         126          1
##  4 Labour men           508          0
##  5 LibDem women         136          1
##  6 LibDem men           496          0
##  7 Green  women          60          1
##  8 Green  men           192          0
##  9 Other  women         135          1
## 10 Other  men           546          0

To specify a Generalized Linear Model that considers Gender (i.e., 1: female, 0: male) as the response and the Political Party as the predictor, we fit the proposed model in R.

fit <- glm(gender_bin ~ party,
           weights = candidates,
           family = binomial(link = "logit"),
           data = elections_long)

summary(fit)
## 
## Call:
## glm(formula = gender_bin ~ party, family = binomial(link = "logit"), 
##     data = elections_long, weights = candidates)
## 
## Deviance Residuals: 
##      1       2       3       4       5       6       7       8       9      10  
##  16.57  -10.43   20.18  -15.00   20.44  -15.50   13.12  -10.22   20.90  -15.53  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1632     0.1479  -7.864 3.71e-15 ***
## partyLabour  -0.2310     0.1783  -1.296    0.195    
## partyLibDem  -0.1308     0.1768  -0.740    0.459    
## partyOther   -0.2342     0.1764  -1.328    0.184    
## partyTories  -1.1516     0.2028  -5.678 1.37e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2683.2  on 9  degrees of freedom
## Residual deviance: 2628.7  on 5  degrees of freedom
## AIC: 2638.7
## 
## Number of Fisher Scoring iterations: 5

Odds Ratios

We can obtain the odds ratio of a women candidate moving from Tories to Liberal Democrats. This corresponds to

\[ \frac{\frac{\pi(x = \text{Tories})}{1 – \pi(x = \text{Tories})}\frac{\pi(x = \text{LibDem})}{1 – \pi(x = \text{LibDem})}} \]

From the model, we have \(\text{logit}(\pi) = \beta_0 + \beta_1 x\), this is the same as \[ \log \left[ \frac{\pi}{1 – \pi} \right] = \beta_0 + \beta_1 x \implies \frac{\pi}{1 – \pi} = \exp[\beta_0 + \beta_1 x]. \]

In R, we obtain the odds ratio as a substraction of the estimated coefficients No. 5 and No. 3 for this case. This is, \(\exp[\beta_5 – \beta3]\).

exp(coef(fit)[5] - coef(fit)[3])
## partyTories 
##   0.3602814

Which means that the chances of having a women candidate drop around 65% by moving from Tories to Liberal Democrats.

The opposite exercise would tell us that the chances increase around 10% by moving from Liberal Democrats to Tories.

exp(coef(fit)[3] - coef(fit)[5])
## partyLibDem 
##    2.775608

Hypothesis Testing

Consider the following hypothesis:

  1. \(H_0: \beta = 0\)
  2. \(H_0: \beta_{LABOUR} = \beta_{LIBDEM}\)
  3. \(H_0: \beta_{LABOUR} = \beta_{GREEN}\)

To test these hypothesis we can estimate the constrats, followed by their exponentials and the respective confidence intervals. The function to use in this case corresponds to the General Linear Hypotheses.

For \(H_0: \beta = 0\) we have

library(multcomp)

summary(glht(fit, mcp(party = "Tukey")), test = Chisqtest())
## 
##   General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Linear Hypotheses:
##                       Estimate
## Labour - Green == 0  -0.231049
## LibDem - Green == 0  -0.130770
## Other - Green == 0   -0.234193
## Tories - Green == 0  -1.151640
## LibDem - Labour == 0  0.100278
## Other - Labour == 0  -0.003145
## Tories - Labour == 0 -0.920591
## Other - LibDem == 0  -0.103423
## Tories - LibDem == 0 -1.020870
## Tories - Other == 0  -0.917447
## 
## Global Test:
##   Chisq DF Pr(>Chisq)
## 1 45.75  4  2.773e-09

The global test returns \(p_{CALCULATED} < p_{CRITICAL}\) (\(p_{CRITICAL} = 0.05\)), therefore we reject this hypothesis.

For the other hypothesis we have

summary(glht(fit, mcp(party = "Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), 
##     data = elections_long, weights = candidates)
## 
## Linear Hypotheses:
##                       Estimate Std. Error z value Pr(>|z|)    
## Labour - Green == 0  -0.231049   0.178269  -1.296    0.689    
## LibDem - Green == 0  -0.130770   0.176760  -0.740    0.946    
## Other - Green == 0   -0.234193   0.176391  -1.328    0.669    
## Tories - Green == 0  -1.151640   0.202841  -5.678   <1e-04 ***
## LibDem - Labour == 0  0.100278   0.138831   0.722    0.950    
## Other - Labour == 0  -0.003145   0.138361  -0.023    1.000    
## Tories - Labour == 0 -0.920591   0.170806  -5.390   <1e-04 ***
## Other - LibDem == 0  -0.103423   0.136411  -0.758    0.941    
## Tories - LibDem == 0 -1.020870   0.169230  -6.032   <1e-04 ***
## Tories - Other == 0  -0.917447   0.168845  -5.434   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
exp(confint(glht(fit, mcp(party = "Tukey")))[[10]])
##                  Estimate       lwr       upr
## Labour - Green  0.7937008 0.4888605 1.2886313
## LibDem - Green  0.8774194 0.5426472 1.4187205
## Other - Green   0.7912088 0.4898201 1.2780433
## Tories - Green  0.3161179 0.1821238 0.5486956
## LibDem - Labour 1.1054788 0.7579508 1.6123517
## Other - Labour  0.9968603 0.6843517 1.4520757
## Tories - Labour 0.3982834 0.2503412 0.6336541
## Other - LibDem  0.9017453 0.6223457 1.3065802
## Tories - LibDem 0.3602814 0.2274274 0.5707435
## Tories - Other  0.3995379 0.2524721 0.6322699
## attr(,"conf.level")
## [1] 0.95
## attr(,"calpha")
## [1] 2.718522

Here, the differences that are not statistically significant reveal that some parties are similar to each other (in the gender dimension), which is the case for Green vs Labour and Labour vs LibDem but not for Greens vs Tories.

Changing the Reference Factor

Consider that Green is the reference factor in the previous model. To change the reference, we can use the Tories or any other party.

library(forcats)

elections_long$party <- fct_relevel(elections_long$party, "Tories", after = 0L)

fit <- glm(gender_bin ~ party,
           weights = candidates,
           family = binomial(link = "logit"),
           data = elections_long)

fit
## 
## Call:  glm(formula = gender_bin ~ party, family = binomial(link = "logit"), 
##     data = elections_long, weights = candidates)
## 
## Coefficients:
## (Intercept)   partyGreen  partyLabour  partyLibDem   partyOther  
##     -2.3148       1.1516       0.9206       1.0209       0.9174  
## 
## Degrees of Freedom: 9 Total (i.e. Null);  5 Residual
## Null Deviance:       2683 
## Residual Deviance: 2629  AIC: 2639

Now we can compute the differences again

summary(glht(fit, mcp(party = "Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), 
##     data = elections_long, weights = candidates)
## 
## Linear Hypotheses:
##                       Estimate Std. Error z value Pr(>|z|)    
## Green - Tories == 0   1.151640   0.202841   5.678   <1e-04 ***
## Labour - Tories == 0  0.920591   0.170806   5.390   <1e-04 ***
## LibDem - Tories == 0  1.020870   0.169230   6.032   <1e-04 ***
## Other - Tories == 0   0.917447   0.168845   5.434   <1e-04 ***
## Labour - Green == 0  -0.231049   0.178269  -1.296    0.689    
## LibDem - Green == 0  -0.130770   0.176760  -0.740    0.946    
## Other - Green == 0   -0.234193   0.176391  -1.328    0.669    
## LibDem - Labour == 0  0.100278   0.138831   0.722    0.950    
## Other - Labour == 0  -0.003145   0.138361  -0.023    1.000    
## Other - LibDem == 0  -0.103423   0.136411  -0.758    0.941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Green vs Tories are still different, but the sign is reversed!

To leave a comment for the author, please follow the link and comment on their blog: Pachá.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)