Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The purpose of this blog post is to review the derivation of the logit estimator and the interpretation of model estimates. Logit models are commonly used in statistics to test hypotheses related to binary outcomes, and the logistic classifier is commonly used as a pedagogic tool in machine learning courses as a jumping off point for developing more sophisticated predictive models. A secondary goal is to clarify some of the terminology related to logistic models, which – as should already be clear given the interchanging usage of “logit” and “logistic” – may be confusing. A tertiary motivation is to have a post that future posts elaborating covering causal inference and classifiers for categorical outcomes can link back to.

The following are points to keep in mind:

1. The terms “logit model”, “logistic model”, and “logistic regression model” all refer to the same thing; usage varies by discipline.
2. Logistic regression can be interpreted in many ways, but the most common are in terms of odds ratios and predicted probabilities.
• Predicted probabilities are prefered by most social scientists and the machine learning community while odds ratios are more common in biostatistics and epidemiology.
• Interpreting how much probabilities change given a change in one predictor requires setting values for all predictors.
• Interpreting how much odds change for a change in one predictor does not require taking into account other predictors, though thinking of terms of “odds” is less intuitive.

To begin, the primary reasons we prefer not to use linear regression for categorical outcomes are the following:

1. Impossible predictions, $$p(y = 1) \notin [0,1]$$
2. Heteroskedasticity (errors have non-zero variance only at $$y = 0$$ and $$y = 1$$)
3. Improper functional forms (assumes constant rate of change, even at extremes)

Say we are trying to determine tolerance for Justin Bieber by alcohol consumption as measured by blood/alcohol levels.

Interpreting the y-axis as the probability of having Bieber fever, there are predictions less than zero. Also, there are only observations when Bieber fever is 0 or 1, and the model assumes that each increase of .1 in blood/alcohol levels has the same effect on the probability of Bieber fever at low, medium, and high levels of consumption.

We can, however, assume that Bieber fever is a continuous concept. If it were continuous, the linear model may apply without these pathologies. As a bridge between the unobserved continuous variable and the observed dichotomization, assume that a threshold $$\tau$$ exists such that, when latent Bieber fever is below it, we observe no Bieber fever, and when latent Bieber fever is above it, we observe Yes.

To generalize, denote a latent (unobserved) continuous outcome as $$y^{\ast}$$. We need to map the linear model for $$y^*$$ onto the range $$[0, 1]$$ for the probabilities. How do we get there?

We can fall back on our linear model for the latent variable $$y^{\ast}$$ to determine the appropriate function that maps the linear model onto a probability.

$p(y=1|\mathbf{x}) = p(y^* > \tau | \mathbf{x})$

That is, the probability of observing a one is equal to the probability of observing a value of $$y^{\ast}$$ greater than the threshold $$\tau$$.

To simplify, set $$\tau = 0$$. We can do this because $$y^{\ast}$$ is unobserved, so its scale is arbitrary. Then

\begin{aligned} p(y=1|x) &= p(y^* > \tau | x) \\ &= p(y^* > 0 | x) \end{aligned}

Substitute the linear model in for $$y^{\ast}$$ and rearrange

\begin{aligned} p(y=1|x) &= p(x\beta + \epsilon > 0 | x) \\ &= p(\epsilon > – x \beta | x ) \\ &= p(\epsilon \leq x\beta | x ) \end{aligned}

The last line

$p(y=1|x) = p(\epsilon \leq x\beta | x )$

says that $$p(y = 1 | x)$$ is equal to the cumulative distribution function for $$\epsilon$$ at a given value of $$x$$.

Make an assumption that this error distribution follows a standard logistic distribution. The cdf for a standard logistic distribution is

$F(\epsilon) = \frac{\mbox{exp}(\epsilon)}{1 + \mbox{exp}(\epsilon)} .$

Because

$p(y=1|x) = p(\epsilon \leq x\beta | x )$

is the cdf evaluated at $$x\beta$$, we can write this as

$p(y=1|x) = F(x\beta) .$ The $$x\beta$$ is a linear transformation. It gets stuck inside the standard logistic cdf, which maps the results onto the $$[0,1]$$ range.

Recall that the pmf for a Bernoulli distribution is

$f(y_i) = \pi_i^y(1-\pi_i)^{1-y},$

and that

$\pi_i = p(y=1|x) = p(\epsilon \leq x\beta | x ) .$

Assuming $$N$$ observations that are independent, the likelihood function is

$L(\beta | y, x) = \Pi_{i=1}^{N}\pi_i .$

Combining and a little re-writing:

$L(\beta | y, X) = \Pi_{y=1} p(y_i = 1 | x_i)\Pi_{y=0}[1-p(y_i) = 1 | x]$

$L(\beta | y, X) = \Pi_{y=1} p(y_i = 1 | x_i)\Pi_{y=0}[1-p(y_i) = 1 | x]$

Substitute what we derived for $$p(y = 1)$$

$L(\beta | y, X) = \Pi_{y=1}F(x\beta)\Pi_{y=0}[1-F(x\beta)]$

Taking logs for tractability

$L(\beta | y, X) = \Sigma_{y=1}\mbox{ln}F(x\beta)\Sigma_{y=0}\mbox{ln}[1-F(x\beta)]$

Optimizing numerically gives us our estimates for $$\beta$$.

An example using data from the American National Election Studies

Who voted for Trump?

Test for associations by

1. Age
2. Education
3. Gender

Fit the logistic regression model.

trump_model <- glm(vote ~ gender + educ + age, data = anes,
family = binomial(link = "logit"))

Prettying up the output:

estimates <- round(coef(trump_model), 3)

coef(summary(trump_model)) %>%
as.data.frame %>%
rownames_to_column("Variable") %>%
mutate_if(is.numeric, funs(round(.,3))) %>%
var_mapping(Variable)  %>%
kable(align = c("l",rep("c",4)))
Variable Estimate Std. Error z value Pr(>|z|)
Constant -1.051 0.255 -4.127 0.000
Female -0.374 0.085 -4.384 0.000
Completed HS 0.655 0.231 2.839 0.005
College < 4 Years 0.696 0.218 3.195 0.001
College 4 Year Degree 0.411 0.222 1.853 0.064
Advanced Degree -0.424 0.229 -1.850 0.064
Age 0.015 0.002 6.026 0.000

The var_mapping function was created for these data to map variable names to clean labels. See code at the bottom of this post.

The estimates returned by the glm() function are the coefficients for the linear part of the logit model,

$y^* = \alpha + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_kx_k + \epsilon$

The prediction for a 55-year-old male who finished high school but did not go to college is:

pp1 <- estimates[1]+ estimates[2]*0 + estimates[3]*1 + estimates[4]*0 + estimates[5]*0 + estimates[6]*0 + estimates[7]*55

$\begin{array}[r l l] \hat{y^*} &=& -1.051 + -0.374(0) + 0.655(1) + \\\ & & 0.696(0) + 0.411(0) + -0.424(0) + *0.015(55) \\ &=& 0.429 \end{array}$

The scale of $$y^*$$ is arbitray, so the meaning of this value is ambiguous.

$\begin{array} \mbox{Pr}(y=1|x) &= F(x\beta) \\ &= F(0.429) \\ & = \frac{\mbox{exp}(0.429)}{1 + \mbox{exp}(0.429)} \\ & = 0.61 \end{array}$

In R,

plogis(.429)
## [1] 0.6056349

Note that the coefficient estimates are for the linear model regressing $$y^*$$ on the independent variables.

The model for predicted probabilities is not linear. A change in x has a non-constant effect on the change in probability.

Let x be a vector of $$k > 1$$ independent variables, and let $$\beta$$ be the corresponding coefficients. The partial derivitive for a change in one independent variable $$x_k$$ is

$\begin{eqnarray} \frac{\partial Pr(y = 1 | x_k)}{\partial x_k} &= \frac{\partial F(x\beta)}{x_k} \\ &= \frac{dF(x\beta)}{dx\beta}\frac{\partial x\beta}{\partial x_k} \\ &= f(x\beta)\beta_k . \end{eqnarray}$

This means that the change in the predicted probability depends not just on the value of $$x_k$$ but also on the values of all of the $$x$$s.

When we use the model to predict a probability, we have to assume values for all of the variables, e.g. a 55-year-old male who finished high school but did not go to college.

To get a sense of how predicted probabilities change, the base R expand.grid() function is quite helpful. We can create a grid of different combinations of independent variables:

predict_data <- expand.grid(educ    = levels(anes$educ), gender = levels(anes$gender),
age     = seq(20,90, by = 10))
##                     educ gender age
## 1       HS Not Completed   Male  20
## 2           Completed HS   Male  20
## 3      College < 4 Years   Male  20
## 4  College 4 Year Degree   Male  20
## 5        Advanced Degree   Male  20
## 6       HS Not Completed Female  20
## 7           Completed HS Female  20
## 8      College < 4 Years Female  20
## 9  College 4 Year Degree Female  20
## 10       Advanced Degree Female  20
## 11      HS Not Completed   Male  30
## 12          Completed HS   Male  30
## 13     College < 4 Years   Male  30
## 14 College 4 Year Degree   Male  30
## 15       Advanced Degree   Male  30

Plug these in as the new data frame for the predict() method for glm objects.

pred_probs <- predict_data %>%
mutate(prob_trump = predict(trump_model, newdata = predict_data,
type = "response"))

Note the type argument equals response. By default, predict.glm() will return the estimate of $$y^*$$.

Use these to generate plots.

ggplot(pred_probs, aes(x = age, y = prob_trump, group = educ, color = educ)) + geom_line() +
facet_grid(~ gender) + labs(x = "Age (Years)", y = "Probability of Voting for Trump") +
scale_color_discrete(name="Highest\nEducation")

These are of course sample estimates but are presented without confidence intervals. How can we assess uncertainty?

1. Nonparametric bootstrap
2. Parametric bootstrap (e.g. zelig)
3. Delta method

The last of these will be covered in a subsequent blog post.

Log-Odds

Note that predicted probabilties require specifying values for all covariates just to interpret one independent variable. The logit model can also be derived as a model of log odds, which does not require setting values for all predictors.

Odds versus probability:

• Probability ranges from 0 (impossible) to 1 (happens with certainty).
• Odds is the ratio of the probability something happens to the probability it won’t happen.

Example: - The probability I roll a 7 on the first roll of two dice is .167. - This means the probability of rolling anything else is 1 - .167 = .833 - The odds of rolling anything other than a seven are therefore .833/.167 = 5.
- That is, the odds are nearly five to one that you will roll something other than a seven.

The odds of rolling the seven are .167/.833 = .2. But how should we interpret this since it is less than one?

• Actually, this is just the inverse of the odds of not rolling seven.
• $$.2 = \frac{1}{5}$$.
• Compare to $$\frac{5}{1}$$
• That is, the odds are nearly five to one that you will roll something other than a seven.

The odds ratio is (surprise, surprise) the ratio of the odds.

• The odds of rolling a 7 is .2.
• The odds of rolling anything else is 5.
• The odds ratio is 5/.2 = 25.
• That is, the odds of not rolling a seven are 25 times larger than the odds of rolling a seven.

It turns out that, in the case of the logit model, $$y^{\ast}$$ is equivalent to the log of the odds.

$\mbox{log}\left(\frac{Pr(y=1| X_i)}{1-Pr(y=1|X_i)}\right) = X_i\beta = y_i^{\ast}$

A one-unit change in $$x_k$$ corresponds to a $$\beta_k$$-unit change in the log of the odds of choosing 1. That is, our model is linear in log-odds.

The problem is that nobody really thinks in terms of “the natural logarithm of the odds of an event.”

So to interpret, we unlog the left-hand side Let $$\left( \Omega(X)\right)$$ be the odds. Because

$\mbox{log}\left( \Omega(X)\right) = \mbox{log}\left(\frac{p(y=1|X)}{1-p(y=1|X)}\right) = X\beta ,$

then

$\begin{eqnarray} \Omega(X) &= \frac{p(y=1|X)}{1-Pr(y=1|X)} = \mbox{exp}\left(X\beta\right) \\ &= \mbox{exp}\left(\alpha + \beta_1X_1 + \ldots + \beta_kX_k\right) \\ &= \mbox{exp}(\alpha)\mbox{exp}(\beta_1X_1) \ldots \mbox{exp}(\beta_kX_k) . \end{eqnarray}$

where the last line comes from the rules for exponents.

Interpretation can now be done in terms of odds ratios. For a change of $$\delta$$ in an independent variable, say $$x_1$$, the odds ratio is

$\begin{eqnarray} \frac{\Omega(x_1 + \delta, X)}{\Omega(x_1, X)} &= \frac{\mbox{exp}(\alpha)\mbox{exp}(\beta_1x_1)\mbox{exp}(\beta_1\delta)\mbox{exp}(\beta_2x_2) \ldots \mbox{exp}(\beta_kx_k)} {\mbox{exp}(\alpha)\mbox{exp}(\beta_1x)\mbox{exp}(\beta_2x_2) \ldots \mbox{exp}(\beta_kx_k)} \\ & \\ & = \mbox{exp} ( \beta_1 \delta ) \end{eqnarray}$

A unit-change in $$x_k$$ ($$\delta=1$$) thus leads to a change of $$e^{\beta_k}$$ in the odds.

This interpretation of odds ratios is the following:

• A value greater than one means the odds are getting larger.
• A value less than one means the odds are getter smaller.
• A value of one means there is no change in the odds for a change in $$x_k$$.

Also,

• Interpretation of one variable does not depend on values of other variables.
• A hand calculator can return $$e^{\beta_k}$$, whereas predicted probabilities require a more complicated transformation.

How do we interept the following?

Variable Estimate Std. Error z value Pr(>|z|)
Constant -1.051 0.255 -4.127 0.000
Female -0.374 0.085 -4.384 0.000
Completed HS 0.655 0.231 2.839 0.005
College < 4 Years 0.696 0.218 3.195 0.001
College 4 Year Degree 0.411 0.222 1.853 0.064
Advanced Degree -0.424 0.229 -1.850 0.064
Age 0.015 0.002 6.026 0.000
• The odds of voting for Trump are $$100\times\mbox{exp}(-.374) - 1$$ = 68% lower for women compared to men.
• The odds of voting for Trump are $$100\times\mbox{exp}(.654) - 1$$ = 191% higher for those with only a high school diploma compared to those without a high school diploma.
• The odds of voting for Trump are $$100\times\mbox{exp}(.696) - 1$$ = 200% higher for those with with some college (but no 4-year degree) compared to those without a high school diploma.
• Each increase in age of one year leads to a $$100\times\mbox{exp}(.015) - 1$$ = 101% increase in the odds of voting for Trump.

Again, we have sample estimates. Getting confidence intervals for odds ratios is a little more straightforward compared to predicted probabilities.

1. Determine the 95% confidence interval around the coefficients
2. Exponentiate the lower and upper boundaries.

In R,

confint(trump_model) %>%
exp
## Waiting for profiling to be done...
##                               2.5 %    97.5 %
## (Intercept)               0.2109995 0.5736639
## genderFemale              0.5820164 0.8130810
## educCompleted HS          1.2282458 3.0384294
## educCollege < 4 Years     1.3129020 3.0918562
## educCollege 4 Year Degree 0.9792916 2.3407350
## age                       1.0102309 1.0201818

Note that exp() is a nonlinear transformation, so the confidence intervals will not always be symmetric, especially when standard errors are large.

Code

This is a post on interpretation, not on coding, so much of the R code has been elided in the above. The data were downloaded from the ANES website. The following code was used to clean the data and load the packages used in this analysis.

library(tidyverse)
library(ggplot2)
library(knitr)
library(boot)

set.seed(12345)

anes <- read_delim("data/anes_timeseries_2016_rawdata.txt", delim = "|") %>%
select(vote = V162034a, V161270, gender = V161342, age = V161267) %>%
mutate_all(as.numeric) %>%
filter(vote %in% c(1,2) & gender %in% c(1,2)) %>%
mutate(vote = factor(vote, levels = 1:2, labels = c("Clinton", "Trump")),
educ = case_when(
V161270 %in% 1:8 ~ 1,
V161270 %in% 9 ~ 2,
V161270 %in% 10:12 ~ 3,
V161270 %in% 13 ~ 4,
V161270 %in% 14:16 ~ 5,
TRUE ~ -999),
gender = factor(gender, levels = 1:2, labels = c("Male", "Female"))) %>%
mutate(educ = factor(educ, level = 1:5, labels = c("HS Not Completed",
"Completed HS",
"College < 4 Years",
"College 4 Year Degree",
filter(!is.na(educ) & age >= 18) %>%
dplyr::select(-V161270)

var_mapping <- function(df, var) {

var <- enquo(var)

df_recode <- df %>%
mutate(!!quo_name(var) := case_when(
(!!var) == "(Intercept)" ~ "Constant",
(!!var) == "genderFemale" ~ "Female",
(!!var) == "educCompleted HS" ~ "Completed HS",
(!!var) == "educCollege < 4 Years" ~ "College < 4 Years",
(!!var) == "educCollege 4 Year Degree" ~ "College 4 Year Degree ",
(!!var) == "age" ~ "Age",
TRUE ~ ""))

df_recode

}