Logistic regression produces result that are typically interpreted in one of two ways:
 Predicted probabilities
 Odds ratios
Odds are the ratio of the probability that something happens to the probabilty it doesn’t happen.
\[
\Omega(X) = \frac{p(y=1X)}{1p(y=1X)}
\] An odds ratio is the ratio of two odds, each calculated at a different score for \(X\).
There are strengths and weaknesses to either choice.
 Predictored probabilities are intuitive, but require assuming a value for every covariate.
 Odds ratios do not require specifying values for other covariates, but ratios of ratios are not always intuitive.
Illustration using 2016 ANES: vote for Trump or Clinton. See prior blog post for details.
trump_model < glm(vote ~ gender + educ + age, data = anes,
family = binomial(link = "logit"))
estimates < round(coef(trump_model), 3)
results_tab < tidy(trump_model) %>%
mutate_if(is.numeric, funs(round(.,3))) %>%
var_mapping(term) %>%
dplyr::rename(Estimate = term, Beta = estimate, SE = std.error, z = statistic, p = p.value)
kable(results_tab, align = c("l",rep("c",4)))
Estimate  Beta  SE  z  p 

Constant  1.051  0.255  4.127  0.000 
Female  0.374  0.085  4.384  0.000 
Completed HS  0.655  0.231  2.839  0.005 
College < 4 Years  0.696  0.218  3.195  0.001 
College 4 Year Degree  0.411  0.222  1.853  0.064 
Advanced Degree  0.424  0.229  1.850  0.064 
Age  0.015  0.002  6.026  0.000 
Interpreting as odds ratios:
 The odds of voting for Trump are \(100\times[\mbox{exp}(.374) – 1]\) = 31% lower for women compared to men.
 The odds of voting for Trump are \(100\times[\mbox{exp}(.655) – 1]\) = 93% higher for those with only a high school diploma compared to those without a high school diploma.
 The odds of voting for Trump are \(100\times [\mbox{exp}(.696) – 1]\) = 101% higher for those with with some college (but no 4year degree) compared to those without a high school diploma.
 Each increase in age of one year leads to a \(100\times [\mbox{exp}(.015) – 1]\) = 2% increase in the odds of voting for Trump.
Interpreting as predicted probabilities:
predict_data < expand.grid(educ = levels(anes$educ),
gender = levels(anes$gender),
age = seq(20,90, by = 10))
pred_probs < predict_data %>%
mutate(prob_trump = predict(trump_model, newdata = predict_data,
type = "response"))
ggplot(pred_probs, aes(x = age, y = prob_trump, group = educ, color = educ)) + geom_line() +
facet_grid(~ gender) + labs(x = "Age (Years)", y = "Probability of Voting for Trump") +
scale_color_discrete(name="Highest\nEducation")
The problem:
These are sample estimates. How can we assess levels of uncertainty?
For sample estimates, we would like a standard error (SE) or a 95% confidence interval (the former usually used to create the latter).
Start with odds ratios, which at first seems easiest. How can we get a 95% CI around the odds ratio?
 The odds ratio is just \(\mbox{exp}(\beta_k)\).
 Software gives us a 95% confidence interval around \(\beta_k\).
 Create the 95% confidence interval aroune \(\beta_k\) as \(\beta_k \pm 1.96 \times \mbox{SE}_{\beta_k}\).
 Exponentiate the lower limit and the upper limit of 95% CI around \(\beta_k\).
This is what is typically done in R with exp(confint(model_object))
.
Note that, because exp()
is a nonlinear transformation, the resulting confidence intervals will be asymmetric. To aid illustration, add a random noise variable (whose standard error will be large).
set.seed(1)
anes < anes %>%
mutate(noise = rnorm(nrow(anes), 0, .05))
Fit the model.
trump_model < glm(vote ~ gender + educ + age + noise, data = anes,
family = binomial(link = "logit"))
Exponentiate the 95% confidence interval around \(\beta_k\).
confint(trump_model) %>%
exp
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.2110361 0.5737587
## genderFemale 0.5820401 0.8131175
## educCompleted HS 1.2277695 3.0374878
## educCollege < 4 Years 1.3127215 3.0913909
## educCollege 4 Year Degree 0.9793854 2.3409015
## educAdvanced Degree 0.4181680 1.0294183
## age 1.0102303 1.0201813
## noise 0.2170954 5.4118869
A graph is even better:
cis < exp(confint(trump_model)) %>%
as.data.frame %>%
rownames_to_column("Variable") %>%
var_mapping(Variable)
## Waiting for profiling to be done...
graph_data < tidy(trump_model) %>%
mutate_if(is.numeric, funs(round(.,3))) %>%
dplyr::rename(Variable = term, Beta = estimate, SE = std.error, z = statistic, p = p.value) %>%
var_mapping(Variable) %>%
mutate(OR = exp(Beta)) %>%
left_join(cis) %>%
filter(Variable != "Constant")
## Joining, by = "Variable"
ggplot(data = graph_data, aes(x = Variable, y = `OR`, ymin = `2.5 %`, ymax = `97.5 %`)) +
geom_pointrange() +
geom_hline(yintercept = 1, lty = 2) +
coord_flip() +
xlab("Variable") + ylab("Odds Ratio with 95% CI")
Delta Method Standard Errors for Odds Ratios
Alternatively, we can use the SE for the odds ratio to determine a normal (and symmetric) approximation for the 95% CI. But what is the SE for the odds ratio?
An approach known as the delta method is used frequently to come up with standard errors for nonlinear transformations of model parameters.
It is based on computing the variance for a Taylor series linearization of the function.
A Taylor Series rewrites a function at a given location \(a\) as a (possibly infinite) sum of the function’s derivatives.
\[
f(x) = f(x) + f'(a)(x – a) + \frac{f''(a)}{2!}(x – a)^2 + \frac{f'''(a)}{3!}(x – a)^3 + \ldots
\]
A Taylor series approximation chops off all but the first one or two derivatives. A linear approximation to \(f(x)\) at \(a\) is thus
\[
f(x) = f(x) + f'(a)(x – a)
\]
Take a linear transformation of a random variable \(x\).
\[
f(x) = a + bx
\]
The variance of \(f(x)\) is known to be
\[
\mbox{Var}(a + bx) = b^2\mbox{Var}(x)
\] or
\[
\mbox{Var}(a + bx) = b\mbox{Var}(x)b
\] or
\[
\mbox{Var}(a + bx) = f'(x)\mbox{Var}(x)f'(x)
\]
Generalizing to any (univariate) differentiable function of a random varaible \(x\) with mean \(\mu\), we can approximate a function of \(x\) at \(\mu\) with
\[
f(x) \approx f(\mu) + f'(x)(x – \mu)
\]
meaning that the variance of the function is
\[
\mbox{Var}\left(f(x)\right) = f'(\mu)\mbox{Var}(x)f'(\mu)
\]
This generalizes to functions of multiple variables. Simply replace the derivate with the gradient vector and the variance with the variancecovariance matrix.
\[
\mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu)
\]
Going back to logistic regression, our random variables are the sample estimates \(\widehat{\beta}_k\), and our function is \(f(\beta_k) = e^{\beta_k}\). The maximum likelihood estimates are the values for the vector \(\boldsymbol \mu\). The covariance matrix is the covariance matrix of the estimates. Both are easily recovered in R from a glm
object.
coef(trump_model)
vcov(trump_model)
The function in which we are interested is
\[
f(\beta_k) = e^{\beta_k}
\] The deltamethod variance is
\[
f(\beta_k) = f'(\beta_k)\mbox{Var}(\widehat{\beta_k})f'(\beta_k)
\]
This turns out to be a pretty simple problem, given that
\[
\frac{d e^x}{dx} = e^x
\]
What would the variance be for the odds ratio on the noise term?
 The estimate \(\beta_{\mbox{noise}}\) was 0.0804698.
 The variance was 0.6725694
\[
\mbox{Var}\left(\mbox{exp}\left(\beta_{\mbox{noise}}\right)\right) = e^{\beta}\mbox{Var}\left(\hat{\beta}\right)e^{\beta} = 0.79
\]
A normalapproximated 95% confidence interval is found as
\[
95\% \mbox{ CI} = \beta_k \pm 1.96 \times \mbox{SE}_{\beta_k}
\]
where \(\mbox{SE}\) is the square root of the variance.
A function to return odds ratios and confidence intervals based on normal approximations.
OR_SE < function(glm_object, variable){
d < exp(coef(glm_object)[variable])
v < vcov(glm_object)[variable, variable]
output_df < as_tibble(list(Estimate = variable,
Beta = coef(glm_object)[variable],
OR = d,
`OR SE` = sqrt(d*v*d))) %>%
mutate(`Lower CI` = OR  1.96*`OR SE`,
`Upper CI` = OR + 1.96*`OR SE`)
output_df
}
Map over all estimates and reduce to tibble.
or_tibble < map(names(coef(trump_model)), function(i) OR_SE(trump_model, i)) %>%
reduce(bind_rows)
kable(or_tibble %>% var_mapping(Estimate), align = c("l", rep("c", 5)))
Estimate  Beta  OR  OR SE  Lower CI  Upper CI 

Constant  1.0513062  0.3494810  0.0890344  0.1749736  0.5239883 
Female  0.3738035  0.6881121  0.0586800  0.5730993  0.8031249 
Completed HS  0.6543854  1.9239596  0.4436975  1.0543126  2.7936067 
College < 4 Years  0.6962532  2.0062217  0.4372836  1.1491458  2.8632976 
College 4 Year Degree  0.4109402  1.5082351  0.3344887  0.8526372  2.1638330 
Advanced Degree  0.4244709  0.6541158  0.1500084  0.3600993  0.9481323 
Age  0.0150621  1.0151761  0.0025378  1.0102021  1.0201501 
0.0804698  1.0837961  0.8888248  0.6583004  2.8258927 
cis < exp(confint(trump_model)) %>%
as.data.frame %>%
rownames_to_column("Variable") %>%
var_mapping(Variable)
## Waiting for profiling to be done...
graph_data_2 < tidy(trump_model) %>%
dplyr::rename(Variable = term, Beta = estimate, SE = std.error, z = statistic, p = p.value) %>%
mutate_if(is.numeric, funs(round(.,3))) %>%
var_mapping(Variable) %>%
mutate(OR = exp(Beta)) %>%
left_join(cis) %>%
mutate(Source = "exp(confint(.))") %>%
dplyr::select(Source, Variable, OR, `Lower CI` = `2.5 %`, `Upper CI` = `97.5 %`) %>%
bind_rows(or_tibble %>%
mutate(Source = "Delta Method") %>%
dplyr::select(Source, Variable = Estimate, OR, `Lower CI`, `Upper CI`) %>%
var_mapping(Variable)) %>%
mutate(Source = factor(Source, levels = c("exp(confint(.))", "Delta Method"))) %>%
filter(Variable != "Constant")
## Joining, by = "Variable"
ggplot(data = graph_data_2, aes(x = Variable, y = OR, ymin = `Lower CI`, ymax = `Upper CI`)) +
geom_pointrange() +
geom_hline(yintercept = 1, lty = 2) +
coord_flip() +
xlab("Variable") + ylab("Odds Ratio with 95% CI") +
facet_grid(~ Source)
Which is the correct method?
 Delta method provides a standard error for the odds ratio, which can be used to create a normalapproximated (i.e. symmetric) confidence interval.
 But delta method confidence intervals can also extend into negative territory.
What does Stata do?
 Stata reports standard errors for odds ratios determined by the delta method.
 But its 95% confidence intervals around the odds ratios are based on \(\mbox{exp}(\beta \pm 1.96*\mbox{SE}_{\beta})\).
That is, the standard error is the delta method, but the confidence intervals are equal to Rs exp(confint(model_object))
!
Let’s export our data to Stata and take a look.
haven::write_dta(anes, "aneswithnoise.dta")
. logistic vote i.gender i.educ age noise
Logistic regression Number of obs = 2,368
LR chi2(7) = 147.10
Prob > chi2 = 0.0000
Log likelihood = 1565.3592 Pseudo R2 = 0.0449

vote  Odds Ratio Std. Err. z P>z [95% Conf. Interval]
+
gender 
Female  .6881121 .05868 4.38 0.000 .582199 .8132929

educ 
Completed HS  1.92396 .4436975 2.84 0.005 1.224319 3.023412
College < 4 Years  2.006222 .4372836 3.19 0.001 1.308723 3.07546
College 4 Year Degree  1.508235 .3344887 1.85 0.064 .9765486 2.329401
Advanced Degree  .6541158 .1500084 1.85 0.064 .4173001 1.025323

age  1.015176 .0025378 6.03 0.000 1.010214 1.020162
noise  1.083796 .8888249 0.10 0.922 .2172073 5.407803
_cons  .349481 .0890344 4.13 0.000 .2121143 .5758072

Comparing the standard errors from Stata to our delta method SEs, we find the two match.
But the delta method CIs do not match. In fact, Stata’s confidence intervals are close to R’s exp(confint())
results.
Understanding Stata’s output for logit models with results reported as odds ratios:
 The CIs are exponentiated \(\widehat{\beta_k} \pm 1.96 \times \mbox{SE}_{\widehat{\beta_k}}\).
 The pvalues reported for the odds ratio come from \(z = \frac{\widehat{\beta_k}}{\mbox{SE}_{\widehat{\beta_k}}}\). To see this, compare the pvalues with and without the
or
option to the.logit
command.
If the sampling distribution is asymmetric, then the confidence interval should be asymmetric. We can use the nonparametric bootstrap to get a sense of the shape of the sampling distribution. The R code applied to the ANES data is:
or_boot_func < function(form, data, indices) {
data < data[indices,]
ors < exp(coef(glm(form, data = data, family = "binomial")))
ors
}
or_boot_values < boot(data = anes, statistic = or_boot_func,
R = 1000,
form = vote ~ gender + educ + age + noise)
boot_ci < apply(or_boot_values$t, 2, function(i) {
as_tibble(list(Estimate = mean(i),
Lower = quantile(i, .025),
Upper = quantile(i, .975)))
}) %>%
reduce(bind_rows)
boot_ci %>%
mutate(Variable = c("Constant", graph_data$Variable)) %>%
filter(Variable != "Constant") %>%
ggplot(aes(x = Variable, y = `Estimate`, ymin = `Lower`, ymax = `Upper`)) +
geom_pointrange() +
geom_hline(yintercept = 1, lty = 2) +
coord_flip() +
xlab("Variable") + ylab("Bootrapped Mean Odds Ratio and 2.597.5 Pecentiles")
Delta Method Standard Errors for Predicted Probabilities
It makes most sense to stick with exp(confint(glm_object))
for confidence intervals around ORs, since this is a trivial task. But what about other quantities that are functions of model parameters, such as predicted probabilities?
Predicted probabilities are obtained from the results of a logit model by:
\[
\begin{array}
\mbox{Pr}(y=1x) &= F(x\beta) \\
& = \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)}
\end{array}
\]
What is the prediction for a 55yearold male who finished high school but did not go to college (with average noise)? First, get \(x\beta\).
estimates < coef(trump_model)
xb < estimates[1] + estimates[2]*0 + estimates[3]*1 + estimates[4]*0 + estimates[5]*0 +
estimates[6]*0 + estimates[7]*55 + estimates[8]*0
(xb < as.numeric(round(xb, 3)))
## [1] 0.431
\[
\begin{array}
\mbox{Pr}(y=1x) &= \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)} \\
& = \frac{\mbox{exp}(0.431)}{1 + \mbox{exp}(0.431)} \\
& = 0.606
\end{array}
\]
Equivalently.
round(plogis(xb),3)
## [1] 0.606
The predicted probabilty is a function of all parameter estimates, so we need to use the matrix version of the Taylor series approximation to get our SE.
\[
\mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu)
\]
First, determine the gradient vector for the \(\beta_k\).
We’ve used \(F()\) to represent the standard logistic cdf. Let \(f()\) be the standard logistic pdf.
By the chain rule, and by the fact that the first derivative of a cdf is its pdf,
\[
\begin{array}
\frac{\partial F(x'\beta)}{\partial \beta} &= \frac{\partial F((x' \beta))}{\partial x' \beta} \\
&= f(x'\beta)x
\end{array}
\]
Define the \(x\) vector for our subject of interest along with the beta vector from the model results.
x < c(1,0,1,0,0,0,55,0)
beta < coef(trump_model)
cbind(beta,x)
## beta x
## (Intercept) 1.05130623 1
## genderFemale 0.37380350 0
## educCompleted HS 0.65438538 1
## educCollege < 4 Years 0.69625321 0
## educCollege 4 Year Degree 0.41094016 0
## educAdvanced Degree 0.42447089 0
## age 0.01506207 55
## noise 0.08046982 0
And \(x'\beta\) is
xb < t(x)%*%beta
The variance is thus:
\[
\begin{array}
\mbox{Var}(p(\mbox{Vote} = \mbox{Trump})) &= \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \\
&= f(x'\beta)x'\mbox{Cov}(\widehat{\beta})xf(x'\beta)
\end{array}
\]
Calculating “by hand” in R:
dlogis(xb)%*%t(x)%*%vcov(trump_model)%*%x%*%dlogis(xb) %>%
sqrt
## [,1]
## [1,] 0.02747443
The deltamethod
function in the msm
package will also do this for you. The arguments are
 The function expressed as a formula
 A vector of means, here \(\mu = \beta\)
 The covariance matrix
deltamethod(
~ exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0)/
(1 + exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0)),
coef(trump_model), vcov(trump_model))
## [1] 0.02747443
This is the standard error for our predicted probability.
One catch: because the derivatives are found symbolically, you are limited to what stats::deriv
knows. For example, the following would be simpler to code.
deltamethod(~ plogis(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0),
coef(trump_model), vcov(trump_model))
But R throws an error, Function plogis is not in the derivatives table.
By the way, this is where Stata truly shines and why I still turn to it despite protestations that “R can do everything.” Here are all of the predicted probabilities, with standard errors, for all combinations of education and gender, for 55 yearolds.
. margins educ, over(gender) at(age = 55 noise = 0)

 Deltamethod
 Margin Std. Err. z P>z [95% Conf. Interval]
+
gender#educ 
Male#HS Incomplete  .4445065 .0514349 8.64 0.000 .3436959 .545317
Male#HS Complete  .6062301 .0274744 22.07 0.000 .5523812 .660079
Male#Some College  .6161789 .0205271 30.02 0.000 .5759466 .6564112
Male#BA/BS  .5468739 .0231694 23.60 0.000 .5014627 .5922851
Male#Higher  .343584 .0251349 13.67 0.000 .2943204 .3928475
Female#HS Incomplete  .3551 .0482039 7.37 0.000 .2606221 .4495779
Female#HS Complete  .5144184 .0278629 18.46 0.000 .4598081 .5690286
Female#Some College  .5248688 .0201584 26.04 0.000 .485359 .5643786
Female#BA/BS  .4536941 .0231244 19.62 0.000 .4083711 .4990171
Female#Higher  .2648002 .0215839 12.27 0.000 .2224966 .3071038

Type one more word, and this becomes a pretty graph.
. marginsplot
Again, the delta method gives us an approximation that may not be accurate.
Predicted probabilities are bounded by zero and one, yet a delta method CI may extend below or above these limits.
Alternatives:
 Boostrapping
 Simulations by repeated draws from \(\boldsymbol \beta \sim \mathcal{N}\left(\widehat{\boldsymbol \beta}, \mbox{Cov}(\widehat{\boldsymbol \beta})\right)\).
Both are computationally intensive and may be less attractive for biger data sets than the delta method.
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...