Omitted Variable Effects in Logistic Regression


Introduction

I would like to illustrate a way in which omitted variables interfere with inference (or coefficient estimation) in logistic regression. These effects differ from what is seen in linear regression, and possibly from some expectations or intuitions.

Our Example Data

Let’s start with a data example in R.

# example variable frame
x_frame <- data.frame(
  x = c(-2, 1),
  wt = 1
)

x_frame is a data.frame with a single variable called x, and an example weight or row weight called wt.

# omitted variable frame
omitted_frame <- data.frame(
  omitted = c(-1, 1),
  wt = 1
)

omitted_frame is a data.frame with a single variable called omitted, and an example weight called wt.

For our first example we take the cross-product of these data frames to get every combination of variable values, and their relative proportions (or weights) in the joined data frame.

# combine frames by cross product, and get new relative data weights
d <- merge(
  x_frame, 
  omitted_frame, 
  by = c())
d$wt <- d$wt.x * d$wt.y
d$wt <- d$wt / sum(d$wt)
d$wt.x <- NULL
d$wt.y <- NULL
# our data
knitr::kable(d)
|  x | omitted |   wt |
|---:|--------:|-----:|
| -2 |      -1 | 0.25 |
|  1 |      -1 | 0.25 |
| -2 |       1 | 0.25 |
|  1 |       1 | 0.25 |

The idea is: d specifies what proportion of an arbitrarily large data set (with repeated rows) has each possible combination of values. For us, d is not a sample; it is an entire population. This is just a long-winded way of explaining why we have row weights and why we are not concerned with observation counts, uncertainty bars, or significances/p-values for this example.
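
As a quick check (added here for illustration, not part of the original write-up), the row weights do behave like population proportions: they are non-negative and sum to one.

# confirm the row weights form a set of proportions
stopifnot(all(d$wt >= 0))
sum(d$wt)
## [1] 1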

Let’s define a few common constants: Euler's constant, pi, and e.

# 0.5772
(Euler_constant <- -digamma(1))
## [1] 0.5772157
# 3.1415
pi
## [1] 3.141593
# 2.7182
(e <- exp(1))
## [1] 2.718282

Please remember these constants in this order for later.

# show constants in an order we will see again
c(Euler_constant, pi, e)
## [1] 0.5772157 3.1415927 2.7182818

The Linear Case

For our example we call our outcome (or dependent variable) y_linear. We say that it is exactly the following linear combination of a constant plus the variables x and omitted.

# assign an example outcome or dependent variable
d$y_linear <- Euler_constant + pi * d$x + e * d$omitted
# our data with outcome
knitr::kable(d)
|  x | omitted |   wt |  y_linear |
|---:|--------:|-----:|----------:|
| -2 |      -1 | 0.25 | -8.424252 |
|  1 |      -1 | 0.25 |  1.000527 |
| -2 |       1 | 0.25 | -2.987688 |
|  1 |       1 | 0.25 |  6.437090 |

As we expect, linear regression can recover the constants of the linear equation from data.

# inferring coefficients from data
lm(
  y_linear ~ x + omitted, 
  data = d, 
  weights = d$wt, 
  )$coef
## (Intercept)           x     omitted 
##   0.5772157   3.1415927   2.7182818

Notice the recovered coefficients are the three constants we specified.

This is nice, and as expected.

Omitting a Variable

Now we ask: what happens if we omit from the model the variable named “omitted”? This is a central problem in modeling. In many real-world settings we are unlikely to know, or be able to measure, all possible explanatory variables. We are often omitting variables simply because we don’t know about them or don’t have access to their values!

For this linear regression model, we do not expect omitted variable bias as the variables x and omitted, by design, are fully statistically independent.

We can confirm omitted is well behaved, in that it is mean zero and has zero correlation with x under the specified data distribution.

# mean 0 check
sum(d$omitted * d$wt) / sum(d$wt)
## [1] 0
# no correlation check
knitr::kable(
  cov.wt(
    d[, c('x', 'omitted')],
    wt = d$wt
  )$cov
)
|         |  x |  omitted |
|:--------|---:|---------:|
| x       |  3 | 0.000000 |
| omitted |  0 | 1.333333 |

All of this worrying pays off. If we fit a model with the omitted variable left out, we still get the original estimates of the x-coefficient and the intercept.

# inferring coefficients, with omitted variable
lm(
  y_linear ~ x, 
  data = d,
  weights = d$wt, 
  )$coef
## (Intercept)           x 
##   0.5772157   3.1415927

The Logistic Case

Let’s convert this problem to modeling the probability distribution of a new outcome variable, called y_observed, that takes on the values TRUE and FALSE. We use the encoding strategy from “replicate linear models” (which can simplify steps in many data science projects). How this example arises isn’t critical; we want to investigate the properties of the resulting data. So let’s take a moment and derive our data.

# converting "links" to probabilities
sigmoid <- function(x) {1 / (1 + exp(-x))}

d$y_probability <- sigmoid(d$y_linear)
# encoding effect as a probability model over a binary outcome
# method used for model replication
# ref: https://win-vector.com/2019/07/03/replicating-a-linear-model/
d_plus <- d
d_plus$y_observed <- TRUE
d_plus$wt <- d_plus$wt * d_plus$y_probability
d_minus <- d
d_minus$y_observed <- FALSE
d_minus$wt <- d_minus$wt * (1 - d_minus$y_probability)
d_logistic <- rbind(d_plus, d_minus)
d_logistic$wt <- d_logistic$wt / sum(d_logistic$wt)
# our data with binary outcome
knitr::kable(d_logistic)
|  x | omitted |        wt |  y_linear | y_probability | y_observed |
|---:|--------:|----------:|----------:|--------------:|:-----------|
| -2 |      -1 | 0.0000549 | -8.424252 |     0.0002194 | TRUE       |
|  1 |      -1 | 0.1827905 |  1.000527 |     0.7311621 | TRUE       |
| -2 |       1 | 0.0119963 | -2.987688 |     0.0479852 | TRUE       |
|  1 |       1 | 0.2496004 |  6.437090 |     0.9984015 | TRUE       |
| -2 |      -1 | 0.2499451 | -8.424252 |     0.0002194 | FALSE      |
|  1 |      -1 | 0.0672095 |  1.000527 |     0.7311621 | FALSE      |
| -2 |       1 | 0.2380037 | -2.987688 |     0.0479852 | FALSE      |
|  1 |       1 | 0.0003996 |  6.437090 |     0.9984015 | FALSE      |

The point is: this data has our original coefficients encoded in it as the coefficients of the generative process for y_observed. We confirm this by fitting a logistic regression.

# infer coefficients from binary outcome
# suppressWarnings() only to avoid "fractional weights message"
suppressWarnings(
  glm(
    y_observed ~ x + omitted, 
    data = d_logistic, 
    weights = d_logistic$wt, 
    family = binomial(link = "logit")
    )$coef
)
## (Intercept)           x     omitted 
##   0.5772151   3.1415914   2.7182800

Notice we recover the same coefficients as before. We could use these inferred coefficients to answer questions about how the probabilities of outcomes vary with changes in the variables.
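
For instance (an illustrative computation added here), running the linear combination for the row x = 1, omitted = 1 through sigmoid() reproduces the tabulated probability; the coefficients recovered by glm() give essentially the same value.

# predicted probability at x = 1, omitted = 1 from the generating coefficients
sigmoid(Euler_constant + pi * 1 + e * 1)
## [1] 0.9984015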

Omitting a Variable, Again

Now, let’s try to (and fail to) repeat our omitted variable experiment.

First we confirm omitted is mean zero and uncorrelated with our variable x, even in the new data set and new row weight distribution.

# check mean zero
sum(d_logistic$omitted * d_logistic$wt) / sum(d_logistic$wt)
## [1] 1.50162e-17
# check uncorrelated
knitr::kable(
  cov.wt(
    d_logistic[, c('x', 'omitted')],
    wt = d_logistic$wt
  )$cov
)
|         |        x |  omitted |
|:--------|---------:|---------:|
| x       | 2.882739 | 0.000000 |
| omitted | 0.000000 | 1.281217 |

We pass the check. But, as we will see, this doesn’t guarantee non-entangled behavior for a logistic regression.

# infer coefficients from binary outcome, with omitted variable
# suppressWarnings() only to avoid "fractional weights message"
suppressWarnings(
  glm(
    y_observed ~ x, 
    data = d_logistic, 
    weights = d_logistic$wt, 
    family = binomial(link = "logit")
    )$coef
)
## (Intercept)           x 
##  0.00337503  1.85221234

Notice the new x coefficient is nowhere near the value we saw before.
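
What the reduced model does instead (a check added here for illustration) is match the probability of y_observed averaged over the values of omitted at each value of x, rather than recover the original x coefficient.

# average probability of y_observed == TRUE at each value of x
# (averaging over omitted; roughly 0.024 at x = -2 and 0.865 at x = 1)
aggregate(y_probability ~ x, data = d, FUN = mean)
# the reduced model's predictions at x = -2 and x = 1 reproduce these averages
# (up to the glm fit's convergence tolerance)
sigmoid(0.00337503 + 1.85221234 * c(-2, 1))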

Explaining The Result

A stern way of interpreting our logistic experiment is:

For a logistic regression model: an omitted explanatory variable can bias other coefficient estimates. This is true even when the omitted explanatory variable is mean zero, symmetric, and uncorrelated with the other model explanatory variables. This differs from the situation for linear models.

Another way of interpreting our logistic experiment is:

For a logistic regression model: the correct inference for a given explanatory variable coefficient often depends on what other explanatory variables are present in the model.

That is: we didn’t get a wrong inference. We just got a different one, as we are inferring in a different situation. The fallacy was thinking a change in variable value has the same effect no matter what the values of other explanatory variables are. This is not the case for logistic regression, due to the non-linear shape of the logistic curve.

Diagrammatically what happened is the following.

[Figure: the logistic (sigmoid) curve, with the four (x, omitted) combinations from the data plotted as points; points sharing the same value of x are joined by line segments.]

In the above diagram we portray sigmoid(), the logistic curve. The horizontal axis is the linear or “link” space for predictions, and the vertical axis is the probability or response space for predictions. The curve is the transform that the logistic regression’s linear (link) prediction is run through to get probabilities or responses. On this curve we have added, as dots, the four combinations of values for x and omitted in our data set. Dots joined by a line differ only in the value of omitted; that is, each joined pair shares the same value of x.
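
The original figure is not reproduced here; the following is a minimal base-R sketch of a similar diagram (my own re-creation, not the post's plotting code).

# sketch of the diagram: the sigmoid curve in link space, the four
# (x, omitted) combinations as points, pairs sharing the same x joined by segments
curve(sigmoid, from = -10, to = 10,
      xlab = "link (linear prediction)", ylab = "probability")
points(d$y_linear, d$y_probability, pch = 19)
segments(
  x0 = d$y_linear[d$omitted == -1], y0 = d$y_probability[d$omitted == -1],
  x1 = d$y_linear[d$omitted ==  1], y1 = d$y_probability[d$omitted ==  1])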

Without the extra variable omitted we can’t tell the joined pairs apart, and we are forced to use compromise effect estimates. However, the amount of interference differs for each value of x. For x = -2, the probability is already nearly determined (close to zero), so changing omitted moves it very little. For x = 1, things are less determined, and omitted can have a substantial effect. How much effect omitted has on the probability depends on the value of x, which obscures results much like a statistical interaction would.
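
We can make the differing amounts of interference concrete (a computation added here, using the probabilities already tabulated above): moving omitted from -1 to +1 changes the probability by only about 0.048 when x = -2, but by about 0.267 when x = 1.

# change in predicted probability when omitted moves from -1 to +1, at each x
p_at_minus1 <- sigmoid(Euler_constant + pi * c(-2, 1) + e * (-1))
p_at_plus1  <- sigmoid(Euler_constant + pi * c(-2, 1) + e * ( 1))
p_at_plus1 - p_at_minus1  # roughly 0.048 at x = -2, versus roughly 0.267 at x = 1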

This is a common observation in logistic regression: you can’t tell if a variable and coefficient have a large or small effect without knowing the specific values of the complementary explanatory variables.

My Interpretation

You get different estimates for a variable depending on what other variables are present in a logistic regression model. This looks a lot like an interaction, and it leads to effects similar to omitted variable bias. This happens more often than in linear regression models. It is also interpretable as: different column-views of the data have fundamentally different models.

A possible source of surprise is this: appealing to assumed independence is a common way of assuring one is avoiding issues such as Simpson’s paradox in linear regression modeling. Thus it is possible an “independence implies non-interference” intuition is part of some modelers’ toolboxes.

In conclusion: care has to be taken when pulling inferred logistic coefficients out of their surrounding context. The product of a logistic regression coefficient and the matching variable value is not directly an effect size outside of that context; this differs from the linear regression case.

Discussion Points

What are your opinions/experience? Some questions I feel are relevant include:

  • What is the correct value of the x-coefficient in the logistic regressions? 3.1415, 1.8522, both, or neither?
  • Do you feel these effects are intrinsic to the modeling process, or introduced by attempted interpretation?
  • Is the above important in your uses of inferred logistic regression coefficients? Is this something you guard against in your work?

R source for this article can be found here.
