Confounding & collinearity

[This article was first published on MATHEMATICS IN MEDICINE, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Confounding and collinearity

Confounding and collinearity

Introduction

In this blog article, I will discuss about the bias introduced in estimation of coefficient of a given explanatory variable due to the presence of confounding factors. After that, I will try to demonstrate about the effect of variable collinearity on estimation of coefficient.

Underlying structure

Let us assume that we know about the underlying relationship between the explanatory variable (E), outcome variable (O), confounder ( C ), variable correlated with E (VE) and variable correlated with O (VO).

Let us assume that the real underlying relation between various variables are as given below:

O = 2.E + 2.C + 2.VO + 0.VE + error

E, C and VE are correlated with each other with correlation coef of 0.7 each.

All the random variables are normally distributed, with mean 0 and sd 5.  error is distributed as standard normal distribution.

The directed acyclic graph depicts the relationship between the variables.

Underlying structure

Samples for simulation

library(MASS)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(100)
sam <- mvrnorm(n = 10000, mu = c(0, 0, 0), Sigma = matrix(c(5, 4, 4, 4, 5, 4, 
    4, 4, 5), 3, 3, byrow = T))
df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3])
df <- mutate(df, VO = rnorm(10000, 0, 5), O = 2 * E + 2 * C + 2 * VO + rnorm(10000))

Scenarios

Scenario 1 (Only E as covariate, typical univariate analysis)

mod.E <- lm(O ~ E, data = df)
summary(mod.E)

## 
## Call:
## lm(formula = O ~ E, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40.47  -7.05  -0.05   7.01  45.04 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0069     0.1034    0.07     0.95    
## E             3.6420     0.0465   78.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 9998 degrees of freedom
## Multiple R-squared:  0.381,  Adjusted R-squared:  0.381 
## F-statistic: 6.15e+03 on 1 and 9998 DF,  p-value: <2e-16

confint(mod.E)

##               2.5 % 97.5 %
## (Intercept) -0.1959 0.2097
## E            3.5509 3.7330

The model overestimates coefficient of E, its value is the sum of coef of E and C. This type of bias is known as bias due to confounding and occurs when confounders are not included as co-variates.

Scenario 2 (E and C as covariates)

This model is incorrect as it does not include the other covariate, VO.

mod.EC <- lm(O ~ E + C, data = df)
summary(mod.EC)

## 
## Call:
## lm(formula = O ~ E + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.99  -6.75  -0.11   6.71  43.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0194     0.0998   -0.19     0.85    
## E             2.0293     0.0741   27.40   <2e-16 ***
## C             2.0231     0.0740   27.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.98 on 9997 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
## F-statistic: 3.68e+03 on 2 and 9997 DF,  p-value: <2e-16

confint(mod.EC)

##              2.5 % 97.5 %
## (Intercept) -0.215 0.1762
## E            1.884 2.1745
## C            1.878 2.1681

Due to addition of C, the estimate of E is unbiased. Exclusion of VO has not influenced the coefficient of E.

Scenario 3 (E and VO as covariates)

E and VO are not correlated. C has been excluded from the model.

mod.EVO <- lm(O ~ E + VO, data = df)
summary(mod.EVO)

## 
## Call:
## lm(formula = O ~ E + VO, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.638 -1.938  0.005  1.954 12.780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03893    0.02884    1.35     0.18    
## E            3.59175    0.01295  277.27   <2e-16 ***
## VO           2.00167    0.00581  344.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.88 on 9997 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.952 
## F-statistic: 9.88e+04 on 2 and 9997 DF,  p-value: <2e-16

confint(mod.EVO)

##                2.5 %  97.5 %
## (Intercept) -0.01761 0.09546
## E            3.56636 3.61714
## VO           1.99028 2.01307

There is again overestimation of coefficient of E (as in scenario 1), due to exclusion of C. Estimate of VO is unbiased as it has no correlation with either E, VO or VE.

Scenario 4 (E and VE as covariates)

E and VE are correlated, but VE is not associated with O.

mod.EVE <- lm(O ~ E + VE, data = df)
summary(mod.EVE)

## 
## Call:
## lm(formula = O ~ E + VE, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.55  -7.00  -0.02   6.93  43.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0106     0.1026     0.1     0.92    
## E             2.8584     0.0761    37.6   <2e-16 ***
## VE            0.9857     0.0761    12.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 9997 degrees of freedom
## Multiple R-squared:  0.391,  Adjusted R-squared:  0.391 
## F-statistic: 3.21e+03 on 2 and 9997 DF,  p-value: <2e-16

confint(mod.EVE)

##               2.5 % 97.5 %
## (Intercept) -0.1905 0.2117
## E            2.7093 3.0075
## VE           0.8364 1.1350

Estimates of both E and VE are biased. But coefficient of E is less biased than scenario 1. Reason for biasness of coefficient of VE is unknown to me.

Scenario 5 (E, VE, VO, C as covariates) (Complete model)

mod.comp <- lm(O ~ E + C + VO + VE, data = df)
summary(mod.comp)

## 
## Call:
## lm(formula = O ~ E + C + VO + VE, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.434 -0.677 -0.014  0.678  3.616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01297    0.01000    1.30     0.19    
## E            1.98794    0.00826  240.62   <2e-16 ***
## C            1.99993    0.00830  240.81   <2e-16 ***
## VO           2.00033    0.00202  992.51   <2e-16 ***
## VE           0.01214    0.00831    1.46     0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 9995 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 4.29e+05 on 4 and 9995 DF,  p-value: <2e-16

confint(mod.comp)

##                 2.5 %  97.5 %
## (Intercept) -0.006635 0.03258
## E            1.971745 2.00413
## C            1.983651 2.01621
## VO           1.996381 2.00428
## VE          -0.004159 0.02844

Unbiased coefficients of all the covariates.

Scenario 6 (E, VE, C as covariates)

mod.EVEC <- lm(O ~ E + VE + C, data = df)
summary(mod.EVEC)

## 
## Call:
## lm(formula = O ~ E + VE + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.18  -6.74  -0.08   6.71  43.21 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0185     0.0998   -0.19     0.85    
## E             1.9892     0.0824   24.13   <2e-16 ***
## VE            0.0920     0.0830    1.11     0.27    
## C             1.9817     0.0829   23.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.98 on 9996 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
## F-statistic: 2.45e+03 on 3 and 9996 DF,  p-value: <2e-16

confint(mod.EVEC)

##                2.5 % 97.5 %
## (Intercept) -0.21411 0.1771
## E            1.82758 2.1507
## VE          -0.07061 0.2546
## C            1.81930 2.1441

Addition of C has removed the biasedness from the coefficients of E and VE (see Scenario 4).

Scenario 7 (E, VO and C as covariates)

mod.EVOC <- lm(O ~ E + VO + C, data = df)
summary(mod.EVOC)

## 
## Call:
## lm(formula = O ~ E + VO + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.427 -0.676 -0.014  0.680  3.641 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01285    0.01000    1.29      0.2    
## E            1.99323    0.00742  268.50   <2e-16 ***
## VO           2.00036    0.00202  992.52   <2e-16 ***
## C            2.00539    0.00741  270.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 9996 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 5.72e+05 on 3 and 9996 DF,  p-value: <2e-16

confint(mod.EVOC)

##                 2.5 %  97.5 %
## (Intercept) -0.006751 0.03246
## E            1.978682 2.00779
## VO           1.996410 2.00431
## C            1.990857 2.01993

Performance of this model is almost the same as Scenario 5

Performance of different scenarios

ModelsComplete (E, C, VO, VE)E, C, VEE, C, VOE, CE, VEE, VOE only
E (2)1.9879, 0.00831.9892, 0.08241.9932, 0.00742.0293, 0.07412.8584, 0.07613.5918, 0.0133.642, 0.0465
C (2)1.9999, 0.00831.9817, 0.08292.0054, 0.00742.0231, 0.074---
VE (0)0.0121, 0.00830.092, 0.083--0.9857, 0.0761--
VO (2)2.0003, 0.002-2.0004, 0.002--2.0017, 0.0058-

E (2): means covariate E and its actual value.

–,–: coefficient, standard error.

Conclusions

Models with C, will have unbiased estimate of E. Models with correlated variables (E, C, VE) has larger SE due to near collinearity.

To leave a comment for the author, please follow the link and comment on their blog: MATHEMATICS IN MEDICINE.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)