# Confounding & collinearity

March 11, 2014

(This article was first published on MATHEMATICS IN MEDICINE, and kindly contributed to R-bloggers)

## Introduction

In this article, I will discuss the bias introduced into the estimated coefficient of an explanatory variable by the presence of confounding factors. I will then demonstrate the effect of collinearity among variables on coefficient estimation.

## Underlying structure

Let us assume that we know the underlying relationships between the explanatory variable (E), the outcome variable (O), a confounder (C), a variable correlated with E (VE) and a variable correlated with O (VO).

Let us assume that the real underlying relation between various variables are as given below:

```
O = 2*E + 2*C + 2*VO + 0*VE + error
```

E, C and VE are pairwise correlated, each pair with correlation coefficient 0.8 (covariance 4, variance 5, as in the simulation below). All the random variables are normally distributed with mean 0: E, C and VE have variance 5, and VO has standard deviation 5. The error follows a standard normal distribution.

A directed acyclic graph (figure not shown here) depicts these relationships: C affects both E and O; E and VO affect O; VE is correlated with E and C but has no effect on O.

## Samples for simulation

```r
library(MASS)
library(dplyr)
```
```r
set.seed(100)
sam <- mvrnorm(n = 10000, mu = c(0, 0, 0),
               Sigma = matrix(c(5, 4, 4,
                                4, 5, 4,
                                4, 4, 5), 3, 3, byrow = TRUE))
df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3])
df <- mutate(df, VO = rnorm(10000, 0, 5),
             O = 2 * E + 2 * C + 2 * VO + rnorm(10000))
```
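As a quick sanity check (a sketch added here), the covariance matrix used above implies a pairwise correlation of 4/5 = 0.8 among E, C and VE:

```r
# The off-diagonal covariance of 4 with variances of 5 implies a
# pairwise correlation of 4/5 = 0.8 among E, C and VE.
Sigma <- matrix(c(5, 4, 4,
                  4, 5, 4,
                  4, 4, 5), 3, 3, byrow = TRUE)
cov2cor(Sigma)   # off-diagonal entries are 0.8
```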

## Scenarios

### Scenario 1 (Only E as covariate, typical univariate analysis)

```r
mod.E <- lm(O ~ E, data = df)
summary(mod.E)
```

```
## 
## Call:
## lm(formula = O ~ E, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40.47  -7.05  -0.05   7.01  45.04 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0069     0.1034    0.07     0.95    
## E             3.6420     0.0465   78.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 9998 degrees of freedom
## Multiple R-squared:  0.381,  Adjusted R-squared:  0.381 
## F-statistic: 6.15e+03 on 1 and 9998 DF,  p-value: <2e-16
```

```r
confint(mod.E)
```

```
##               2.5 % 97.5 %
## (Intercept) -0.1959 0.2097
## E            3.5509 3.7330
```

The model overestimates the coefficient of E: its expected value is coef(E) + coef(C) × cor(E, C) = 2 + 2 × 0.8 = 3.6, close to the estimate above. This is known as confounding bias, and it occurs when confounders are not included as covariates.
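The size of this bias follows from the omitted-variable bias formula (a sketch added here, using the simulation's known parameters):

```r
# Omitted-variable bias: with C left out of the model, the expected
# slope on E is beta_E + beta_C * cov(E, C) / var(E).
beta_E <- 2
beta_C <- 2
cov_EC <- 4   # from the Sigma used in the simulation
var_E  <- 5
biased_slope <- beta_E + beta_C * cov_EC / var_E
biased_slope   # 3.6, close to the 3.642 estimated above
```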

### Scenario 2 (E and C as covariates)

This model includes the confounder C but still omits the covariate VO.

```r
mod.EC <- lm(O ~ E + C, data = df)
summary(mod.EC)
```

```
## 
## Call:
## lm(formula = O ~ E + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.99  -6.75  -0.11   6.71  43.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0194     0.0998   -0.19     0.85    
## E             2.0293     0.0741   27.40   <2e-16 ***
## C             2.0231     0.0740   27.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.98 on 9997 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
## F-statistic: 3.68e+03 on 2 and 9997 DF,  p-value: <2e-16
```

```r
confint(mod.EC)
```

```
##              2.5 % 97.5 %
## (Intercept) -0.215 0.1762
## E            1.884 2.1745
## C            1.878 2.1681
```

With C added, the estimate of E is unbiased. Excluding VO has not affected the coefficient of E, since VO is uncorrelated with both E and C; its omission only inflates the residual variance.

### Scenario 3 (E and VO as covariates)

E and VO are not correlated. C has been excluded from the model.

```r
mod.EVO <- lm(O ~ E + VO, data = df)
summary(mod.EVO)
```

```
## 
## Call:
## lm(formula = O ~ E + VO, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.638 -1.938  0.005  1.954 12.780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03893    0.02884    1.35     0.18    
## E            3.59175    0.01295  277.27   <2e-16 ***
## VO           2.00167    0.00581  344.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.88 on 9997 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.952 
## F-statistic: 9.88e+04 on 2 and 9997 DF,  p-value: <2e-16
```

```r
confint(mod.EVO)
```

```
##                2.5 %  97.5 %
## (Intercept) -0.01761 0.09546
## E            3.56636 3.61714
## VO           1.99028 2.01307
```

The coefficient of E is again overestimated (as in Scenario 1) because C is excluded. The estimate of VO is unbiased, as VO is uncorrelated with E, C and VE.

### Scenario 4 (E and VE as covariates)

E and VE are correlated, but VE is not associated with O.

```r
mod.EVE <- lm(O ~ E + VE, data = df)
summary(mod.EVE)
```

```
## 
## Call:
## lm(formula = O ~ E + VE, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.55  -7.00  -0.02   6.93  43.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0106     0.1026     0.1     0.92    
## E             2.8584     0.0761    37.6   <2e-16 ***
## VE            0.9857     0.0761    12.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 9997 degrees of freedom
## Multiple R-squared:  0.391,  Adjusted R-squared:  0.391 
## F-statistic: 3.21e+03 on 2 and 9997 DF,  p-value: <2e-16
```

```r
confint(mod.EVE)
```

```
##               2.5 % 97.5 %
## (Intercept) -0.1905 0.2117
## E            2.7093 3.0075
## VE           0.8364 1.1350
```

Estimates of both E and VE are biased, though the coefficient of E is less biased than in Scenario 1. The coefficient of VE is biased because VE is correlated with the omitted confounder C, so it absorbs part of C's effect.
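The bias in both coefficients can be worked out (a sketch added here): when C is omitted, the bias on the included coefficients equals the true coefficient of C times the population coefficients from regressing C on E and VE.

```r
# With C omitted, the bias on (E, VE) is beta_C times the population
# regression coefficients of C on E and VE.
Sigma_EVE <- matrix(c(5, 4,
                      4, 5), 2, 2)   # covariance matrix of (E, VE)
cov_C  <- c(4, 4)                    # cov(C, E) and cov(C, VE)
beta_C <- 2
bias <- beta_C * solve(Sigma_EVE, cov_C)
2 + bias[1]   # expected slope on E,  about 2.89 (estimated: 2.8584)
0 + bias[2]   # expected slope on VE, about 0.89 (estimated: 0.9857)
```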

### Scenario 5 (E, VE, VO, C as covariates) (Complete model)

```r
mod.comp <- lm(O ~ E + C + VO + VE, data = df)
summary(mod.comp)
```

```
## 
## Call:
## lm(formula = O ~ E + C + VO + VE, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.434 -0.677 -0.014  0.678  3.616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01297    0.01000    1.30     0.19    
## E            1.98794    0.00826  240.62   <2e-16 ***
## C            1.99993    0.00830  240.81   <2e-16 ***
## VO           2.00033    0.00202  992.51   <2e-16 ***
## VE           0.01214    0.00831    1.46     0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 9995 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 4.29e+05 on 4 and 9995 DF,  p-value: <2e-16
```

```r
confint(mod.comp)
```

```
##                 2.5 %  97.5 %
## (Intercept) -0.006635 0.03258
## E            1.971745 2.00413
## C            1.983651 2.01621
## VO           1.996381 2.00428
## VE          -0.004159 0.02844
```

The coefficients of all covariates are estimated without bias.

### Scenario 6 (E, VE, C as covariates)

```r
mod.EVEC <- lm(O ~ E + VE + C, data = df)
summary(mod.EVEC)
```

```
## 
## Call:
## lm(formula = O ~ E + VE + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.18  -6.74  -0.08   6.71  43.21 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0185     0.0998   -0.19     0.85    
## E             1.9892     0.0824   24.13   <2e-16 ***
## VE            0.0920     0.0830    1.11     0.27    
## C             1.9817     0.0829   23.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.98 on 9996 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
## F-statistic: 2.45e+03 on 3 and 9996 DF,  p-value: <2e-16
```

```r
confint(mod.EVEC)
```

```
##                2.5 % 97.5 %
## (Intercept) -0.21411 0.1771
## E            1.82758 2.1507
## VE          -0.07061 0.2546
## C            1.81930 2.1441
```

The addition of C has removed the bias from the coefficients of E and VE (compare Scenario 4).

### Scenario 7 (E, VO and C as covariates)

```r
mod.EVOC <- lm(O ~ E + VO + C, data = df)
summary(mod.EVOC)
```

```
## 
## Call:
## lm(formula = O ~ E + VO + C, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.427 -0.676 -0.014  0.680  3.641 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01285    0.01000    1.29      0.2    
## E            1.99323    0.00742  268.50   <2e-16 ***
## VO           2.00036    0.00202  992.52   <2e-16 ***
## C            2.00539    0.00741  270.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 9996 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 5.72e+05 on 3 and 9996 DF,  p-value: <2e-16
```

```r
confint(mod.EVOC)
```

```
##                 2.5 %  97.5 %
## (Intercept) -0.006751 0.03246
## E            1.978682 2.00779
## VO           1.996410 2.00431
## C            1.990857 2.01993
```

Performance of this model is almost the same as in Scenario 5.

## Performance of different scenarios

| Covariate (true coef) | Complete (E, C, VO, VE) | E, C, VE | E, C, VO | E, C | E, VE | E, VO | E only |
|---|---|---|---|---|---|---|---|
| E (2)  | 1.9879, 0.0083 | 1.9892, 0.0824 | 1.9932, 0.0074 | 2.0293, 0.0741 | 2.8584, 0.0761 | 3.5918, 0.0130 | 3.6420, 0.0465 |
| C (2)  | 1.9999, 0.0083 | 1.9817, 0.0829 | 2.0054, 0.0074 | 2.0231, 0.0740 | - | - | - |
| VE (0) | 0.0121, 0.0083 | 0.0920, 0.0830 | - | - | 0.9857, 0.0761 | - | - |
| VO (2) | 2.0003, 0.0020 | - | 2.0004, 0.0020 | - | - | 2.0017, 0.0058 | - |

Each cell shows the estimated coefficient and its standard error; the true coefficient is given in parentheses after the covariate name.

## Conclusions

Models that include C yield an unbiased estimate of the coefficient of E. Models with correlated covariates (E, C and VE together) have larger standard errors due to near-collinearity, as the table above shows.
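The inflation of the standard errors can be quantified with the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others (a sketch added here, using the known correlation of 0.8):

```r
# Variance inflation factor for E under the simulation's correlation structure.
r <- 0.8                        # pairwise correlation of E, C and VE
vif_two <- 1 / (1 - r^2)        # E alongside one correlated covariate (C)
# R^2 of E regressed on both C and VE (equicorrelated case):
r2_three <- 2 * r^2 / (1 + r)
vif_three <- 1 / (1 - r2_three)
round(c(vif_two, vif_three), 2)   # about 2.78 and 3.46
```

So the sampling variance of the coefficient of E is inflated roughly 2.8-fold with one correlated covariate and 3.5-fold with two, relative to uncorrelated predictors.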
