**MATHEMATICS IN MEDICINE**, and kindly contributed to R-bloggers)

# Confounding and collinearity

## Introduction

In this blog article, I will discuss about the bias introduced in estimation of coefficient of a given explanatory variable due to the presence of confounding factors. After that, I will try to demonstrate about the effect of variable collinearity on estimation of coefficient.

## Underlying structure

Let us assume that we know about the underlying relationship between the explanatory variable (E), outcome variable (O), confounder ( C ), variable correlated with E (VE) and variable correlated with O (VO).

Let us assume that the real underlying relation between various variables are as given below:

`O = 2.E + 2.C + 2.VO + 0.VE + error`

E, C and VE are correlated with each other with correlation coef of 0.7 each.

All the random variables are normally distributed, with mean 0 and sd 5. error is distributed as standard normal distribution.

The directed acyclic graph depicts the relationship between the variables.

## Samples for simulation

`library(MASS)`

library(dplyr)

`## `

## Attaching package: 'dplyr'

##

## The following objects are masked from 'package:Hmisc':

##

## src, summarize

##

## The following objects are masked from 'package:lubridate':

##

## intersect, setdiff, union

##

## The following object is masked from 'package:MASS':

##

## select

##

## The following objects are masked from 'package:stats':

##

## filter, lag

##

## The following objects are masked from 'package:base':

##

## intersect, setdiff, setequal, union

`set.seed(100)`

sam <- mvrnorm(n = 10000, mu = c(0, 0, 0), Sigma = matrix(c(5, 4, 4, 4, 5, 4,

4, 4, 5), 3, 3, byrow = T))

df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3])

df <- mutate(df, VO = rnorm(10000, 0, 5), O = 2 * E + 2 * C + 2 * VO + rnorm(10000))

## Scenarios

### Scenario 1 (Only E as covariate, typical univariate analysis)

`mod.E <- lm(O ~ E, data = df)`

summary(mod.E)

`## `

## Call:

## lm(formula = O ~ E, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -40.47 -7.05 -0.05 7.01 45.04

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.0069 0.1034 0.07 0.95

## E 3.6420 0.0465 78.40 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 10.3 on 9998 degrees of freedom

## Multiple R-squared: 0.381, Adjusted R-squared: 0.381

## F-statistic: 6.15e+03 on 1 and 9998 DF, p-value: <2e-16

`confint(mod.E)`

`## 2.5 % 97.5 %`

## (Intercept) -0.1959 0.2097

## E 3.5509 3.7330

The model overestimates coefficient of E, its value is the sum of coef of E and C. This type of bias is known as **bias due to confounding** and occurs when confounders are not included as co-variates.

### Scenario 2 (E and C as covariates)

This model is incorrect as it does not include the other covariate, VO.

`mod.EC <- lm(O ~ E + C, data = df)`

summary(mod.EC)

`## `

## Call:

## lm(formula = O ~ E + C, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -35.99 -6.75 -0.11 6.71 43.32

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -0.0194 0.0998 -0.19 0.85

## E 2.0293 0.0741 27.40 <2e-16 ***

## C 2.0231 0.0740 27.35 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 9.98 on 9997 degrees of freedom

## Multiple R-squared: 0.424, Adjusted R-squared: 0.424

## F-statistic: 3.68e+03 on 2 and 9997 DF, p-value: <2e-16

`confint(mod.EC)`

`## 2.5 % 97.5 %`

## (Intercept) -0.215 0.1762

## E 1.884 2.1745

## C 1.878 2.1681

Due to addition of C, the estimate of E is unbiased. Exclusion of VO has not influenced the coefficient of E.

### Scenario 3 (E and VO as covariates)

E and VO are not correlated. C has been excluded from the model.

`mod.EVO <- lm(O ~ E + VO, data = df)`

summary(mod.EVO)

`## `

## Call:

## lm(formula = O ~ E + VO, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -9.638 -1.938 0.005 1.954 12.780

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.03893 0.02884 1.35 0.18

## E 3.59175 0.01295 277.27 <2e-16 ***

## VO 2.00167 0.00581 344.39 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 2.88 on 9997 degrees of freedom

## Multiple R-squared: 0.952, Adjusted R-squared: 0.952

## F-statistic: 9.88e+04 on 2 and 9997 DF, p-value: <2e-16

`confint(mod.EVO)`

`## 2.5 % 97.5 %`

## (Intercept) -0.01761 0.09546

## E 3.56636 3.61714

## VO 1.99028 2.01307

There is again overestimation of coefficient of E (as in scenario 1), due to exclusion of C. Estimate of VO is unbiased as it has no correlation with either E, VO or VE.

### Scenario 4 (E and VE as covariates)

E and VE are correlated, but VE is not associated with O.

`mod.EVE <- lm(O ~ E + VE, data = df)`

summary(mod.EVE)

`## `

## Call:

## lm(formula = O ~ E + VE, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -41.55 -7.00 -0.02 6.93 43.46

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.0106 0.1026 0.1 0.92

## E 2.8584 0.0761 37.6 <2e-16 ***

## VE 0.9857 0.0761 12.9 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 10.3 on 9997 degrees of freedom

## Multiple R-squared: 0.391, Adjusted R-squared: 0.391

## F-statistic: 3.21e+03 on 2 and 9997 DF, p-value: <2e-16

`confint(mod.EVE)`

`## 2.5 % 97.5 %`

## (Intercept) -0.1905 0.2117

## E 2.7093 3.0075

## VE 0.8364 1.1350

Estimates of both E and VE are biased. But coefficient of E is less biased than scenario 1. Reason for biasness of coefficient of VE is unknown to me.

### Scenario 5 (E, VE, VO, C as covariates) (Complete model)

`mod.comp <- lm(O ~ E + C + VO + VE, data = df)`

summary(mod.comp)

`## `

## Call:

## lm(formula = O ~ E + C + VO + VE, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -3.434 -0.677 -0.014 0.678 3.616

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.01297 0.01000 1.30 0.19

## E 1.98794 0.00826 240.62 <2e-16 ***

## C 1.99993 0.00830 240.81 <2e-16 ***

## VO 2.00033 0.00202 992.51 <2e-16 ***

## VE 0.01214 0.00831 1.46 0.14

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 1 on 9995 degrees of freedom

## Multiple R-squared: 0.994, Adjusted R-squared: 0.994

## F-statistic: 4.29e+05 on 4 and 9995 DF, p-value: <2e-16

`confint(mod.comp)`

`## 2.5 % 97.5 %`

## (Intercept) -0.006635 0.03258

## E 1.971745 2.00413

## C 1.983651 2.01621

## VO 1.996381 2.00428

## VE -0.004159 0.02844

Unbiased coefficients of all the covariates.

### Scenario 6 (E, VE, C as covariates)

`mod.EVEC <- lm(O ~ E + VE + C, data = df)`

summary(mod.EVEC)

`## `

## Call:

## lm(formula = O ~ E + VE + C, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -36.18 -6.74 -0.08 6.71 43.21

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -0.0185 0.0998 -0.19 0.85

## E 1.9892 0.0824 24.13 <2e-16 ***

## VE 0.0920 0.0830 1.11 0.27

## C 1.9817 0.0829 23.92 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 9.98 on 9996 degrees of freedom

## Multiple R-squared: 0.424, Adjusted R-squared: 0.424

## F-statistic: 2.45e+03 on 3 and 9996 DF, p-value: <2e-16

`confint(mod.EVEC)`

`## 2.5 % 97.5 %`

## (Intercept) -0.21411 0.1771

## E 1.82758 2.1507

## VE -0.07061 0.2546

## C 1.81930 2.1441

Addition of C has removed the biasedness from the coefficients of E and VE (see Scenario 4).

### Scenario 7 (E, VO and C as covariates)

`mod.EVOC <- lm(O ~ E + VO + C, data = df)`

summary(mod.EVOC)

`## `

## Call:

## lm(formula = O ~ E + VO + C, data = df)

##

## Residuals:

## Min 1Q Median 3Q Max

## -3.427 -0.676 -0.014 0.680 3.641

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.01285 0.01000 1.29 0.2

## E 1.99323 0.00742 268.50 <2e-16 ***

## VO 2.00036 0.00202 992.52 <2e-16 ***

## C 2.00539 0.00741 270.45 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 1 on 9996 degrees of freedom

## Multiple R-squared: 0.994, Adjusted R-squared: 0.994

## F-statistic: 5.72e+05 on 3 and 9996 DF, p-value: <2e-16

`confint(mod.EVOC)`

`## 2.5 % 97.5 %`

## (Intercept) -0.006751 0.03246

## E 1.978682 2.00779

## VO 1.996410 2.00431

## C 1.990857 2.01993

Performance of this model is almost the same as Scenario 5

## Performance of different scenarios

Models | Complete (E, C, VO, VE) | E, C, VE | E, C, VO | E, C | E, VE | E, VO | E only |
---|---|---|---|---|---|---|---|

E (2) | 1.9879, 0.0083 | 1.9892, 0.0824 | 1.9932, 0.0074 | 2.0293, 0.0741 | 2.8584, 0.0761 | 3.5918, 0.013 | 3.642, 0.0465 |

C (2) | 1.9999, 0.0083 | 1.9817, 0.0829 | 2.0054, 0.0074 | 2.0231, 0.074 | - | - | - |

VE (0) | 0.0121, 0.0083 | 0.092, 0.083 | - | - | 0.9857, 0.0761 | - | - |

VO (2) | 2.0003, 0.002 | - | 2.0004, 0.002 | - | - | 2.0017, 0.0058 | - |

E (2): means covariate E and its actual value.

–,–: coefficient, standard error.

## Conclusions

Models with C, will have unbiased estimate of E. Models with correlated variables (E, C, VE) has larger SE due to near collinearity.

**leave a comment**for the author, please follow the link and comment on their blog:

**MATHEMATICS IN MEDICINE**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...