**jacobsimmering.com**, and kindly contributed to R-bloggers)

# Stop using bivariate correlations for variable selection

Something I've never understood is the widespread calculation and reporting of

univariate and bivariate statistics in applied work, especially when it comes

to model selection. Bivariate statistics are, at best, useless for

multi-variate model selection and, at worst, harmful. Since nearly all

questions of interest are highly dimensional, the use of bivariate statistics

is worrisome.

What typically happens is a researcher sits down with their statistical software

of choice and they compute a correlation between their response variable and

their collection of possible predictors. From here, they toss out potential

predictors that either have low correlation or for which the correlation is not

significant.

The concern here is that it is possible for the correlation between the marginal

distributions of the response and a predictor to be almost zero or

non-significant and for that predictor to be an important element in the data

generating pathway.

Consider the following simulation that illustrates the difference between the

marginal relationship between the response and a predictor and the true multi-

variable relationship. The function `genY`

generates the response using user

provided values for \( \beta \) and adds in a little white noise.

```
genY <- function(betas, X, stdPercent = 0.25) {
modelMatrix <- cbind(1, X)
yhat <- modelMatrix %*% betas
y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))
y
}
```

The function `cor`

returns the correlation between two values and works with

a vector and a matrix argument but `cor.test`

requires two vectors. For this

I wrote a simple function `sigCor`

that returns `TRUE`

if a correlation between

the vector `x`

and `y`

is significant at a user provided threshold and `FALSE`

otherwise. This function is designed to work with `apply`

to filter our

predictor set.

```
sigCor <- function(x, y, threshold = 0.2) {
p <- cor.test(x, y)$p.value
p <= threshold
}
```

The main function then generates a random covariance matrix and draws from a

multivariate normal centered at 0 with that covariance matrix. The values of

the matrix `X`

are then checked for correlation with the response `y`

. If the

absolute value of the correlation exceeds some threshold or if the p value for

the significance of the correlation is less than some threshold, the variable is

kept. Otherwise it is dropped. The function returns the number of retained

variables for each selection method.

```
simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) {
require(MASS)
require(clusterGeneration)
betas <- rep(1, nx + 1)
mu <- rep(0, nx)
sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
X <- mvrnorm(n, mu, sigma)
y <- genY(betas, X)
xcor <- cor(y, X)
limitedX <- X[, abs(xcor) >= corLim]
pX <- X[, apply(X, 2, sigCor, y, pLim)]
nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)
}
```

If we keep the defaults of `corLim = 0.2`

and `pLim = 0.2`

(keep any correlation

where the absolute value exceeds 0.2 and any correlation with a p value less

than 0.2) and use `replicate`

to run the function 1000 times with 10 predictors

and \( n = 1000 \), we can calculate the probability that the different method

selects the correct model. (Note that we set all of values of `betas`

to 1, so

all 11 coefficient estimates should be non-zero and all of the predictors are

correlated with the outcome).

```
results <- replicate(1000, (simulateSelection(10, 1000)))
correctModelFraction <- apply(results == 11, 1, mean)
correctModelFraction
```

```
## [1] 1.000 0.026 0.573
```

Using a filter based on the size of the correlation only yields the correct

model 2.6% of the time. Using a relatively

high p value as a threshold does better returning the correct model

57.3% of the time.

We can explore the effect of varying the value of `pLim`

using a few simple

modifications to the above code.

```
simulateSelectionP <- function(nx, n, pLim) {
require(MASS)
require(clusterGeneration)
betas <- rep(1, nx + 1)
mu <- rep(0, nx)
sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
X <- mvrnorm(n, mu, sigma)
y <- genY(betas, X)
pX <- X[, apply(X, 2, sigCor, y, pLim)]
correct <- (ncol(pX)) == nx
}
varyP <- function(nx, n, p) {
fractionCorrect <- numeric(100 * length(p))
j <- 1
for (i in 1:length(p)) {
set.seed(42)
fractionCorrect[(j):(j + 99)] <- replicate(100, simulateSelectionP(nx,
n, p[i]))
j <- j + 100
}
fractionCorrect
}
```

Using these new functions, we can calculate the fraction of the time that a

threshold of `p`

yields the correct model. If we vary `p`

from \( 0.01--0.50 \) and

plot the resulting curve,

```
prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20))
results <- data.frame(prCorrect = prCorrect, limit = rep(seq(0.01, 0.5, length.out = 20),
each = 100))
meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean)
sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd)
meanByLimit$sd <- sdByLimit$x
meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit$sd/sqrt(100)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) + geom_line() +
geom_ribbon(alpha = 0.25) + theme_bw() + scale_x_continuous("Threshold for p value") +
scale_y_continuous("Fraction of models in simulation that are correct")
```

we see that, as expected, the probability of getting the correct model goes up

as we increase the threshold value. We have about a 75% shot at getting the

right model using this method when we set the limit on the p value to be 0.50

compared to under 30% at a p value of 0.01.

However, this raises an interesting question: if we increase the threshold for

the p value, what is the value of using the bivariate correlation at all?

```
randomRate <- function(n, pLim) {
x <- rnorm(n)
y <- rnorm(n)
sigCor(x, y, pLim)
}
p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
}
rbind(p, rate)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## p 0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate 0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## p 0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
## [,20]
## p 0.5
## rate 0.5
```

As the threshold increases, the value of the bivariate correlation as a filter

goes down and goes down fast. At the limit required for there to be a 50% chance

of the model containing all relevant predictors, the rate of improperly included

variables would go up to nearly 20%.

The real problem comes into play here — the bivariate comparsions selects for

the wrong variables by over-emphasizing the relationship between the marginal

distributions. Because the selected model is incomplete and important

variables are omitted, the resulting parameter estimates are biased and

inaccurate.

The bivariate comparsion is a terrible way to select relevant variables for a

highly dimensional model as the function of interest is relating the all

of the predictors to the outcome. It can neither rule in nor rule out a

predictor from a model. A better approach is to generate the model using

domain knowledge and a model of the expected data generating process. If the

process is to remain strictly data driven, methods like

cross-validation or AIC/BIC provide better measures of model quality and

predictor importance than the bivariate correlation coefficient.

**leave a comment**for the author, please follow the link and comment on his blog:

**jacobsimmering.com**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...