# Stop using bivariate correlations for variable selection

March 19, 2014
By

(This article was first published on jacobsimmering.com, and kindly contributed to R-bloggers)

# Stop using bivariate correlations for variable selection

Something I've never understood is the widespread calculation and reporting of
univariate and bivariate statistics in applied work, especially when it comes
to model selection. Bivariate statistics are, at best, useless for
multi-variate model selection and, at worst, harmful. Since nearly all
questions of interest are highly dimensional, the use of bivariate statistics
is worrisome.

What typically happens is a researcher sits down with their statistical software
of choice and they compute a correlation between their response variable and
their collection of possible predictors. From here, they toss out potential
predictors that either have low correlation or for which the correlation is not
significant.

The concern here is that it is possible for the correlation between the marginal
distributions of the response and a predictor to be almost zero or
non-significant and for that predictor to be an important element in the data
generating pathway.

Consider the following simulation that illustrates the difference between the
marginal relationship between the response and a predictor and the true multi-
variable relationship. The function genY generates the response using user
provided values for $$\beta$$ and adds in a little white noise.

genY <- function(betas, X, stdPercent = 0.25) {
modelMatrix <- cbind(1, X)
yhat <- modelMatrix %*% betas
y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))
y
}


The function cor returns the correlation between two values and works with
a vector and a matrix argument but cor.test requires two vectors. For this
I wrote a simple function sigCor that returns TRUE if a correlation between
the vector x and y is significant at a user provided threshold and FALSE
otherwise. This function is designed to work with apply to filter our
predictor set.

sigCor <- function(x, y, threshold = 0.2) {
p <- cor.test(x, y)$p.value p <= threshold }  The main function then generates a random covariance matrix and draws from a multivariate normal centered at 0 with that covariance matrix. The values of the matrix X are then checked for correlation with the response y. If the absolute value of the correlation exceeds some threshold or if the p value for the significance of the correlation is less than some threshold, the variable is kept. Otherwise it is dropped. The function returns the number of retained variables for each selection method. simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) { require(MASS) require(clusterGeneration) betas <- rep(1, nx + 1) mu <- rep(0, nx) sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
X <- mvrnorm(n, mu, sigma)
y <- genY(betas, X)
xcor <- cor(y, X)
limitedX <- X[, abs(xcor) >= corLim]
pX <- X[, apply(X, 2, sigCor, y, pLim)]
nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)
}


If we keep the defaults of corLim = 0.2 and pLim = 0.2 (keep any correlation
where the absolute value exceeds 0.2 and any correlation with a p value less
than 0.2) and use replicate to run the function 1000 times with 10 predictors
and $$n = 1000$$, we can calculate the probability that the different method
selects the correct model. (Note that we set all of values of betas to 1, so
all 11 coefficient estimates should be non-zero and all of the predictors are
correlated with the outcome).

results <- replicate(1000, (simulateSelection(10, 1000)))
correctModelFraction <- apply(results == 11, 1, mean)
correctModelFraction

## [1] 1.000 0.026 0.573


Using a filter based on the size of the correlation only yields the correct
model 2.6% of the time. Using a relatively
high p value as a threshold does better returning the correct model
57.3% of the time.

We can explore the effect of varying the value of pLim using a few simple
modifications to the above code.

simulateSelectionP <- function(nx, n, pLim) {
require(MASS)
require(clusterGeneration)
betas <- rep(1, nx + 1)
mu <- rep(0, nx)
sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma X <- mvrnorm(n, mu, sigma) y <- genY(betas, X) pX <- X[, apply(X, 2, sigCor, y, pLim)] correct <- (ncol(pX)) == nx } varyP <- function(nx, n, p) { fractionCorrect <- numeric(100 * length(p)) j <- 1 for (i in 1:length(p)) { set.seed(42) fractionCorrect[(j):(j + 99)] <- replicate(100, simulateSelectionP(nx, n, p[i])) j <- j + 100 } fractionCorrect }  Using these new functions, we can calculate the fraction of the time that a threshold of p yields the correct model. If we vary p from $$0.01--0.50$$ and plot the resulting curve, prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20)) results <- data.frame(prCorrect = prCorrect, limit = rep(seq(0.01, 0.5, length.out = 20), each = 100)) meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean) sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd) meanByLimit$sd <- sdByLimit$x meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit\$sd/sqrt(100)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) + geom_line() +
geom_ribbon(alpha = 0.25) + theme_bw() + scale_x_continuous("Threshold for p value") +
scale_y_continuous("Fraction of models in simulation that are correct")


we see that, as expected, the probability of getting the correct model goes up
as we increase the threshold value. We have about a 75% shot at getting the
right model using this method when we set the limit on the p value to be 0.50
compared to under 30% at a p value of 0.01.

However, this raises an interesting question: if we increase the threshold for
the p value, what is the value of using the bivariate correlation at all?

randomRate <- function(n, pLim) {
x <- rnorm(n)
y <- rnorm(n)
sigCor(x, y, pLim)
}
p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
}
rbind(p, rate)

##       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
## p    0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate 0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
##       [,10]  [,11]  [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]
## p    0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
##      [,20]
## p      0.5
## rate   0.5


As the threshold increases, the value of the bivariate correlation as a filter
goes down and goes down fast. At the limit required for there to be a 50% chance
of the model containing all relevant predictors, the rate of improperly included
variables would go up to nearly 20%.

The real problem comes into play here — the bivariate comparsions selects for
the wrong variables by over-emphasizing the relationship between the marginal
distributions. Because the selected model is incomplete and important
variables are omitted, the resulting parameter estimates are biased and
inaccurate.

The bivariate comparsion is a terrible way to select relevant variables for a
highly dimensional model as the function of interest is relating the all
of the predictors to the outcome. It can neither rule in nor rule out a
predictor from a model. A better approach is to generate the model using
domain knowledge and a model of the expected data generating process. If the
process is to remain strictly data driven, methods like
cross-validation or AIC/BIC provide better measures of model quality and
predictor importance than the bivariate correlation coefficient.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...