Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Stop using bivariate correlations for variable selection

Something I've never understood is the widespread calculation and reporting of univariate and bivariate statistics in applied work, especially when it comes to model selection. Bivariate statistics are, at best, useless for multi-variate model selection and, at worst, harmful. Since nearly all questions of interest are highly dimensional, the use of bivariate statistics is worrisome.

What typically happens is a researcher sits down with their statistical software of choice and they compute a correlation between their response variable and their collection of possible predictors. From here, they toss out potential predictors that either have low correlation or for which the correlation is not significant.

The concern here is that it is possible for the correlation between the marginal distributions of the response and a predictor to be almost zero or non-significant and for that predictor to be an important element in the data generating pathway.

Consider the following simulation that illustrates the difference between the marginal relationship between the response and a predictor and the true multi- variable relationship. The function genY generates the response using user provided values for $$\beta$$ and adds in a little white noise.

genY <- function(betas, X, stdPercent = 0.25) {
modelMatrix <- cbind(1, X)
yhat <- modelMatrix %*% betas
y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))
y
}


The function cor returns the correlation between two values and works with a vector and a matrix argument but cor.test requires two vectors. For this I wrote a simple function sigCor that returns TRUE if a correlation between the vector x and y is significant at a user provided threshold and FALSE otherwise. This function is designed to work with apply to filter our predictor set.

sigCor <- function(x, y, threshold = 0.2) {
p <- cor.test(x, y)$p.value p <= threshold }  The main function then generates a random covariance matrix and draws from a multivariate normal centered at 0 with that covariance matrix. The values of the matrix X are then checked for correlation with the response y. If the absolute value of the correlation exceeds some threshold or if the p value for the significance of the correlation is less than some threshold, the variable is kept. Otherwise it is dropped. The function returns the number of retained variables for each selection method. simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) { require(MASS) require(clusterGeneration) betas <- rep(1, nx + 1) mu <- rep(0, nx) sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
X <- mvrnorm(n, mu, sigma)
y <- genY(betas, X)
xcor <- cor(y, X)
limitedX <- X[, abs(xcor) >= corLim]
pX <- X[, apply(X, 2, sigCor, y, pLim)]
nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)
}


If we keep the defaults of corLim = 0.2 and pLim = 0.2 (keep any correlation where the absolute value exceeds 0.2 and any correlation with a p value less than 0.2) and use replicate to run the function 1000 times with 10 predictors and $$n = 1000$$, we can calculate the probability that the different method selects the correct model. (Note that we set all of values of betas to 1, so all 11 coefficient estimates should be non-zero and all of the predictors are correlated with the outcome).

results <- replicate(1000, (simulateSelection(10, 1000)))
correctModelFraction <- apply(results == 11, 1, mean)
correctModelFraction

## [1] 1.000 0.026 0.573


Using a filter based on the size of the correlation only yields the correct model 2.6% of the time. Using a relatively high p value as a threshold does better returning the correct model 57.3% of the time.

We can explore the effect of varying the value of pLim using a few simple modifications to the above code.

simulateSelectionP <- function(nx, n, pLim) {
require(MASS)
require(clusterGeneration)
betas <- rep(1, nx + 1)
mu <- rep(0, nx)
sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma X <- mvrnorm(n, mu, sigma) y <- genY(betas, X) pX <- X[, apply(X, 2, sigCor, y, pLim)] correct <- (ncol(pX)) == nx } varyP <- function(nx, n, p) { fractionCorrect <- numeric(100 * length(p)) j <- 1 for (i in 1:length(p)) { set.seed(42) fractionCorrect[(j):(j + 99)] <- replicate(100, simulateSelectionP(nx, n, p[i])) j <- j + 100 } fractionCorrect }  Using these new functions, we can calculate the fraction of the time that a threshold of p yields the correct model. If we vary p from $$0.01--0.50$$ and plot the resulting curve, prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20)) results <- data.frame(prCorrect = prCorrect, limit = rep(seq(0.01, 0.5, length.out = 20), each = 100)) meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean) sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd) meanByLimit$sd <- sdByLimit$x meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit\$sd/sqrt(100)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) + geom_line() +
geom_ribbon(alpha = 0.25) + theme_bw() + scale_x_continuous("Threshold for p value") +
scale_y_continuous("Fraction of models in simulation that are correct")


we see that, as expected, the probability of getting the correct model goes up as we increase the threshold value. We have about a 75% shot at getting the right model using this method when we set the limit on the p value to be 0.50 compared to under 30% at a p value of 0.01.

However, this raises an interesting question: if we increase the threshold for the p value, what is the value of using the bivariate correlation at all?

randomRate <- function(n, pLim) {
x <- rnorm(n)
y <- rnorm(n)
sigCor(x, y, pLim)
}
p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
}
rbind(p, rate)

##       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
## p    0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate 0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
##       [,10]  [,11]  [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]
## p    0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
##      [,20]
## p      0.5
## rate   0.5


As the threshold increases, the value of the bivariate correlation as a filter goes down and goes down fast. At the limit required for there to be a 50% chance of the model containing all relevant predictors, the rate of improperly included variables would go up to nearly 20%.

The real problem comes into play here — the bivariate comparsions selects for the wrong variables by over-emphasizing the relationship between the marginal distributions. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.

The bivariate comparsion is a terrible way to select relevant variables for a highly dimensional model as the function of interest is relating the all of the predictors to the outcome. It can neither rule in nor rule out a predictor from a model. A better approach is to generate the model using domain knowledge and a model of the expected data generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.