# Impact of correlated predictions on the variance of an ensemble model

August 21, 2014
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let $X_1$ and $X_2$ be the prediction errors of two statistical/machine learning algorithms. $X_1$ and $X_2$ have relatively low bias, and high variances $\sigma^2_1$ and $\sigma^2_2$. They are also correlated, having a Pearson correlation coefficient equal to $\rho_{1, 2}$.

Aggregating models $1$ and $2$ might result in a predictive model with lower prediction error variance than $1$ and $2$. But not all the times. For those who attended statistics/probability/portfolio optimization classes, this may probably sound obvious; you can directly jump to the illustrative R part, below.

Let $Z := \alpha X_1 + (1-\alpha) X_2$, with $\alpha \in \left[0, 1\right]$, be the prediction error of the ensemble model built with $1$ and $2$. We have : $Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha)Cov(X_1, X_2)$

And from the fact that : $Cov(X_1, X_2) = \rho_{1, 2} \sigma_1\sigma_2$

We get : $Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha) \rho_{1, 2} \sigma_1\sigma_2$

Now, let’s see how $Var(Z)$ changes with an increase of $\alpha$, the ensemble’s allocation for model $1$ : $\frac{\partial Var(Z)}{\partial \alpha}= 2\alpha \sigma^2_1 - 2 (1 - \alpha) \sigma^2_2 + 2(1-2\alpha) \rho_{1, 2} \sigma_1\sigma_2$

When $\alpha$ is close to $0$, that is, when the ensemble contains almost only model $2$, we have : $\frac{\partial Var(Z)}{\partial \alpha}_{|\alpha = 0}= 2 \left( \rho_{1, 2} \sigma_1\sigma_2 - \sigma^2_2 \right) = 2 \sigma^2_2\left( \rho_{1, 2} \frac{\sigma_1}{\sigma_2} - 1 \right)$

That’s the relative change in the variance of the ensemble prediction error, induced by introducing model $1$ in an ensemble containing almost only $2$. Hence, if $\rho_{1, 2} = \frac{\sigma_2}{\sigma_1}$, increasing the allocation of model $1$ won’t increase or decrease the variance of ensemble prediction at all. The variance will decrease if we introduce in the ensemble a model $1$, so that $\rho_{1, 2} \leq \frac{\sigma_2}{\sigma_1}$. If $\rho_{1, 2} \geq \frac{\sigma_2}{\sigma_1}$, it won’t decrease, no matter the combination of $X_1$ and $X_2$.

For a simple illustrative example in R, I create simulated data observations :

  # number of observations
n <- 100
u <- 1:n

# Simulated observed data
intercept <- 5
slope <- 0.2
set.seed(2)

## data
y <-  intercept + slope*u + rnorm(n)
plot(u, y, type = 'l', main = "Simulated data observations")
points(u, y, pch = 19) I fit a linear regression model to the data, as a benchmark model :

  # Defining training and test sets
train <- 1:(0.7*n)
test <- -train
u.train <- u[(train)]
y.train <- y[(train)]
u.test <- u[(test)]
y.test <- y[(test)]

# Fitting the benchmark model to the training set
fit.lm <- lm(y.train ~ u.train)

(slope.coeff <- fit.lm$coefficients) (intercept.coeff <- fit.lm$coefficients)

## u.train
## 0.1925
## (Intercept)
## 5.292


Obtaining the predicted values from the benchmark model, and prediction error

  # Predicted values from benchmark model on the test set
y.pred <- intercept.coeff + slope.coeff*u.test

# prediction error from linear regression
pred.err <- y.pred - y.test
(mean.pred.err <- mean(pred.err))
(var.pred.err <- var(pred.err))

##  -0.1802
##  1.338


Now, I consider two other models, $1$ and $2$ : $y = a + b \times u + sin(u) + \epsilon_1$

and $y = a + b \times u - 0.35 \times sin(u) + \epsilon_2$

with $\epsilon_1$ and $\epsilon_2$ being 2 correlated gaussians with zero mean and constant variances (well, not really “models” that i fit, but these help me to build fictitious various predictions with different correlations for the illustration). The slope and intercept are those obtained from the benchmark model.

 # Alternative model 1 (low bias, high variance, oscillating)
m <- length(y.pred)
eps1 <- rnorm(m, mean = 0, sd = 1.5)
y.pred1 <- intercept.coeff + slope.coeff*u.test + sin(u.test) + eps1

# prediction error for model 1
pred.err1 <- y.pred1 - y.test


We can visualize the predictions of model $1$, $2$ and the benchmark, with different prediction errors correlations between model $1$ and $2$, to get an intuition of the possible ensemble predictions :

  # Different prediction errors correlations for 1 and 2
rho.vec <- c(-1, -0.8, 0.6, 1)

# Independent random gaussian numbers defining model 2 errors
eps <- rnorm(m, mean = 0, sd = 2)

# Plotting the predictions with different correlations
par(mfrow=c(2, 2))
for (i in 1:4)
{
rho <- rho.vec[i]

# Correlated gaussian numbers (Cholesky decomposition)
eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

# Alternative  model 2 (low bias, higher variance than 1, oscillating)
y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2

# prediction error for model 2
pred.err2 <- y.pred2 - y.test

# predictions from 1 & 2 correlation
corr.pred12 <- round(cor(pred.err1, pred.err2), 2)

# Plot
plot(u.test, y.test, type = "p",
xlab = "test set values", ylab = "predicted values",
main = paste("models 1 & 2 pred. values n correlation :",
corr.pred12))
points(u.test, y.test, pch = 19)
lines(u.test, y.pred, lwd = 2)
lines(u.test, y.pred1,
col = "blue", lwd = 2)
lines(u.test, y.pred2,
col = "red", lwd = 2)
} Now including allocations of models $1$ and $2$, we can see how the ensemble variance evolves as a function of allocation and correlation :

  # Allocation for model 1 in the ensemble
alpha.vec <- seq(from = 0, to = 1, by = 0.05)

# Correlations between model 1 and model 2
rho.vec <- seq(from = -1, to = 1, by = 0.05)

# Results matrices
nb.row <- length(alpha.vec)
nb.col <- length(rho.vec)

## Average prediction errors of the ensemble
mean.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
rownames(mean.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
colnames(mean.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

## Variance of prediction error of the ensemble
var.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
rownames(var.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
colnames(var.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

# loop on correlations and allocations
for (i in 1:nb.row)
{
for (j in 1:nb.col)
{
alpha <- alpha.vec[i]
rho <- rho.vec[j]

# Alternative model 2 (low bias, higher variance, oscillating)
eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps
y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2
pred.err2 <- y.pred2 - y.test

# Ensemble prediction error
z <- alpha*pred.err1 + (1-alpha)*pred.err2
mean.pred.err.ens[i, j] <- mean(z)
var.pred.err.ens[i, j] <-  var(z)
}
}
res.var <- var.pred.err.ens

# Heat map for the variance of the ensemble
filled.contour(alpha.vec, rho.vec, res.var, color = terrain.colors,
main = "prediction error variance for the ensemble",
xlab = "allocation in x1", ylab = "correlation between 1 and 2") Hence, the lower the correlation between $1$ and $2$, the lower the variance of the ensemble model prediction. This, combined with an allocation of $1$ comprised between $35\%$ and $50\%$ seem to be the building blocks for the final model. The ensemble models biases helps in making a choice of allocation (this is actually an optimization problem that can be directly solved with portfolio theory).

Finding the final ensemble, with lower variance and lower bias :

# Final ensemble

## Allocation
alpha.vec[which.min(as.vector(res.var))]
##  0.45

## Correlation
res.bias <- abs(mean.pred.err.ens)
which.min(res.bias[which.min(as.vector(res.var)), ])
## corr(1, 2) :
-0.7
## 7


Creating the final model with these parameters :

    rho <- -0.7
eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

# Alternative model 2 (low bias, higher variance, oscillating)
y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2

# Final ensemble prediction
y.pred.ens <- 0.45*y.pred1 + 0.55*y.pred2

# Plot
plot(u.test, y.test, type = "p",
xlab = "test set", ylab = "predicted values",
main = "Final ensemble model (green)")
points(u.test, y.test, pch = 19)
# benchmark
lines(u.test, y.pred, lwd = 2)
# model 1
lines(u.test, y.pred1, col = "blue", lwd = 2)
# model 2
lines(u.test, y.pred2, col = "red", lwd = 2)
# ensemble model with 1 and 2
points(u.test, y.pred.ens, col = "green", pch = 19)
lines(u.test, y.pred.ens, col = "green", lwd = 2) Performance of the final model :

    # Benchmark
pred.err <- y.pred - y.test
# Model 1
pred1.err <- y.pred1 - y.test
# Model 2
pred2.err <- y.pred2 - y.test
# Ensemble model
pred.ens.err <- y.pred.ens - y.test

# Bias comparison
bias.ens <- mean(y.pred.ens - y.test)
bias.ens_vsbench <- (bias.ens/mean(y.pred - y.test) - 1)*100
bias.ens_vs1 <- (bias.ens/mean(y.pred1 - y.test) - 1)*100
bias.ens_vs2 <- (bias.ens/mean(y.pred2 - y.test) - 1)*100

# Variance comparison
var.ens <- var(y.pred.ens - y.test)
var.ens_vsbench <- (var.ens/var(y.pred - y.test) - 1)*100
var.ens_vs1 <- (var.ens/var(y.pred1 - y.test) - 1)*100
var.ens_vs2 <- (var.ens/var(y.pred2 - y.test) - 1)*100

cbind(c(bias.ens_vsbench, bias.ens_vs1, bias.ens_vs2), c(var.ens_vsbench, var.ens_vs1, var.ens_vs2))
##        [,1]  [,2]
## [1,] -95.31 46.27
## [2,] -112.12 -62.93
## [3,] -88.33 -45.53  R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.