Why an inverse-Wishart prior may not be such a good idea


While playing around with Bayesian methods for random effects models, it occurred to me that inverse-Wishart priors can really bite you in the bum. Inverse-Wishart priors are popular priors over covariance matrices. People like them because they are conjugate to a Gaussian likelihood, i.e., if you have data {\mathbf{y}_{1},\ldots,\mathbf{y}_{n}} with each {\mathbf{y}_{i}}:

\displaystyle \mathbf{y}_{i}\sim\mathcal{N}\left(0,\mathbf{S}\right)

so that the {\mathbf{y}_{i}}'s are correlated Gaussian vectors, and you wish to infer the covariance matrix {\mathbf{S}}, then putting an inverse-Wishart prior on {\mathbf{S}} is convenient because the posterior distribution is very easy to sample from and its mean can be computed analytically.
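
To make the conjugacy concrete, here is a minimal sketch in R (the helper name posterior.mean is mine, not from the original post). With an {\mathcal{IW}\left(m,\mathbf{M}\right)} prior on {\mathbf{S}} (the parameters {m} and {\mathbf{M}} are defined just below) and zero-mean Gaussian data stacked as the rows of {\mathbf{Y}}, the posterior is {\mathcal{IW}\left(m+n,\mathbf{M}+\sum_{i}\mathbf{y}_{i}\mathbf{y}_{i}^{t}\right)}:

# Conjugate update for S ~ IW(m, M) with zero-mean Gaussian rows in Y;
# the posterior mean exists when m + n > p + 1
posterior.mean <- function(Y, m, M) {
  n <- nrow(Y)
  p <- ncol(Y)
  (M + t(Y) %*% Y) / (m + n - p - 1)
}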

The inverse-Wishart works like this: a Wishart distribution {\mathcal{W}\left(m,\Lambda\right)} produces random positive definite matrices by first producing {m} Gaussian vectors:

\displaystyle \mathbf{x}_{i}\sim\mathcal{N}\left(0,\Lambda\right)

the Wishart sample is {m} times the sample covariance matrix:

\displaystyle \mathbf{W}=\sum_{i=1}^{m}\mathbf{x}_{i}\mathbf{x}_{i}^{t}

which is why for large {m} Wishart samples will look like {m\Lambda}. {m} is the number of degrees of freedom, and we want to keep it low if the prior is to be relatively noninformative.
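
A quick sanity check by simulation (the particular {m} and {\Lambda} below are arbitrary illustrative choices):

library(mvtnorm)
m <- 1000
Lambda <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
X <- rmvnorm(m, sigma = Lambda)  # m Gaussian vectors, one per row
W <- t(X) %*% X                  # a Wishart(m, Lambda) draw
W / m                            # close to Lambda when m is large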

A matrix {\mathbf{S}} has an inverse-Wishart distribution {\mathcal{IW}\left(m,\mathbf{M}\right)} if its inverse has a Wishart distribution {\mathcal{W}\left(m,\mathbf{M}^{-1}\right)}. Again, since for large {m} the inverse will look like {m\mathbf{M}^{-1}}, for large {m} {\mathbf{S}} will look like {\mathbf{M}/m}. More generally, the mean of the inverse-Wishart is {\frac{\mathbf{M}}{m-p-1}}, where {p} is the dimension (so the mean only exists when {m>p+1}).
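
We can check the mean formula by simulation; riw below is a throwaway helper of my own (the code at the end of the post defines an equivalent rprior):

library(mvtnorm)
# Draw from IW(m, M) by inverting a Wishart(m, M^{-1}) draw
riw <- function(m, M) chol2inv(chol(crossprod(rmvnorm(m, sigma = solve(M)))))
m <- 10; p <- 2; M <- diag(p)
draws <- replicate(5000, riw(m, M), simplify = FALSE)
Reduce(`+`, draws) / length(draws)  # approximately M/(m - p - 1) = M/7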

Let’s say we want to do Bayesian inference for the correlation of two Gaussian variables. Essentially, that’s a special case of what we started out with: {n} datapoints {\mathbf{y}_{i}\sim\mathcal{N}\left(0,\mathbf{S}\right)}, with each {\mathbf{y}_{i}} a 2-dimensional vector containing the observations for the two Gaussian variables. We are interested in one aspect of {p(\mathbf{S}|\mathbf{Y})}, namely the marginal for the correlation coefficient, i.e. {r=s_{21}/\sqrt{s_{11}s_{22}}}.

What are generally reasonable assumptions about {\mathbf{S}}? I think we don’t want the prior to be too sensitive to scale – we don’t know how big or small the variances of the components are going to be, but we might have a reasonable range. As far as the correlation coefficient is concerned, it’s fairly rare to have variables that are perfectly correlated, so we might want our prior to shrink correlation coefficients a bit.

A good default choice might then be to take {\mathbf{S}\sim\mathcal{IW}\left(3,\mathbf{I}\right)}, with I the identity matrix. The scale matrix of the prior is the identity, although note that with {m=3} and {p=2} we have {m-p-1=0}, so the prior mean is not actually finite: the prior has heavy tails. If we simulate from the prior, we get a marginal distribution for the log-variance that looks like this:

[Figure: histogram of the variance of the first component under the prior, on a log scale]

which is suitably spread over several orders of magnitude. The marginal distribution for the correlation coefficient is:

[Figure: histogram of the correlation coefficient under the prior]

which is uniform – not what we wanted (we'd like very high absolute correlations to be less probable). That's not the main problem with this prior, though. The big problem is this:

[Figure: log-variance of the first component plotted against the correlation coefficient, one point per prior draw]

For each sample {\mathbf{S}_{j}} from the prior, I'm plotting the log-variance of the first component against the correlation coefficient. There's a clear lack of independence, which is even easier to see in the conditional distribution of the correlation coefficient. Below, the histogram of the correlation coefficient conditioned on both variances being greater than one (“High variance”), or not (“Low variance”):

[Figure: histograms of the correlation coefficient in the “Low variance” and “High variance” conditions]

What we see here is a strong dependence between variance and correlation: the prior says that high variance implies high correlation, and low variance implies low-to-moderate correlation. This is a disaster for inference, because it means that correlation will tend to be exaggerated if variance is higher than expected, which is the opposite of the shrinkage behaviour we'd like to see.
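
One quick way to put a number on this dependence, using the data frame df built by the code at the end of the post (this summary is mine, not in the original):

# Mean absolute correlation among low-variance vs high-variance draws
with(df, tapply(abs(cor), var1 > 1 & var2 > 1, mean))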

There are better priors for covariance matrices out there, but sometimes you might be stuck with the Wishart for computational reasons (for example, it’s the only choice you have in INLA for random effects). An option is to estimate the variances first, then tweak the inverse-Wishart prior to have the right scale. Increasing the value of {m} will provide correlation shrinkage. From a Bayesian point of view this is moderately dirty, but preferable to just sticking with the default choice (and see here for a prior choice with good frequentist properties). Kass & Natarajan (2006) is a much more sophisticated version of this strategy.
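
As a sketch of that rescaling trick (the variance estimates and the choice {m=10} below are purely illustrative):

v.hat <- c(0.8, 2.5)            # plug-in estimates of the two variances
p <- 2
m <- 10                         # larger m gives stronger correlation shrinkage
M <- (m - p - 1) * diag(v.hat)  # so that E(S) = M/(m - p - 1) = diag(v.hat)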

Below, some R code to reproduce the plots:

require(plyr)
require(ggplot2)
require(mvtnorm)

# Given a 2x2 covariance matrix, compute the correlation coefficient
ccm <- function(M) {
  M[2, 1] / prod(sqrt(diag(M)))
}

# Generate n samples from the IW(r, M) prior by inverting Wishart draws
rprior <- function(n = 1, r = 3, M = diag(rep(1, 2))) {
  Minv <- solve(M)
  rlply(n, chol2inv(chol(rwish(r, Minv))))
}

# Wishart samples: r Gaussian vectors with covariance R, summed outer products
rwish <- function(r, R) {
  X <- rmvnorm(r, sigma = R)
  t(X) %*% X
}

plot.marginal.variance <- function(df) {
  p <- ggplot(df, aes(var1)) + geom_histogram(aes(y = ..density..)) + scale_x_log10()
  p + theme_bw() + labs(x = "\nVariance of the first component", y = "Density\n")
}

plot.marginal.correlation <- function(df) {
  p <- ggplot(df, aes(cor)) + geom_histogram(aes(y = ..density..))
  p + theme_bw() + labs(x = "\nCorrelation coefficient", y = "Density\n") +
    scale_x_continuous(limits = c(-1, 1))
}

plot.corr.var <- function(df) {
  p <- ggplot(df, aes(var1, cor)) + geom_point() + scale_x_log10()
  p + theme_bw() + labs(x = "\nVariance of the first component", y = "Correlation coefficient\n")
}

plot.conditional.corr <- function(df) {
  df$high.var <- factor(with(df, var1 > 1 & var2 > 1),
                        labels = c("Low variance", "High variance"))
  p <- ggplot(df, aes(cor)) + geom_histogram(aes(y = ..density..)) + facet_wrap(~high.var)
  p + theme_bw() + labs(x = "\nCorrelation coefficient", y = "Density\n") +
    scale_x_continuous(limits = c(-1, 1))
}

df <- ldply(rprior(2000), function(M) data.frame(var1 = M[1, 1], var2 = M[2, 2], cor = ccm(M)))
plot.marginal.variance(df)
plot.marginal.correlation(df)
plot.corr.var(df)
plot.conditional.corr(df)
