Generalized Double Pareto Priors for Regression

September 10, 2014

This post reviews the paper "Generalized Double Pareto Shrinkage" by Armagan, Dunson and Lee (Statistica Sinica, 2013).

Consider the regression model \(Y=X\beta+\varepsilon\) where we put a generalized double Pareto distribution as the prior on the regression coefficients \(\beta\). The GDP distribution has density
$$\begin{equation}
f(\beta|\xi,\alpha)=\frac{1}{2\xi}\left( 1+\frac{|\beta|}{\alpha\xi} \right)^{-(\alpha+1)}.
\end{equation}$$
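To get a feel for this density, here is a quick R sketch (the helper name dgdp and the parameter values are mine, not the paper's); larger \(\alpha\) gives lighter, more exponential-like tails:

#GDP density; dgdp is a helper name used only for this post
dgdp=function(beta,xi=1,alpha=1) (1/(2*xi))*(1+abs(beta)/(alpha*xi))^(-(alpha+1))
curve(dgdp(x,xi=1,alpha=1),-5,5,ylab="f(beta)",main="GDP densities")
curve(dgdp(x,xi=1,alpha=3),-5,5,add=TRUE,lty=2)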

GDP as Scale Mixture of Normals

The GDP distribution can be conveniently represented as a scale mixture of normals. Let
$$\begin{align*}
\beta_{i}|\phi,\tau_{i} &\sim N(0,\phi^{-1}\tau_{i})\\
\tau_{i}|\lambda_{i} &\sim Exp\left(\frac{\lambda_{i}^{2}}{2}\right)\\
\lambda_{i} &\sim Ga(\alpha,\eta),
\end{align*}$$
then \(\beta|\phi \sim GDP\left(\xi=\frac{\eta}{\sqrt{\phi}\alpha},\alpha\right)\).
To see this, first note that \(\beta_{i}|\phi,\lambda_{i}\) has a Laplace or double exponential distribution with rate parameter \(\sqrt{\phi}\lambda_{i}\):
$$\begin{align*}
p(\beta_{i}|\phi,\lambda_{i})&=\int p(\beta_{i}|\phi,\tau_{i})p(\tau_{i}|\lambda_{i})d\tau_{i}\\
\psi(t)&=\int e^{it\beta_{i}} \int p(\beta_{i}|\phi,\tau_{i})p(\tau_{i}|\lambda_{i})d\tau_{i}\, d\beta_{i}\\
&=\int \int e^{it\beta_{i}}p(\beta_{i}|\phi,\tau_{i})d\beta_{i}\,p(\tau_{i}|\lambda_{i})d\tau_{i}\\
&=\int e^{-\frac{1}{2}\frac{\tau_{i}}{\phi}t^{2}}p(\tau_{i}|\lambda_{i})d\tau_{i}\\
&=\frac{\lambda_{i}^{2}}{2} \int e^{-\frac{1}{2}\left(\frac{t^{2}}{\phi}+\lambda_{i}^{2}\right)\tau_{i}}d\tau_{i}\\
&=\frac{\phi\lambda_{i}^{2}}{t^{2}+\phi\lambda_{i}^{2}},
\end{align*}$$
which is the characteristic function of a double exponential distribution with rate parameter \(\sqrt{\phi}\lambda_{i}\).
Lastly,
$$\begin{align*}
p(\beta_{i}|\phi)&=\int p(\beta_{i}|\phi,\lambda_{i})p(\lambda_{i})d\lambda_{i}\\
&=\frac{1}{2}\sqrt{\phi}\frac{\eta^{\alpha}}{\Gamma(\alpha)}\frac{\Gamma(\alpha+1)}{(\eta+\sqrt{\phi}|\beta_{i}|)^{\alpha+1}}\\
&=\frac{1}{2}\frac{\sqrt{\phi}\alpha}{\eta}\left( 1+\frac{\sqrt{\phi}\alpha}{\eta}\frac{|\beta_{i}|}{\alpha} \right)^{-(\alpha+1)},
\end{align*}$$
which is the density of a \(GDP\left(\xi=\frac{\eta}{\sqrt{\phi}\alpha},\alpha\right)\).
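As a sanity check on this representation, one can simulate \(\beta\) from the three-stage hierarchy and overlay the GDP density implied by \(\xi=\frac{\eta}{\sqrt{\phi}\alpha}\). A minimal sketch in R, with arbitrary parameter choices of my own:

#Simulate beta from the normal scale mixture and compare with the GDP density
set.seed(1)
n=1e5; alpha=3; eta=2; phi=1
lambda=rgamma(n,shape=alpha,rate=eta)   #Ga(alpha,eta), rate parametrization
tau=rexp(n,rate=lambda^2/2)             #Exp(lambda^2/2)
beta=rnorm(n,0,sqrt(tau/phi))           #N(0,tau/phi)
xi=eta/(sqrt(phi)*alpha)
hist(beta,breaks=200,freq=FALSE,xlim=c(-5,5))
curve((1/(2*xi))*(1+abs(x)/(alpha*xi))^(-(alpha+1)),add=TRUE,col="red")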

EM Algorithm

\(\tau_{i}\) and \(\lambda_{i}\) are treated as missing data for each \(i\).
$$\begin{align*}
Q(\beta,\phi||\beta^{(t)},\phi^{(t)})&=c+\mathbb{E}_{\tau,\lambda}\left[ \log p(\beta,\phi|Y,\tau,\lambda)|\beta^{(t)},\phi^{(t)} \right]\\
&=\frac{n+p-3}{2}\log\phi - \frac{\phi}{2}||Y-X\beta||^{2}-\frac{\phi}{2}\sum_{i=1}^{p}\beta_{i}^{2}\mathbb{E}\left[ \frac{1}{\tau_{i}} \right].
\end{align*}$$

Expectation

For the iterated expectation one needs the distributions \(\tau_{i}|\lambda_{i},\beta_{i},\phi\) and \(\lambda_{i}|\beta_{i},\phi\).
$$\begin{align*}
p(\tau_{i}|\beta_{i},\lambda_{i},\phi)&\propto p(\beta_{i}|\phi,\tau_{i})p(\tau_{i}|\lambda_{i})\\
&\propto \left( \frac{1}{\tau_{i}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\left(\frac{\phi \beta_{i}^{2}}{\tau_{i}}+\lambda_{i}^{2}\tau_{i}\right)}
\end{align*}$$
This is the kernel of a generalized inverse Gaussian distribution, specifically \(p(\tau_{i}|\beta_{i},\lambda_{i},\phi)=GIG(\tau_{i}:\lambda_{i}^{2},\phi \beta_{i}^{2},\frac{1}{2})\).
By a standard change of variables it follows that \(p(\frac{1}{\tau_{i}}|\beta_{i},\lambda_{i},\phi)=IG(\frac{1}{\tau_{i}}:\sqrt{\frac{\lambda_{i}^{2}}{\phi \beta_{i}^{2}}},\lambda_{i}^{2})\), an inverse Gaussian, and so \(\mathbb{E}\left[ \frac{1}{\tau_{i}}|\lambda_{i},\beta^{(t)},\phi^{(t)} \right]=\frac{\lambda_{i}}{\sqrt{\phi^{(t)}}|\beta_{i}^{(t)}|}\).

Recall that \(p(\beta_{i}|\phi,\lambda_{i})\) is a double exponential distribution with rate \(\sqrt{\phi}\lambda_{i}\).
Hence from \(p(\lambda_{i}|\beta_{i},\phi)\propto p(\beta_{i}|\lambda_{i},\phi)p(\lambda_{i})\) it follows that \(\lambda_{i}|\beta_{i},\phi \sim Ga(\alpha+1,\eta+\sqrt{\phi}|\beta_{i}|)\). Performing the expectation with respect to \(\lambda_{i}\) then yields
$$\begin{align*}
\mathbb{E}\left[ \frac{1}{\tau_{i}}|\beta^{(t)},\phi^{(t)} \right]=\left( \frac{\alpha+1}{\eta+\sqrt{\phi^{(t)}}|\beta_{i}^{(t)}|} \right)\left( \frac{1}{\sqrt{\phi^{(t)}}|\beta_{i}^{(t)}|} \right).
\end{align*}$$
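In code this E-step is a single vectorized expression. A small R sketch (the helper name Einvtau is mine):

#E[1/tau_i] given current estimates b and phi
#undefined when a coefficient is exactly zero, since E[1/tau_i] blows up
Einvtau=function(b,phi,alpha,eta)
  (alpha+1)/((eta+sqrt(phi)*abs(b))*(sqrt(phi)*abs(b)))
Einvtau(b=c(0.5,-2),phi=1,alpha=1,eta=1)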

Maximization

Writing \(D^{(t)}=\mathrm{diag}(\mathbb{E}[\frac{1}{\tau_{1}}],\dots,\mathbb{E}[\frac{1}{\tau_{p}}])\), the function to maximize is
$$\begin{align*}
Q(\beta,\phi||\beta^{(t)},\phi^{(t)})&=c+\mathbb{E}_{\tau,\lambda}\left[ \log p(\beta,\phi|Y,\tau,\lambda)|\beta^{(t)},\phi^{(t)} \right]\\
&=\frac{n+p-3}{2}\log\phi - \frac{\phi}{2}||Y-X\beta||^{2}-\frac{\phi}{2}\beta^{\prime}D^{(t)}\beta,
\end{align*}$$
which is maximized by setting
$$\begin{align*}
\beta^{(t+1)}&=(X^{\prime}X+D^{(t)})^{-1}X^{\prime}Y\\
\phi^{(t+1)}&=\frac{n+p-3}{Y^{\prime}(I-X(X^{\prime}X+D^{(t)})^{-1}X^{\prime})Y}\\
&=\frac{n+p-3}{||Y-X\beta^{(t+1)}||^{2}+\beta^{(t+1)\prime}D^{(t)}\beta^{(t+1)}}.
\end{align*}$$
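Before moving to the compiled implementation below, here is a minimal pure-R sketch of the complete EM iteration implied by these updates (the function name, starting values and tolerance are my choices, not the paper's). Since it implements the same E and M steps, it should share its fixed points with the RcppArmadillo version that follows.

gdp_em_r=function(yo,xo,alpha,eta,tol=1e-5,maxit=20000){
  no=nrow(xo); p=ncol(xo)
  yo=yo-mean(yo)
  B=crossprod(xo,yo)/no                 #crude starting value
  phi=no/sum((yo-xo%*%B)^2)
  for(t in 1:maxit){
    #E-step: expected inverse local variances E[1/tau_i]
    #assumes no coefficient estimate is exactly zero
    d=(alpha+1)/((eta+sqrt(phi)*abs(B))*(sqrt(phi)*abs(B)))
    #M-step: ridge-type update with data-adaptive penalty D
    Bnew=solve(crossprod(xo)+diag(as.numeric(d),p),crossprod(xo,yo))
    phinew=(no+p-3)/(sum((yo-xo%*%Bnew)^2)+sum(d*Bnew^2))
    done=sum((Bnew-B)^2)<tol && abs(phinew-phi)<tol
    B=Bnew; phi=phinew
    if(done) break
  }
  list(B=B,phi=phi,iterations=t)
}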

RcppArmadillo Code

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

double gdp_log_posterior_density(int no, int p, double alpha, double eta, const Col<double>& yo, const Mat<double>& xo, const Col<double>& B, double phi);

// [[Rcpp::export]]
List gdp_em(NumericVector ryo, NumericMatrix rxo, SEXP ralpha, SEXP reta){

	//Define Variables//
	int p=rxo.ncol();
	int no=rxo.nrow();
	double eta=Rcpp::as<double>(reta);
	double alpha=Rcpp::as<double>(ralpha);

	//Create Data//
	arma::mat xo(rxo.begin(), no, p, false);
	arma::colvec yo(ryo.begin(), ryo.size(), false);
	yo-=mean(yo);

	//Pre-Processing//
	Col<double> xoyo=xo.t()*yo;
	Col<double> B=xoyo/no;
	Col<double> Babs=abs(B);
	Mat<double> xoxo=xo.t()*xo;
	Mat<double> D=eye(p,p);
	Mat<double> Ip=eye(p,p);
	double yoyo=dot(yo,yo);
	double deltaB;
	double deltaphi;
	double phi=no/dot(yo-xo*B,yo-xo*B);
	double lp;

	//Create Trace Matrices
	Mat<double> B_trace(p,20000);
	Col<double> phi_trace(20000);
	Col<double> lpd_trace(20000);

	//Run EM Algorithm//
	Rcout << "Beginning EM Algorithm" << std::endl;
	int t=0;
	B_trace.col(t)=B;
	phi_trace(t)=phi;
	lpd_trace(t)=gdp_log_posterior_density(no,p,alpha,eta,yo,xo,B,phi);
	do{
		t=t+1;


		//E-step: D holds the inverse square root of E[1/tau_i] for each coefficient//
		Babs=abs(B);
		D=diagmat(sqrt(((eta+sqrt(phi)*Babs)%(sqrt(phi)*Babs))/(alpha+1)));
		//M-step: equivalent to B=(X'X+D^-2)^-1 X'y in a rescaled parametrization//
		B=D*solve(D*xoxo*D+Ip,D*xoyo);

		phi=(no+p-3)/(yoyo-dot(xoyo,B));

		//Store Values//
		B_trace.col(t)=B;
		phi_trace(t)=phi;
		lpd_trace(t)=gdp_log_posterior_density(no,p,alpha,eta,yo,xo,B,phi);

		deltaB=dot(B_trace.col(t)-B_trace.col(t-1),B_trace.col(t)-B_trace.col(t-1));
		deltaphi=std::fabs(phi_trace(t)-phi_trace(t-1)); //absolute change, since phi can decrease between iterations
	} while((deltaB>0.00001 || deltaphi>0.00001) && t<19999);
	Rcout << "EM Algorithm Converged in " << t << " Iterations" << std::endl;

	//Resize Trace Matrices//
	//Keep iterations 0..t; resizing to t would drop the final stored values//
	B_trace.resize(p,t+1);
	phi_trace.resize(t+1);
	lpd_trace.resize(t+1);

	return Rcpp::List::create(
			Rcpp::Named("B") = B,
			Rcpp::Named("B_trace") = B_trace,
			Rcpp::Named("phi") = phi,
			Rcpp::Named("phi_trace") = phi_trace,
			Rcpp::Named("lpd_trace") = lpd_trace
			) ;

}



double gdp_log_posterior_density(int no, int p, double alpha, double eta, const Col<double>& yo, const Mat<double>& xo, const Col<double>& B, double phi){

	double lpd;
	double xi=eta/(sqrt(phi)*alpha);
	lpd=0.5*(no-1)*log(phi/(2*M_PI))
		-p*log(2*xi)
		-(alpha+1)*sum(log(1+abs(B)/(alpha*xi)))
		-0.5*phi*dot(yo-xo*B,yo-xo*B)
		-log(phi);
	return(lpd);

}

An Example in R

rm(list=ls())
library(Rcpp)
library(RcppArmadillo)
sourceCpp("src/gdp_em.cpp")

#Generate Design Matrix
set.seed(3)
no=100
foo=rnorm(no,0,1)
sd=4
xo=foo+matrix(rnorm(no*288,0,sd),no,288) #288 collinear predictors sharing the common factor foo

#Scale and Center Design Matrix
xo=scale(xo,center=T,scale=F)
ss=apply(xo^2,2,sum) #column sums of squares; avoids masking base::var
xo=scale(xo,center=F,scale=sqrt(ss/no))

#Generate Data under True Model
p=dim(xo)[2]
b=rep(0,p)
b[1:5]=1:5 #first five coefficients are 1..5, the rest zero
yo=xo%*%b+rnorm(no,0,1)
yo=yo-mean(yo)

#Run GDP
gdp=gdp_em(yo,xo,100,100)

#Posterior Density Increasing at Every Iteration?
gdp$lpd_trace[2:dim(gdp$lpd_trace)[1],1]-gdp$lpd_trace[1:(dim(gdp$lpd_trace)[1]-1),1]>=0
mean(gdp$lpd_trace[2:dim(gdp$lpd_trace)[1],1]-gdp$lpd_trace[1:(dim(gdp$lpd_trace)[1]-1),1]>=0)

#Plot Results
plot(gdp$B,ylab=expression(beta[GDP]),main="GDP MAP Estimate of Regression Coefficients")

[Figure: Generalized double Pareto estimated coefficients]

West, M. (1987). On scale mixtures of normal distributions. Biometrika, 74(3), 646-648. DOI: http://dx.doi.org/10.1093/biomet/74.3.646

Armagan, A., Dunson, D. B., & Lee, J. (2013). Generalized double Pareto shrinkage. Statistica Sinica, 23, 119-143. arXiv:1104.0861

Figueiredo, M. A. T. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9), 1150-1159. DOI: http://dx.doi.org/10.1109/tpami.2003.1227989

Also see this similar post on the Bayesian lasso.
