Fun with the proto package: building an MCMC sampler for Bayesian regression

August 12, 2010
By

(This article was first published on Last Resort Software, and kindly contributed to R-bloggers)

The proto package is my latest favourite R goodie. It brings prototype-based programming to the R language - a style of programming that lets you do many of the things you can do with classes, but with a lot less up-front work. Louis Kates and Thomas Petzoldt provide an excellent introduction to using proto in the package vignette.

As a learning exercise I concocted the example below involving Bayesian logistic regression. It was inspired by an article on Matt Shotwell's blog about using R environments rather than lists to store the state of a Markov Chain Monte Carlo sampler. Here I use proto to create a parent class-like object (or trait in proto-ese) to contain the regression functions and create child objects to hold both data and results for individual analyses.

First here's an example session...

# Make up some data with a continuous predictor and binary response
nrec <- 500
x <- <- rnorm(nrec)
y <- rbinom(nrec, 1, plogis(2 - 4*x))

# Predictor matrix with a col of 1s for intercept
pred <- matrix(c(rep(1, nrec), x), ncol=2)
colnames(pred) <- c("intercept", "X")

# Load the proto package
library(proto)

# Use the Logistic parent object to create a child object which will
# hold the data and run the regression (the $ operator references
# functions and data within a proto object)
lr <- Logistic$new(pred, y)
lr$run(5000, 1000)

# lr now contains both data and results
str(lr)

proto object
$ cov : num [1:2, 1:2] 0.05 -0.0667 -0.0667 0.1621
..- attr(*, "dimnames")=List of 2
$ prior.cov: num [1:2, 1:2] 100 0 0 100
$ prior.mu : num [1:2] 0 0
$ beta : num [1:5000, 1:2] 2.09 2.09 2.09 2.21 2.21 ...
..- attr(*, "dimnames")=List of 2
$ adapt : num 1000
$ y : num [1:500] 0 1 1 1 1 1 1 1 1 1 ...
$ x : num [1:500, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "dimnames")=List of 2
parent: proto object

# Use the Logistic summary function to tabulate and plot results
lr$summary()

From 5000 samples after 1000 iterations burning in
intercept X
Min. :1.420 Min. :-5.296
1st Qu.:1.840 1st Qu.:-3.915
Median :2.000 Median :-3.668
Mean :1.994 Mean :-3.693
3rd Qu.:2.128 3rd Qu.:-3.455
Max. :2.744 Max. :-2.437



And here's the code for the Logistic trait...

Logistic <- proto()

Logistic$new <- function(., x, y) {
# Creates a child object to hold data and results
#
# x - a design matrix (ie. predictors
# y - a binary reponse vector

proto(., x=x, y=y)
}

Logistic$run <- function(., niter, adapt=1000) {
# Perform the regression by running the MCMC
# sampler
#
# niter - number of iterations to sample
# adapt - number of prior iterations to run
# for the 'burning in' period

require(mvtnorm)

# Set up variables used by the sampler
.$adapt <- adapt
total.iter <- niter + adapt
.$beta <- matrix(0, nrow=total.iter, ncol=ncol(.$x))
.$prior.mu <- rep(0, ncol(.$x))
.$prior.cov <- diag(100, ncol(.$x))
.$cov <- diag(ncol(.$x))

# Run the sampler
b <- rep(0, ncol(.$x))
for (i in 1:total.iter) {
b <- .$update(i, b)
.$beta[i,] <- b
}

# Trim the results matrix to remove the burn-in
# period
if (.$adapt > 0) {
.$beta <- .$beta[(.$adapt + 1):total.iter,]
}
}

Logistic$update <- function(., it, beta.old) {
# Perform a single iteration of the MCMC sampler using
# Metropolis-Hastings algorithm.
# Adapted from code by Brian Neelon published at:
# http://www.duke.edu/~neelo003/r/
#
# it - iteration number
# beta.old - vector of coefficient values from
# the previous iteration

# Update the coefficient covariance if we are far
# enough through the sampling
if (.$adapt > 0 & it > 2 * .$adapt) {
.$cov <- cov(.$beta[(it - .$adapt):(it - 1),])
}

# generate proposed new coefficient values
beta.new <- c(beta.old + rmvnorm(1, sigma=.$cov))

# calculate prior and current probabilities and log-likelihood
if (it == 1) {
.$..log.prior.old <- dmvnorm(beta.old, .$prior.mu, .$prior.cov, log=TRUE)
.$..probs.old <- plogis(.$x %*% beta.old)
.$..LL.old <- sum(log(ifelse(.$y, .$..probs.old, 1 - .$..probs.old)))
}
log.prior.new <- dmvnorm(beta.new, .$prior.mu, .$prior.cov, log=TRUE)
probs.new <- plogis(.$x %*% beta.new)
LL.new <- sum(log(ifelse(.$y, probs.new, 1-probs.new)))

# Metropolis-Hastings acceptance ratio (log scale)
ratio <- LL.new + log.prior.new - .$..LL.old - .$..log.prior.old

if (log(runif(1)) < ratio) {
.$..log.prior.old <- log.prior.new
.$..probs.old <- probs.new
.$..LL.old <- LL.new
return(beta.new)
} else {
return(beta.old)
}
}

Logistic$summary <- function(., show.plot=TRUE) {
# Summarize the results

cat("From", nrow(.$beta), "samples after", .$adapt, "iterations burning in\n")
base::print(base::summary(.$beta))

if (show.plot) {
par(mfrow=c(1, ncol(.$beta)))
for (i in 1:ncol(.$beta)) {
plot(density(.$beta[,i]), main=colnames(.$beta)[i])
}
}
}


Now that's probably not the greatest design in the world, but it only took me a few minutes to put it together and it's incredibly easy to modify or extend. Try it !

Thanks to Brian Neelon for making his MCMC logistic regression code available (http://www.duke.edu/~neelo003/r/).

To leave a comment for the author, please follow the link and comment on his blog: Last Resort Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags:

Comments are closed.