How to sample from multidimensional distributions using Gibbs sampling?

Posted on October 8, 2017 by Appsilon Data Science Blog in R bloggers | 0 Comments

[This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.

Random sampling with rabbit on the bed plane

via GIPHY

To start, what are MCMC algorithms and what are they based on?

Suppose we are interested in generating a random variable $X$ with a distribution of $\pi$ , over $\mathcal{X}$ .
If we are not able to do this directly, we will be satisfied with generating a sequence of random variables $X_0, X_1, \ldots$ , which in a sense tending to a distribution of $\pi$ . The MCMC approach explains how to do so:

Build a Markov chain $X_0, X_1, \ldots$ , for $\mathcal{X}$ , whose stationary distribution is $\pi$ . If the structure is correct, we should expect random variables to converge to $\pi$ .

In addition, we can expect that for function $f: \mathcal{X} \rightarrow \mathbb{R}$ , occurs:
$\mathbb{E}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i})$

with probability equals to 1.

In essence, we want the above equality to occur for any arbitrary random variable $X_0$ .

One of the MCMC algorithms that guarantee the above properties is the so-called Gibbs sampler.

Gibbs Sampler – description of the algorithm.

Assumptions:

$\pi$ is defined on the product space $\mathcal{X} = \Pi_{i=1}^{d}\mathcal{X_{i}}$
We are able to draw from the conditional distributions $\pi(X_{i}|X_{-i})$ , where $X_{-i} = (X_{1}, \ldots X_{i-1}, X_{i+1}, \ldots X_{d})$

Algorithm steps:

Select the initial values $X_{k}^{(0)}, k = 1, \ldots d$
For $t=1,2,\ldots$ repeat:

For $k = 1, \ldots d$ sample $X_{k}^{(t)}$ from distribution $\pi(X_{k}^{(t)}\rvert X_{1}^{(t)}, \ldots X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots X_{d}^{(t-1)})$
Repeat step 2 until the distribution of vector $(X_{1}^{(t)}, \ldots X_{d}^{(t)})$ stabilizes.
The subsequent iterations of point 2 will provide a randomization of $\pi$ .

How do we understand point 3?
We are most satisfied when average coordinates stabilize to some accuracy.

Intuition in two-dimensional case:

Source: [3]

Gibbs sampling for randomization with a two-dimensional normal distribution.

We will sample from the distribution of $\theta \sim \mathcal{N}(0, [\sigma_{ij}])$ , where $\sigma_{11} = \sigma_{22} = 1$ and $\sigma_{12} = \sigma_{21} = \rho$ .

Knowing joint density of $\theta$ , it’s easy to show, that: $\theta_{1}|\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})$
$\theta_{2}|\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})$

R implementation:

gibbs_normal_sampling <- function(n_iteration, init_point, ro) {
  # init point is some numeric vector of length equals to 2
  theta_1 <- c(init_point[1], numeric(n_iteration))
  theta_2 <- c(init_point[2], numeric(n_iteration))
  for (i in 2:(n_iteration+1)) {
    theta_1[i] <- rnorm(1, ro * theta_2[i-1], sqrt(1 - ro^2))
    theta_2[i] <- rnorm(1, ro * theta_1[i], sqrt(1 - ro^2))
  }
  list(theta_1, theta_2)
}

Using the above function, let’s see how the 10 points were scored at $\rho = 0.5$ :

And for 10000 iterations:

We leave you a comparison of how the stability of the parameters changes depending on the selected $\rho$ parameter.

Let’s move on to use the Gibbs sampler to estimate the density parameters.

We will show the use of the Gibbs sampler and bayesian statistics to estimate the mean parameters in the mix of normal distributions.

Assumptions (simplified case):

iid. sample $Y=Y_{1},\ldots Y_{N}$ comes from a mixture of normal distributions $(1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2})$ , where $\pi$ , $\sigma_{1}$ i $\sigma_{2}$ are known.
For i=1,2 $\mu_{i} \sim \mathcal{N}(0, 1)$ (a priori distributions) and $\mu_{1}$ with $\mu_{2}$ are independent.
$\Delta = (\Delta_{1}, \ldots, \Delta_{N})$ - the classification vector (unobserved) from which Y is derived (when $\Delta_{i} = 0$ , $Y_{i}$ is drawn from $\mathcal{N}(\mu_{1}, \sigma_{1}^{2})$ , when $\Delta_{i} = 1$ then drawn from $\mathcal{N}(\mu_{2}, \sigma_{2}^{2})$ ).
$\mathbb{P}(\Delta_{i} = 1) = \pi$

With above assumptions it can be shown that:

$f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}$
$(\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})$
$(\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})$
$\mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}$

The form of the algorithm:

Choose starting point for the mean $(\mu_{1}^{0}, \mu_{2}^{0})$
In the -th iteration do:
- With the probability:
  $\frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}$ draw $\Delta_{i}^{(k)}$ for $i = 1, \ldots, N$
- Calculate:
  $\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}$
  $\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}$
- Generate:
  $\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})$
  $\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})$
Perform step 2. until the distribution of vector $(\Delta, \mu_{1}, \mu_{2})$ stabilizes.

How to do this in R?

At start let’s generate random sample from mixture of normals with parameters $(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2)$ .

set.seed(12345)
mu_1 <- 10
mu_2 <- 2
sigma_1 <- 1
sigma_2 <- 2
pi_known <- 0.7
n <- 2000
pi_sampled <- rbinom(n, 1, pi_known)
y_1 <- rnorm(n, mu_1, sigma_1)
y_2 <- rnorm(n, mu_2, sigma_2)
y <- (1 - pi_sampled) * y_1 + pi_sampled * y_2

Take a quick look at a histogram of our data:

The following task is to estimate $\mu_1$ and $\mu_2$ from above model.

mu_init <- c(12, 0)
n_iterations <- 300
delta <- vector("list", n_iterations)
mu_1_vec <- c(mu_init[1], numeric(n_iterations))
mu_2_vec <- c(mu_init[2], numeric(n_iterations))

delta_probability <- function(i, y, mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known) {
  pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2) /
      ((1- pi_known) * dnorm(y, mu_1_vec[i - 1], sigma_1) + pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2))
}

mu_1_mean <- function(delta, i, y) {
  sum((1 - delta[[i - 1]]) * y) / (1 + sum(1 - delta[[i - 1]]))
}

mu_2_mean <- function(delta, i, y) {
  sum(delta[[i - 1]] * y) / (1 + sum(delta[[i - 1]]))
}

for (j in 2:(n_iterations + 1)) {
  delta[[j - 1]] <- y %>% map_int(
    ~ rbinom(1, 1, delta_probability(j, ., mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known)))
  mu_1_vec[j] <- rnorm(1, mu_1_mean(delta, j, y), sqrt(1 / (1 + sum(1 - delta[[j - 1]]))))
  mu_2_vec[j] <- rnorm(1, mu_2_mean(delta, j, y), sqrt(1 / (1 + sum(delta[[j - 1]]))))
}

Let’s see the relation of sampled data to known one:

The following plot presents the mean of the $\Delta$ vector at each iteration:

Let’s check how mean of parameters $\mu_1$ and $\mu_2$ stabilize at used algorithm:

Note how little iteration was enough to stabilize the parameters.

Finally let’s see estimated density with our initial sample:

To those concerned about the topic, refer to [1] where you can find a generalization of normal distribution mixture by extending a priori distributions to other parameters.

It is also worth to compare the above algorithm with its deterministic counterpart, Expectation Maximization (EM) algorithm see [2].

Write your question and comments below. We’d love to hear what you think.

Resources:

[1] http://gauss.stat.su.se/rr/RR2006_1.pdf

[2] http://web.stanford.edu/~hastie/ElemStatLearn/

[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm

Read the original post at Appsilon Data Science Blog.

To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

How to sample from multidimensional distributions using Gibbs sampling?

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)