Dirichlet Process, Infinite Mixture Models, and Clustering

April 7, 2013

(This article was first published on Statistical Research » R, and kindly contributed to R-bloggers)

The Dirichlet process provides a very interesting approach to understand group assignments and models for clustering effects.   Often time we encounter the k-means approach.  However, it is necessary to have a fixed number of clusters.  Often we encounter situations where we don’t know how many fixed clusters we need.  Suppose we’re trying to identify groups of voters.  We could use political partisanship (e.g. low/medium/high Democratic vote) but that may not necessary describe the data appropriately.  If this is the case then we can turn to Bayesian nonparametrics and the Dirichlet Process and use some approaches there to solve this problem.  Three in particular are commonly used as examples: the Chinese Restaurant ModelPólya’s Urn, and Stick Breaking.

Chinese Restaurant Model

Chinese Restaurant Model

 The Chinese Restaurant Model is based on idea that there is a restaurant with an infinite number of tables.  At each table there are an infinite number of seats.  The first customer arrives and sits down at a table.  The second customer then arrives and selects a table.  However, the customer selects the table that the first customer is currently sitting with probability \alpha/(1+\alpha) or selects a new table with 1/(1+\alpha).  This continues on to the (n+1)^{st} customer where they select a table that a current customer is sitting with probability n_{k}/(n+\alpha).

crp = function(num.customers, alpha) {
table < - c(1)
next.table <- 2
for (i in 1:num.customers) {
if (runif(1,0,1) < alpha / (alpha + i)) {
# Add a new ball color.
table <- c(table, next.table)
next.table <- next.table+1
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
select.table <- table[sample(1:length(table), 1)]
table <- c(table, select.table)
crp(100, 4)
table( crp(10000, 2) )
,xlab="Table Index", ylab="Frequency"


Pólya’s Urn Model

In the Pólya’s Urn model we take the approach where there exists a urn of colored balls.  We take a ball out of the urn and note its color.  We replace the ball back into the urn and then we add an additional ball of the same color to the urn. This process can continue on infinitely.

rgb2hex < - function(x){
hex.part = ""
hex <- ""
for(i in 1:3){
b16 <- x[,i]/16
int.one <- trunc(b16)
val.one < - letters[int.one-10+1]
} else {
val.one <- int.one
fract <- abs( b16 - int.one )
int.two <- fract*16
val.two < - letters[int.two-10+1]
} else {
val.two <- int.two
hex.part[i] <- paste(val.one,val.two, sep="")
hex <- paste(hex,hex.part[i], sep="")
polyaUrnModel = function(baseColorDistribution, num.balls, alpha) {
balls < - c()
for (i in 1:num.balls) {
if (runif(1,0,1) < alpha / (alpha + length(balls))) {
# Add a new ball color.
color.comb < - expand.grid(x=seq(0,255),y=seq(0,255),z=seq(0,255))
location.picker <- rnorm(1,nrow(color.comb)/2, (nrow(color.comb)-1)/4 )
the.color <- c( color.comb[location.picker,1], color.comb[location.picker,2], color.comb[location.picker,3])
the.hex <- paste("#",rgb2hex(the.color), sep="")
new.color <- the.hex
balls = c(balls, new.color)
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
ball = balls[sample(1:length(balls), 1)]
balls = c(balls, ball)
pum < - polyaUrnModel(function() rnorm(1,0,1), 100, 1)
barplot( table(pum), col=names(table(pum)), pch=10 )


Polya Urn Model

Stick Breaking Model

With this third model we simply start breaking a stick and continue to break that stick into smaller pieces.  This process works by taking a stick of length 1.0.  We then generate one random number from the Beta distribution (\beta_{1} ~ Beta(1,\alpha). We then break the stick at \beta_1. The left side of the stick we’ll call \nu_1.  We then take the remaining stick to the right and break it again at location (\beta_{2}  ~ Beta(1, \alpha). Once again the piece to the left we’ll call \nu_2 = \left(1-\beta_1\right) \cdot \beta_2. The sum of all of the probabilities generated will add up to 1.0.

stickBreakingModel = function(num.vals, alpha) {
betas = rbeta(num.vals, 1, alpha)
stick.to.right = c(1, cumprod(1 - betas))[1:num.vals]
weights = stick.to.right * betas

plot( stickBreakingModel(100,5), pch=16, cex=.5 )


Stick Breaking Probabilities

Multivariate Clustering

# Generate some fake data with some uniform random means
generateFakeData < - function( num.vars=3, n=100, num.clusters=5, seed=NULL ) {
} else {
data <- data.frame(matrix(NA, nrow=n, ncol=num.vars+1))

mu <- NULL
for(m in 1:num.vars){
mu <- cbind(mu,rnorm(num.clusters, runif(1,-10,15), 5))

for (i in 1:n) {
cluster <- sample(1:num.clusters, 1)
data[i, 1] <- cluster
for(j in 1:num.vars){
data[i, j+1] <- rnorm(1, mu[cluster,j], 1)

data$X1 <- factor(data$X1)
var.names <- paste("VAR",seq(1,ncol(data)-1), sep="")
names(data) <- c("cluster",var.names)


# Set up a procedure to calculate the cluster means using squared distance
dirichletClusters <- function(orig.data, disp.param = NULL, max.iter = 100, tolerance = .001)
n <- nrow( orig.data )

data <- as.matrix( orig.data )
pick.clusters <- rep(1, n)
k <- 1

mu <- matrix( apply(data,2,mean), nrow=1, ncol=ncol(data) )

is.converged <- FALSE
iteration <- 0

ss.old <- Inf
ss.curr <- Inf

while ( !is.converged & iteration < max.iter ) { # Iterate until convergence
iteration <- iteration + 1

for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center's squared distance from its mean
distances <- rep(NA, k)

for( j in 1:k ){
distances[j] <- sum( (data[i, ] - mu[j, ])^2 ) # Distance formula.

if( min(distances) > disp.param ) { # If the dispersion parameter is still less than the minimum distance then create a new cluster
k < - k + 1
pick.clusters[i] <- k
mu <- rbind(mu, data[i, ])
} else {
pick.clusters[i] <- which(distances == min(distances))


# Calculate new cluster means
for( j in 1:k ) {
if( length(pick.clusters == j) > 0 ) {
mu[j, ] < - apply(subset(data,pick.clusters == j), 2, mean)

# Test for convergence
ss.curr <- 0
for( i in 1:n ) {
ss.curr <- ss.curr +
sum( (data[i, ] - mu[pick.clusters[i], ])^2 )
ss.diff <- ss.old - ss.curr
ss.old <- ss.curr
if( !is.nan( ss.diff ) & ss.diff < tolerance ) {
is.converged <- TRUE


centers <- data.frame(mu)
ret.val <- list("centers" = centers, "cluster" = factor(pick.clusters),
"k" = k, "iterations" = iteration)


create.num.vars <- 3
orig.data <- generateFakeData(create.num.vars, num.clusters=3, n=1000, seed=123)
dp.update <- dirichletClusters(orig.data[,2:create.num.vars+1], disp.param=25)
ggplot(orig.data, aes(x = VAR1, y = VAR3, color = cluster)) + geom_point()

 Clustering 1
In this example I have provided some R code that clusters variables based an any given number of variables.  The measure of distance from the group centroid is the multivariate sum of squared distance, though there are many other distance metrics that could be implemented (e.g. manhattan, euclidean, etc.)
Clustering 2

To leave a comment for the author, please follow the link and comment on their blog: Statistical Research » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)