k-means clustering and Voronoi sets

February 22, 2015
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In the context of http://latex.codecogs.com/gif.latex?k-means, we want to partition the space of our observations into http://latex.codecogs.com/gif.latex?k classes. each observation belongs to the cluster with the nearest mean. Here “nearest” is in the sense of some norm, usually the http://latex.codecogs.com/gif.latex?ell_2 (Euclidean) norm.

Consider the case where we have 2 classes. The means being respectively the 2 black dots. If we partition based on the nearest mean, with the http://latex.codecogs.com/gif.latex?ell_2 (Euclidean) norm we get the graph on the left, and with the http://latex.codecogs.com/gif.latex?ell_1 (Manhattan) norm, the one on the right,

Points in the red region are closer to the mean in the upper part, while points in the blue region are closer to the mean in the lower part. Here, we will always use the standard http://latex.codecogs.com/gif.latex?ell_2 (Euclidean) norm. Note that the graph above is related to Voronoi diagrams (or Voronoy, from Вороний in Ukrainian, or Вороно́й in Russian) with 2 points, the 2 means.

In order to illustrate the http://latex.codecogs.com/gif.latex?k-means clustering algorithm (here Lloyd’s algorithm) consider the following dataset

set.seed(1)
pts <- cbind(X=rnorm(500,rep(seq(1,9,by=2)/10,100),.022),Y=rnorm(500,.5,.15))
plot(pts)

Here, we have 5 groups.  So let us run a 5-means algorithm here.

  • we draw randomly 5 points in the space (intial values for the means), http://latex.codecogs.com/gif.latex?boldsymbol{mu}_1^{(1)},cdots,boldsymbol{mu}_k^{(1)}
  • in the assignment step, we assign each point to the nearest mean

http://latex.codecogs.com/gif.latex?S_i^{(t)}%20=%20big%20{%20boldsymbol{x}_j%20:%20big%20|%20boldsymbol{x}_j%20-%20boldsymbol{mu}^{(t)}_i%20big%20|^2%20leq%20big%20|%20boldsymbol{x}_j%20-%20boldsymbol{mu}^{(t)}_{i%27}%20big%20|^2%20%20forall%20i%27in{1%20,cdots,%20k}%20big}

  • in the update step, we compute the new centroids of the clusters

http://latex.codecogs.com/gif.latex?%20%20%20%20boldsymbol{mu}^{(t+1)}_i%20=%20frac{1}{|S^{(t)}_i|}%20sum_{boldsymbol{x}_j%20in%20S^{(t)}_i}%20boldsymbol{x}_j

To visualize it, see

The code the get the clusters is

kmeans(pts, centers=5, nstart = 1, algorithm = "Lloyd")

Observe that the assignment step is based on computations of Voronoi sets. This can be done in R using

library(tripack)
V <- voronoi.mosaic(means[,1],means[,2])
P <- voronoi.polygons(V)
points(V,pch=19)
plot(V,add=TRUE)

This is what we can visualize below

The code to visualize the http://latex.codecogs.com/gif.latex?k means, and the http://latex.codecogs.com/gif.latex?k clusters (or regions), use

km1 <- kmeans(pts, centers=5, nstart = 1, algorithm = "Lloyd")
library(tripack)
library(RColorBrewer)
CL5 <- brewer.pal(5, "Pastel1")
V <- voronoi.mosaic(km1$centers[,1],km1$centers[,2])
P <- voronoi.polygons(V)
plot(pts,pch=19,xlim=0:1,ylim=0:1,xlab="",ylab="",col=CL5[km1$cluster])
points(km1$centers[,1],km1$centers[,2],pch=3,cex=1.5,lwd=2)
plot(V,add=TRUE)

Here, starting points are draw randomly. If we run it again, we might get

or

On that dataset, it is difficult to get cluster that are the five groups we can actually see. If we use

set.seed(1)
A <- c(rep(.2,100),rep(.2,100),rep(.5,100),rep(.8,100),rep(.8,100))
B <- c(rep(.2,100),rep(.8,100),rep(.5,100),rep(.2,100),rep(.8,100))
pts <- cbind(X=rnorm(500,A,.075),Y=rnorm(500,B,.075))

we usually get something better

Colors are obtained from clusters of the http://latex.codecogs.com/gif.latex?k-means function, but additional lines are obtained using as outputs of Voronoi diagrams functions.

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)