Visualizing Clusters

February 24, 2015

Consider the following dataset, with (only) ten points

x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
plot(x,y,pch=19,cex=2)

We want to get – say – two clusters; more specifically, two sets of observations, each sharing some similarities.

Since the number of observations is rather small, it is actually possible to get an exhaustive list of all partitions, and to minimize some criterion, such as the within-cluster variance. Given a vector of cluster labels (0s and 1s), we compute the within variance using

within_var = function(I){
  I0 = which(I==0)
  I1 = which(I==1)
  xbar0 = mean(x[I0]); ybar0 = mean(y[I0])
  xbar1 = mean(x[I1]); ybar1 = mean(y[I1])
  # within-cluster sum of squares: squared distance of each point
  # to its own cluster centroid, summed over the two clusters
  w = sum( (x[I0]-xbar0)^2+(y[I0]-ybar0)^2 ) +
      sum( (x[I1]-xbar1)^2+(y[I1]-ybar1)^2 )
  return(c(I,w))
}
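
For instance, on an arbitrary labeling (splitting on x > .5, purely for illustration), the last element returned is the within variance:

I_test = as.numeric(x > .5)   # arbitrary labels, for illustration only
within_var(I_test)            # the ten labels, followed by the within variance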

Then, to enumerate all possible partitions, encoded as binary vectors, use

base2 = function(z, n=10){
  # binary expansion of z, on n digits (one digit per observation)
  Base.b = rep(0, n)
  ndigits = floor(logb(z, base=2)) + 1
  for(i in 1:ndigits){
    Base.b[n-i+1] = z %% 2
    z = z %/% 2 }
  return(Base.b) }
n = 10
L = function(z) within_var(base2(z))
S = sapply(1:(2^n), L)

The cluster labels at the minimum are then

I=S[1:n,which.min(S[n+1,])]
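
As a quick sanity check, we can also read off the attained minimum:

min(S[n+1,])   # the smallest within variance over all labelings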

To visualize those clusters, use

cluster_viz = function(indices){
  library(RColorBrewer)
  CL2palette = rev(brewer.pal(n = 9, name = "RdYlBu"))
  CL2f = CL2palette[c(1,9)]   # point colors, one per cluster
  plot(x, y, pch=19, xlab="", ylab="", xlim=0:1, ylim=0:1, cex=2,
       col=CL2f[1+indices])
  CL2c = CL2palette[c(3,7)]   # segment and centroid colors
  I0 = which(indices==0)
  I1 = which(indices==1)
  xbar0 = mean(x[I0]); ybar0 = mean(y[I0])
  xbar1 = mean(x[I1]); ybar1 = mean(y[I1])
  segments(x[I0], y[I0], xbar0, ybar0, col=CL2c[1])
  segments(x[I1], y[I1], xbar1, ybar1, col=CL2c[2])
  points(xbar0, ybar0, pch=19, cex=1.5, col=CL2c[1])
  points(xbar1, ybar1, pch=19, cex=1.5, col=CL2c[2]) }

and then, simply

cluster_viz(I)

But that was possible only because n is not too large: with 2 clusters, the total number of labelings is 2^n, or 2^{n-1} if we identify each labeling with the one obtained by swapping zeros and ones.
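
To see the factor of two at work: codes 1 to 2^(n-1)-1 are exactly those whose first digit is 0, so each two-cluster partition is visited once (its mirror image, with zeros and ones swapped, has a leading one). A minimal sketch:

S_half = sapply(1:(2^(n-1)-1), L)
I_half = S_half[1:n, which.min(S_half[n+1,])]   # same optimal partition (labels possibly swapped)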

One idea can be to use hierarchical clustering techniques. For instance, with complete-linkage aggregation, we get

H=hclust(dist(cbind(x,y)),method="complete")
I=cutree(H,k=2)-1
cluster_viz(I)
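
We can also look at the dendrogram before cutting it (a minimal sketch using the standard plot and rect.hclust functions):

plot(H)               # dendrogram of the complete-linkage clustering
rect.hclust(H, k=2)   # outline the two retained clusters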

If we change the aggregation technique to Ward's, we get

H=hclust(dist(cbind(x,y)),method="ward.D")
I=cutree(H,k=2)-1
cluster_viz(I)

or, with single-linkage aggregation,

H=hclust(dist(cbind(x,y)),method="single")
I=cutree(H,k=2)-1
cluster_viz(I)

Another idea is to use some k-means algorithm (e.g. Lloyd's)

km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd")
I=km$cluster-1
cluster_viz(I)

But if we run it again, the (random) starting points will differ, and so may the clusters:

km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd")
I=km$cluster-1
cluster_viz(I)

or even other partitions, from one run to the next.

A natural strategy is then to run the algorithm many times, with different starting points, and to keep the solution with the smallest within variance.
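
This is precisely what the nstart argument of kmeans does, keeping the best of several random initializations; a minimal sketch:

km = kmeans(cbind(x,y), centers=2, nstart=100, algorithm="Lloyd")
km$tot.withinss   # within variance of the retained solution
I = km$cluster-1
cluster_viz(I)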

But to be honest, I like the idea that drawing different starting points will yield different clusters. Actually, we can use this to assess some kind of robustness: for instance, which observations are most likely to belong to the same cluster as the second one (at the top)?

ref=2   # the reference observation
M=NULL
for(s in 1:10000) {
  km = kmeans(cbind(x,y), centers=2, nstart=1, algorithm="Lloyd")
  I = km$cluster-1
  # relabel so that the reference observation is always in cluster 1
  I = (I[ref]==1)*I + (I[ref]==0)*(1-I)
  M = rbind(M,I) }
# proportion of runs in which each point shares the reference's cluster
M_prop = apply(M, 2, mean)
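
Those proportions can be put back on the scatterplot, for instance by scaling each point by its co-clustering frequency (a hypothetical rendering, not the original figure):

plot(x, y, pch=19, cex=.5+2.5*M_prop, xlab="", ylab="", xlim=0:1, ylim=0:1)
points(x[ref], y[ref], pch=1, cex=3)   # circle the reference observation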

or the fifth one (on the left)?
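
The same loop, re-run with a different reference point, answers that:

ref=5   # the fifth observation
M=NULL
for(s in 1:10000) {
  km = kmeans(cbind(x,y), centers=2, nstart=1, algorithm="Lloyd")
  I = km$cluster-1
  I = (I[ref]==1)*I + (I[ref]==0)*(1-I)
  M = rbind(M,I) }
M_prop = apply(M, 2, mean)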
