Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Consider the following dataset, with (only) ten points

x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
plot(x,y,pch=19,cex=2)

We want to get – say – two clusters. Or more specifically, two sets of observations, each of them sharing some similarities.

Since the number of observations is rather small, it is actually possible to get an exhaustive list of all partitions, and to minimize some criteria, such as the within variance. Given a vector with clusters, we compute the within variance using

within_var = function(I){
I0=which(I==0)
I1=which(I==1)
xbar0=mean(x[I0])
xbar1=mean(x[I1])
ybar0=mean(y[I0])
ybar1=mean(y[I1])
w=sum(I0)*sum( (x[I0]-xbar0)^2+(y[I0]-ybar0)^2 )+
sum(I1)*sum( (x[I1]-xbar1)^2+(y[I1]-ybar1)^2 )
return(c(I,w))
}

Then, to compute all possible partitions, use

base2=function(z,n=10){
Base.b=rep(0,n)
ndigits=(floor(logb(z, base=2))+1)
for(i in 1:ndigits){
Base.b[ n-i+1]=(z%%2)
z=(z%/%2)}
return(Base.b)}
L=function(x) within_var(base2(x))
S=sapply(1:(2^10),L)

The cluster indices at the mimimum is here

I=S[1:n,which.min(S[n+1,])]


To visualize those clusters, use

cluster_viz = function(indices){
library(RColorBrewer)
CL2palette=rev(brewer.pal(n = 9, name = "RdYlBu"))
CL2f=CL2palette[c(1,9)]
plot(x,y,pch=19,xlab="",ylab="",xlim=0:1,ylim=0:1,cex=2,col=CL2f[1+I])
CL2cl=CL2palette[c(3,7)]
I0=which(indices==0)
I1=which(indices==1)
xbar0=mean(x[I0])
xbar1=mean(x[I1])
ybar0=mean(y[I0])
ybar1=mean(y[I1])
segments(x[I0],y[I0],xbar0,ybar0,col=CL2c[1])
segments(x[I1],y[I1],xbar1,ybar1,col=CL2c[2])
points(xbar0,ybar0,pch=19,cex=1.5,col=CL2c[1])
points(xbar1,ybar1,pch=19,cex=1.5,col=CL2c[2])}

and then, simply

cluster_viz(I)

But that was possible only because $n$ is not to large (since the total number of scenarios – with only 2 clusters – is $2^n$, or $2^{n-1}$ if we changes zeros in ones).

One idea can be to use hierarchical clustering techniques. For instance, with a complete aggregation technique, we get

H=hclust(dist(cbind(x,y)),method="complete")
I=cutree(H,k=2)-1
cluster_viz(I)

If we change the aggregation technique, to Ward, we get

H=hclust(dist(cbind(x,y)),method="ward.D")
I=cutree(H,k=2)-1
cluster_viz(I)

or, for the single one,

H=hclust(dist(cbind(x,y)),method="single")
I=cutree(H,k=2)-1
cluster_viz(I)

Another idea is to use some $k$-means algorithm (e.g. Lloyd’s)

km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd")
I=km$cluster-1 cluster_viz(I) But if we run it another time, we might have different starting points, and different clusters km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd") I=km$cluster-1
cluster_viz(I)

or

even

A natural strategy is to run it many times, with different starting points, and to keep the one with the smallest within-variance.

But to be honest, I like the idea that drawing different starting point will yield different clusters. Actually, we can search some kind of robustness. For instance, which observations are more likely to belong to the same cluster as the second one (on top) ?

ref=2
M=NULL
for(s in 1:10000) {
km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd")
I=km\$cluster-1
I=(I[ref]==1)*I+(I[ref]==0)*(1-I)
M=rbind(M,I) }
M_prop=apply(M,2,mean)

or the fifth one (on the left)