Buster – a new R package for bagging hierarchical clustering

July 9, 2014

(This article was first published on Drunks&Lampposts » R, and kindly contributed to R-bloggers)

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space, so I had to use hierarchical clustering. But I also wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering, but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, and not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation.

Some background

Clustering (I'm assuming everyone is happy with this technique) typically works on a matrix of distances between data points. We sometimes refer to the distances as dissimilarities: the greater the distance, the more dissimilar the data points. So in a simple case the data points might be customers, and the distances reflect how different the customers are in terms of demographics, purchasing behaviour, etc.

A simple way to achieve this would be to plot your data points against a set of axes representing the things you would like to include in the dissimilarity measure and then just measure the distance between the points. Scaling the data would ensure that none of the attributes is prioritised over the others. For example, plot your customers by age and income (both standardised) and then measure the distance between them to determine how similar they are in age and income.
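As a rough sketch of the age-and-income example in R, using only base functions: the `customers` data frame and its values are made up for illustration and are not from any real data set.

```r
# Hypothetical customer data; the values are invented for illustration.
customers <- data.frame(
  age    = c(23, 35, 47, 52, 61),
  income = c(21000, 48000, 52000, 90000, 39000)
)

# Standardise each column (mean 0, standard deviation 1) so that income,
# which has far larger raw values, does not dominate the distance.
scaled <- scale(customers)

# Pairwise Euclidean distances between the standardised customers.
d <- dist(scaled, method = "euclidean")
print(round(d, 2))
```

The result is a `dist` object holding one distance per pair of customers, which is the form most clustering functions in R expect.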

This is Euclidean distance, and it is implicit in many popular and powerful clustering algorithms, for example k-means and its variants.

But how do you know if your measure of dissimilarity is representable as a Euclidean distance and therefore amenable to k-means? It’s simple enough if you started with some variables and derived your Euclidean distance from them, but it doesn’t always work this way. For example, suppose I am measuring similarity by correlation. Is the absolute value of the correlation representable as a Euclidean distance? Is there some n-dimensional space where we could plot our data points such that the distance between them represented the absolute value of their correlation?
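To make the correlation case concrete, here is a minimal sketch in base R. It rests on an assumption of mine: I turn correlation into a dissimilarity as 1 minus its absolute value, which is one common choice and not necessarily the measure intended above. The data are random numbers generated purely for illustration.

```r
set.seed(1)  # make the made-up data reproducible

# Hypothetical data: 50 observations of 6 variables we want to cluster.
x <- matrix(rnorm(50 * 6), nrow = 50,
            dimnames = list(NULL, paste0("v", 1:6)))

# Assumption: dissimilarity = 1 - |correlation|, so strongly correlated
# (or anti-correlated) variables are treated as close together.
r <- cor(x)
d <- as.dist(1 - abs(r))

# Hierarchical clustering needs only the dissimilarities, not coordinates.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Note that `hclust` never asks where the points sit in space. It works from the dissimilarity matrix alone, which is why it remains usable when no Euclidean representation is available.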

A quick check to see if your measure is ever going to be representable as Euclidean distance is: does it satisfy the triangle inequality? $d(x,z) \le d(x,y) + d(y,z)$, where $d(x,y)$ is the distance between x and y.
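The check can be run by brute force over every triple of points. The sketch below is one way to do it in base R; `satisfies_triangle` is a hypothetical helper name of my own, not a function from any package, and the `bad` matrix is invented to show a failure. Bear in mind that the triangle inequality is only a necessary condition: a measure that fails it cannot be Euclidean, but passing it does not guarantee that it is.

```r
# Returns TRUE if d[i, k] <= d[i, j] + d[j, k] for every triple (i, j, k).
# `satisfies_triangle` is a hypothetical helper, not part of any package.
satisfies_triangle <- function(d, tol = 1e-12) {
  d <- as.matrix(d)
  n <- nrow(d)
  for (j in seq_len(n)) {
    # via_j[i, k] = d[i, j] + d[j, k], the length of the detour through j
    via_j <- outer(d[, j], d[j, ], "+")
    if (any(d > via_j + tol)) return(FALSE)
  }
  TRUE
}

# Ordinary Euclidean distances always pass (a 3-4-5 triangle here).
pts <- matrix(c(0, 0, 3, 0, 0, 4), ncol = 2, byrow = TRUE)
ok_result <- satisfies_triangle(dist(pts))
print(ok_result)   # TRUE

# A made-up dissimilarity matrix that fails: going from 1 to 3 directly (10)
# is longer than the detour through 2 (1 + 1).
bad <- matrix(c( 0, 1, 10,
                 1, 0,  1,
                10, 1,  0), nrow = 3, byrow = TRUE)
bad_result <- satisfies_triangle(bad)
print(bad_result)  # FALSE
```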

For example, love and hate (assuming they are quantifiable) do not satisfy the triangle inequality. If x loves y and y loves z, this places no constraint on the degree to which x loves z. It could easily be more than the sum of the love of x for y and y for z!

Continue reading at my new professional blog coppelia.io