
Agglomerative hierarchical clustering is a simple, intuitive and well-understood method for clustering data points. I used it with good results in a project to estimate the true geographical position of objects based on measured estimates. In this tutorial I would like to describe the basics of the method, how to implement it in R with hclust, and some ideas on how to decide where to cut the tree. This was also a great opportunity to compose another Shiny/D3.js app (GitHub, shinyapps.io) – something I had wanted to do for a while. At the end of the text I write a bit about what I learned in that regard.

# Agglomerative Hierarchical Clustering with Euclidean Metric

Let’s assume eight points:

> pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
> plot(pts, pch=16, cex=2, xaxt="n", yaxt="n", xlim=c(0,5), ylim=c(0,6))
> axis(1, at=0:5)
> axis(2, at=0:6)

The decision which points to agglomerate into a cluster is based on the distances between those points. Which kind of distance to choose depends on what the data points represent. For this example I will use the Euclidean distance. The need to calculate distances for all possible pairs of n points, and to repeat this calculation for every intermediate cluster, places this family of algorithms somewhere between O(n²) and O(n³) and makes it potentially slow for very large data sets. But obviously this computation is parallelizable. So let's start by computing the distance matrix, which gives us the mutual Euclidean distances.

> d <- dist(pts, method="euclidean", upper=TRUE)
> round(d,2)

     1    2    3    4    5    6    7    8
1         1.00 3.16 3.61 4.12 4.47 3.00 3.16
2    1.00      2.24 2.83 3.16 3.61 3.16 3.00
3    3.16 2.24      1.00 1.00 1.41 3.61 2.83
4    3.61 2.83 1.00      1.41 1.00 3.16 2.24
5    4.12 3.16 1.00 1.41      1.00 4.47 3.61
6    4.47 3.61 1.41 1.00 1.00      4.12 3.16
7    3.00 3.16 3.61 3.16 4.47 4.12      1.00
8    3.16 3.00 2.83 2.24 3.61 3.16 1.00
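Each entry of the matrix can be checked by hand – a quick sketch, using the same points as above, for the distance between point 1 at (1,1) and point 3 at (2,4):

```r
pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
d <- as.matrix(dist(pts, method="euclidean"))

# Euclidean distance between point 1 (1,1) and point 3 (2,4), by hand:
manual <- sqrt((1-2)^2 + (1-4)^2)  # sqrt(10)

round(manual, 2)  # 3.16 - the same value as entry [1,3] in the matrix above
```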

Now agglomerative hierarchical clustering will combine the two closest clusters at each step until there is nothing left to combine. In the beginning every single point is considered a cluster by itself. So in the above case, for the first step, any of six possible cluster combinations with the minimal available distance of 1 may be chosen. Of course Euclidean distance is only one option among several others, like squared Euclidean or Manhattan distance. Another decision you have to make is how to choose the two points in two clusters between which to calculate the distance.

In the example to the right, using Euclidean distance and single linkage clustering, the distance between the red and the blue cluster would be 2 – the minimal distance over all combinations of red and blue points. Complete linkage, on the other hand, measures the distance of two clusters as the maximum distance between two points – which in this case is 4.
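The red and blue clusters refer to a figure from the original post; the same idea can be verified with made-up coordinates (my own choice, picked so that the minimal and maximal cross-distances come out as 2 and 4):

```r
# hypothetical red and blue clusters (coordinates are my own choice)
red  <- rbind(c(0,0), c(1,0))
blue <- rbind(c(3,0), c(4,0))

# all pairwise distances between a red point and a blue point
cross <- apply(red, 1, function(r)
  apply(blue, 1, function(b) sqrt(sum((r-b)^2))))

min(cross)  # single linkage distance:   2
max(cross)  # complete linkage distance: 4
```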

This is why the initial example leads to different trees for different clustering methods:

pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
d <- dist(pts, method="euclidean")
par(mfrow=c(1,2))
plot(hclust(d,method="single"))
plot(hclust(d,method="complete"))

# Inferring Clusters from a Tree

For complete linkage you can get either 8, 4, 3, 2 or 1 clusters, depending on where you cut the tree. If you cut at height 4 you get two clusters – if you cut at 2 you'll get three clusters. A human being would most likely see three clusters in the above example. Why? Because the pattern invites the viewer to sort the observable distances into two categories – relatively close and relatively distant. Actually this reasoning can easily be inferred from the trees above. The branching height is the distance between the two clusters that the branching joins. With complete linkage, the clusters {3,4} and {5,6} have a distance of about 1.4 (actually it is sqrt(2)), and {1,2} and {7,8} a distance of about 3.2 (actually it is sqrt(1² + 3²) = sqrt(10)). And if you think of where to cut off the tree, the most sensible height would be between the two most distant branchings: above it you find the "relatively distant" distances, below it the "relatively short" ones.

A way to visualize this idea is by drawing a histogram or a density of the heights present in the tree. This density is displayed under "density of branching heights" in the screenshot above. Something around the local minimum between the two modes would be a reasonable choice for a cutting height.
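In base R such a density can be sketched directly from the merge heights that hclust returns (the screenshot itself comes from the Shiny app, so this is only an approximation of that panel):

```r
pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
h <- hclust(dist(pts, method="euclidean"), method="complete")

# density of the 7 branching heights of the tree
plot(density(h$height), main="density of branching heights")
rug(h$height)  # mark the actual heights on the axis
```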

This heuristic might already be sufficient, but of course more complex techniques are conceivable. One might differentiate between different branchings at a high level or take the branching density into account. The latter idea is depicted under "num of clusters (y) when cutting at height (x)". You would want to cut the tree at a point below which a lot of branchings suddenly take place.
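The step function from that panel can be approximated by evaluating cutree over a grid of candidate cutting heights:

```r
pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
h <- hclust(dist(pts, method="euclidean"), method="complete")

# number of clusters resulting from each candidate cutting height
heights <- seq(0, max(h$height), length.out=100)
n_clusters <- sapply(heights, function(x) max(cutree(h, h=x)))

plot(heights, n_clusters, type="s",
     xlab="cutting at height (x)", ylab="num of clusters (y)")
```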

# Implementation of the Maximum Gap Heuristic

And this is how a tree cut somewhere within the maximum gap between branching heights might be implemented in R.

> pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
> d <- dist(pts, method="euclidean")
> h <- hclust(d, method="complete")
> h$height
[1] 1.000000 1.000000 1.000000 1.000000 1.414214 3.162278 4.472136
>
> i <- which.max(diff(h$height))
> cut_height <- (h$height[i] + h$height[i+1])/2
> cut_height
[1] 2.288246
>
> clusters <- stats::cutree(h, h=cut_height)
> clusters
[1] 1 1 2 2 2 2 3 3
>
> par(mfrow=c(1,2))
> plot(h)
> abline(h=cut_height, col="red", lty=2)
> plot(pts, col=c("red","green","blue")[clusters], pch=16)

And this is what it would look like:

BTW – in case you want to decorate your dendrogram with those fancy dashed boxes to indicate the clustering then you have to look into Tal Galili’s dendextend package. It’s like a beauty parlor for dendrograms.
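If you only need the plain rectangles, base R's stats::rect.hclust already draws them on top of a dendrogram – dendextend is the way to go for anything fancier:

```r
pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
h <- hclust(dist(pts, method="euclidean"), method="complete")

plot(h)
# draw boxes around the three clusters found earlier;
# rect.hclust invisibly returns the members of each box
boxes <- rect.hclust(h, k=3, border="red")
```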

# Shiny and D3.js

If shinyapps.io is too slow for your liking you can also launch the app right from its GitHub repository by:

shiny::runGitHub("hclust-shiny", "joyofdata")

I really have to say Shiny is an awesome product in pretty much every regard. It is super-intuitive and easy to get started with – the reactivity framework feels natural and allows you to implement client-server interaction in a very straightforward way. Definitely noteworthy is the awesome documentation, providing bite-sized introductions to the various aspects of Shiny, featuring great examples.

I realized the integration of Shiny with D3.js by having D3.js write its state changes into a hidden text input and then trigger a change event. Less straightforward seems to be influencing client-side JavaScript code from server-side Shiny. It seems this is done by injecting JavaScript into an htmlOutput() section. But maybe there is a better method – for this app I only had to inject CSS to color the clusters – so I didn't put too much effort into solving this problem.

What I do not like so much is the layout framework. Maybe I just didn't use it optimally, but it seems to use screen real estate rather lavishly. Given that the perceived coolness of an app is proportional to the number of plots and charts placed on the screen – you know, cockpit-dashboard style – effective layouting is something to consider. But Shiny also lets you tailor-make the app interface by simply writing the HTML, CSS and JavaScript yourself. So it's not really a problem anyway.

Something else to keep in mind is that the reactivity framework makes it easy to program the client-server communication, but it also makes it very easy to end up with spaghetti-like entanglements of reactive inputs and outputs. I guess for a larger app I would have to put some thought into design patterns to keep the mess at bay.

(original article published on www.joyofdata.de)