Homogeneity analysis of hierarchical classifications

August 10, 2010

(This article was first published on Last Resort Software, and kindly contributed to R-bloggers)

I’ve spent more years than I care to remember analysing vegetation survey data (typically species abundances in plots) using a variety of software including my own algorithms coded in FORTRAN and C++. A recent query on the r-help list, about how to determine the number of groups to define in a hierarchical classification produced with the hclust function, prompted me to unearth one of these algorithms, homogeneity analysis1, which can help to visualize how different levels of grouping partition the variability in a distance matrix.

This algorithm is extremely simple. The classification is progressively divided into groups, with all groups being defined at the same dendrogram height. At each level of grouping, the average of within-group pairwise distances is calculated. Homogeneity is then defined as:

H = 1 - Davwithin-group - Davtotal

where Davtotal is the average pairwise distance in the dataset as a whole.

For data were there exist well-defined clusters of values, a plot of homogeneity against number of groups will display an ‘elbow’ where the initial rapid increase in homogeneity turns to a more gradual increase. The example above shows a classification of the USArrests dataset and corresponding homogeneity plot which suggests defining 7 groups. It was generated as follows:

> d <- dist(USArrests)
> hc <- hclust(d, method="average")
> h <- hmg(d, hc)
> plot(h, type="s", main="USArrests homogeneity")
> abline(v=7, lty=2)
> plot(hc, labels=FALSE, main="USArrests dendrogram")
> rect.hclust(hc, k=7)

Here is the code for the hmg function:

function(d, hc) {

# R version of homogeneity analysis as described in:
# Bedward, Pressey and Keith. 1992. Homogeneity analysis: assessing the
# utility of classifications and maps of natural resources
# Australian Journal of Ecology 17:133-139.
# Arguments:
# d - either an object produced by dist() or a vector of
# pairwise dissimilarity values ordered in the manner of
# a dist result
# hc - classification produced by hclust()
# Value:
# A two column matrix of number of groups and corresponding homogeneity value
# This code by Michael Bedward, 2010

m <- hc$merge

# check args for consistency
Nobj <- nrow(m) + 1
if (length(d) != Nobj * (Nobj-1) / 2) {
dname <- dparse(substitute(d))
hcname <- dparse(substitute(hc))
stop(paste(dname, "and", hcname, "refer to different numbers of objects"))

distMean <- mean(d)
grp <- matrix(c(1:Nobj, rep(0, Nobj)), ncol=2)

# helper function - decodes values in the merge matrix
# from hclust
getGrp <- function( idx ) {
ifelse( idx < 0, grp[-idx], getGrp( m[idx, 1] ) )

grpDistTotal <- numeric(Nobj)
grpDistCount <- numeric(Nobj)
hmg <- numeric(Nobj)
hmg[1] <- 0
hmg[Nobj] <- 1

distTotal <- 0
distCount <- 0
for (step in 1:(Nobj-2)) {
mrec <- m[step, ]
grp1 <- getGrp(mrec[1])
grp2 <- getGrp(mrec[2])

distTotal <- distTotal - grpDistTotal[grp1] - grpDistTotal[grp2]
distCount <- distCount - grpDistCount[grp1] - grpDistCount[grp2]

grpDistTotal[grp1] <- grpDistTotal[grp1] + grpDistTotal[grp2]
grpDistCount[grp1] <- grpDistCount[grp1] + grpDistCount[grp2]

grp1Obj <- which(grp == grp1)
grp2Obj <- which(grp == grp2)

for (i in grp1Obj) {
for (j in grp2Obj) {
ii <- min(i,j)
jj <- max(i,j)
distIdx <- Nobj*(ii-1) - ii*(ii-1)/2 + jj-ii
grpDistTotal[grp1] <- grpDistTotal[grp1] + d[distIdx]
grpDistCount[grp1] <- grpDistCount[grp1] + 1

# update group vector
grp[grp == grp2] <- grp1

distTotal <- distTotal + grpDistTotal[grp1]
distCount <- distCount + grpDistCount[grp1]

hmg[Nobj - step] <- 1 - distTotal / ( distCount * distMean )

out <- matrix(c(1:Nobj, hmg), ncol=2)
colnames(out) <- c("ngroups", "homogeneity")

1M. Bedward, D. A. Keith, R. L. Pressey. 1992. Homogeneity analysis: Assessing the utility of classifications and maps of natural resources. Australian Journal of Ecology 17(2) 133-139.

To leave a comment for the author, please follow the link and comment on their blog: Last Resort Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)