Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Illustration from Project Gutenberg

The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method.

An important question when evaluating clusters is whether a given cluster is “real”-does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? This is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters a priori. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.

One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The `fpc `package has a function called `clusterboot()`that uses bootstrap resampling to evaluate how stable a given cluster is. (For a full description of the algorithm, see Christian Henning, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006). `clusterboot()` is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both `hclust `and `kmeans`.

`clusterboot`‘s algorithm uses the Jaccard coefficient, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. The basic general strategy is as follows:

1. Cluster the data as usual.
2. Draw a new dataset (of the same size as the original) by resampling the original dataset with replacement (meaning that some of the data points may show up more than once, and others not at all). Cluster the new dataset.
3. For every cluster in the original clustering, find the most similar cluster in the new clustering (the one that gives the maximum Jaccard coefficient) and record that value. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be dissolved-it didn’t show up in the new clustering. A cluster that’s dissolved too often is probably not a “real” cluster.
4. Repeat steps 2-3 several times.

The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).

Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so `clusterboot()` is also measuring how stable the clustering algorithm is.

The protein dataset

To demonstrate `clusterboot()`, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe. The original dataset can be found here. A tab-separated text file with the data can be found in this directory. The data file is called `protein.txt`; additional information can be found in the file `protein_README.txt.`

The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called `protein`, as shown in the next listing.

```protein <- read.table("protein.txt", sep="t", header=TRUE)
summary(protein)
#           Country      RedMeat         WhiteMeat           Eggs
# Albania       : 1   Min.   : 4.400   Min.   : 1.400   Min.   :0.500
# Austria       : 1   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700
# Belgium       : 1   Median : 9.500   Median : 7.800   Median :2.900
# Bulgaria      : 1   Mean   : 9.828   Mean   : 7.896   Mean   :2.936
# Czechoslovakia: 1   3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700
# Denmark       : 1   Max.   :18.000   Max.   :14.000   Max.   :4.700
# (Other)       :19
#      Milk            Fish           Cereals          Starch
# Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600
# 1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100
# Median :17.60   Median : 3.400   Median :28.00   Median :4.700
# Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276
# 3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700
# Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500
#
#      Nuts           Fr.Veg
# Min.   :0.700   Min.   :1.400
# 1st Qu.:1.500   1st Qu.:2.900
# Median :2.400   Median :3.800
# Mean   :3.072   Mean   :4.136
# 3rd Qu.:4.700   3rd Qu.:4.900
# Max.   :7.800   Max.   :7.900

#   Use all the columns except the first
#   (Country).
vars.to.use <- colnames(protein)[-1]

# Scale the data columns to be zero mean
# and unit variance.
# The output of scale() is a matricx.
pmatrix <- scale(protein[,vars.to.use])

# optionally, store the centers and
# standard deviations of the original data,
# so you can "unscale" it later.
pcenter <- attr(pmatrix, "scaled:center")
pscale <- attr(pmatrix, "scaled:scale")
```

Before running `clusterboot()` we’ll cluster the data using a hierarchical clustering algorithm (Ward’s method):

```#   Create the distance matrix.
d <- dist(pmatrix, method="euclidean")

#   Do the clustering.
pfit <- hclust(d, method="ward")

#   Plot the dendrogram.
plot(pfit, labels=protein\$Country)
```

The dendrogram suggests five clusters, as shown in the figure below. You can draw the rectangles on the dendrogram using the command `rect.hclust(pfit, k=5)`.

Let’s extract and print the clusters:

```#   A convenience function for printing out the
#   countries in each cluster, along with the values
#   for red meat, fish, and fruit/vegetable
#   consumption.
print_clusters <- function(labels, k) {
for(i in 1:k) {
print(paste("cluster", i))
print(protein[labels==i,c("Country","RedMeat","Fish","Fr.Veg")])
}
}

# get the cluster labels
groups <- cutree(pfit, k=5)

# --- results --

> print_clusters(groups, 5)
[1] "cluster 1"
Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2
```

There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see that

• Cluster 2 is made of countries with higher-than-average red meat consumption.
• Cluster 4 contains countries with higher-than-average fish consumption but low produce consumption.
• Cluster 5 contains countries with high fish and produce consumption.

Let’s run `clusterboot()` on the protein data, using hierarchical clustering with five clusters.

```# load the fpc package
library(fpc)

# set the desired number of clusters
kbest.p<-5

#   Run clusterboot() with hclust
#   ('clustermethod=hclustCBI') using Ward's method
#   ('method="ward"') and kbest.p clusters
#   ('k=kbest.p'). Return the results in an object
#   called cboot.hclust.
cboot.hclust <- clusterboot(pmatrix,clustermethod=hclustCBI,
method="ward", k=kbest.p)

#   The results of the clustering are in
#   cboot.hclust\$result. The output of the hclust()
#   function is in cboot.hclust\$result\$result.
#
#   cboot.hclust\$result\$partition returns a
#   vector of clusterlabels.
groups<-cboot.hclust\$result\$partition

# -- results --

> print_clusters(groups, kbest.p)
[1] "cluster 1"
Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2

# The vector of cluster stabilities.
# Values close to 1 indicate stable clusters
> cboot.hclust\$bootmean
[1] 0.7905000 0.7990913 0.6173056 0.9312857 0.7560000

# The count of how many times each cluster was
# dissolved. By default clusterboot() runs 100
# bootstrap iterations.
# Clusters that are dissolved often are unstable.
> cboot.hclust\$bootbrd
[1] 25 11 47  8 35
```

The above results show that the cluster of countries with high fish consumption (cluster 4) is highly stable (cluster stability = 0.93). Clusters 1 and 2 are also quite stable; cluster 5 (cluster stability 0.76) less so. Cluster 3 (cluster stability 0.62) has the characteristics of what we’ve been calling the “other” cluster. Note that `clusterboot()` has a random component, so the exact stability values and number of times each cluster is dissolved will vary from run to run.

Based on these results, we can say that the countries in cluster 4 have highly similar eating habits, distinct from those of the other countries (high fish and red meat consumption, with a relatively low amount of fruits and vegetables); we can also say that the countries in clusters 1 and 2 represent distinct eating patterns as well. The countries in cluster 3, on the other hand, show eating patterns that are different from those of the countries in other clusters, but aren’t as strongly similar to each other.

The `clusterboot()` algorithm assumes that you have the number of clusters, k. Obviously, determining k will be harder for datasets that are larger than our protein example. There are ways of estimating k, but they are beyond the scope of this article. Once you have an idea of the number of clusters, however, `clusterboot()` is a useful tool for evaluating the strength of the patterns that you have discovered.

For more about clustering, please refer to our free sample chapter 8 of Practical Data Science with R.