Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Clustering algorithms are used to split a dataset into several groups (i.e clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

The most popular clustering algorithms are:

However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.

In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.

We’ll start by demonstrating why we should combine k-means and hierarcical clustering. An application is provided using R software.

Finally, we’ll provide an easy to use R function (in factoextra package) for computing hybrid hierachical k-means clustering.

# 2 Required R packages

We’ll use the R package factoextra which is very helpful for simplifying clustering workflows and for visualizing clusters using ggplot2 plotting system

Install factoextra package as follow:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

library(factoextra)

# 3 Data preparation

We’ll use USArrest dataset and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.

# 4 R function for clustering analyses

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and, it can be used for computing hierarchical clustering and partititioning clustering in a single line function call.

## 4.1 Example of k-means clustering

We’ll split the data into 4 clusters using k-means clustering as follow:

library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15) ## Alabama Alaska Arizona Arkansas California Colorado ## 4 3 3 4 3 3 ## Connecticut Delaware Florida Georgia Hawaii Idaho ## 2 2 3 4 2 1 ## Illinois Indiana Iowa ## 3 2 1 # Visualize k-means clusters fviz_cluster(km.res, frame.type = "norm", frame.level = 0.68) # Visualize the silhouette of clusters fviz_silhouette(km.res) ## cluster size ave.sil.width ## 1 1 13 0.37 ## 2 2 16 0.34 ## 3 3 13 0.27 ## 4 4 8 0.39 Note that, silhouette coefficient measures how well an observation is clustered and it estimates the average distance between clusters (i.e, the average silhouette width). Observations with negative silhouette are probably placed in the wrong cluster. Read more here: cluster validation statistics Samples with negative silhouette coefficient: # Silhouette width of observation sil <- km.res$silinfo$widths[, 1:3] # Objects with negative silhouette neg_sil_index <- which(sil[, 'sil_width'] < 0) sil[neg_sil_index, , drop = FALSE] ## cluster neighbor sil_width ## Missouri 3 2 -0.07318144 Read more about k-means clustering: K-means clustering ## 4.2 Example of hierarchical clustering # Enhanced hierarchical clustering res.hc <- eclust(df, "hclust", k = 4, method = "ward.D2", graph = FALSE) head(res.hc$cluster, 15)
##           1           2           2           3           2           2
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho
##           3           3           2           1           3           4
##    Illinois     Indiana        Iowa
##           2           3           4
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5) 

# Visualize the silhouette of clusters
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1    7          0.46
## 2       2   12          0.29
## 3       3   19          0.26
## 4       4   12          0.43

It can be seen that three samples have negative silhouette coefficient indicating that they are not in the right cluster. These samples are:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Kentucky       3        4 -0.06459230
## Arkansas       3        1 -0.08467352

# 5 Combining hierarchical clustering and k-means

## 5.1 Why?

Recall that, in k-means algorithm, a random set of observations are chosen as the initial centers.

The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.

To avoid this, a solution is to use an hybrid approach by combining the hierarchical clustering and the k-means methods. This process is named hybrid hierarchical k-means clustering (hkmeans).

## 5.2 How ?

The procedure is as follow:

1. Compute hierarchical clustering and cut the tree into k-clusters
2. compute the center (i.e the mean) of each cluster
3. Compute k-means by using the set of cluster centers (defined in step 3) as the initial cluster centers

Note that, k-means algorithm will improve the initial partitioning generated at the step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in the step 4.

## 5.3 R codes

### 5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:

res.hc <- eclust(df, "hclust", k = 4,
method = "ward.D2", graph = FALSE)
# km.res2$cluster: Final clusters defined using k-means table(km.res2$cluster, res.hc$cluster) ## ## 1 2 3 4 ## 1 7 0 1 0 ## 2 0 12 1 0 ## 3 0 0 16 0 ## 4 0 0 1 12 It can be seen that, 3 of the observations defined as belonging to cluster 3 by hierarchical clustering has been reclassified to cluster 1, 2, and 4 in the final solution defined by k-means clustering. The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters: fviz_dend(res.hc, k = 4, k_colors = c("black", "red", "blue", "green3"), label_cols = km.res2$cluster[res.hc$order], cex = 0.6) It can be seen that the hierarchical clustering result has been improved by the k-means algorithm. ### 5.3.5 Compare the results of standard k-means clustering and hybrid approach # Final clusters defined using hierarchical k-means clustering km.clust <- km.res$cluster

# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)

# comparison
table(km.clust, res.km\$cluster)
##
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8

In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.

## 5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering

The function hkmeans() [in factoextra] can be used to compute easily the hybrid approach of k-means on hierarchical clustering. The format of the result is similar to the one provided by the standard kmeans() function.

# Compute hierarchical k-means clustering
res.hk <-hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"
##  [5] "tot.withinss" "betweenss"    "size"         "iter"
##  [9] "ifault"       "data"         "hclust"
# Print the results
res.hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              1              2              2              1              2
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              2              3              3              2              1
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              3              4              2              3              4
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              3              4              1              4              2
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              3              2              4              1              2
##              4              4              2              4              3
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              2              2              1              4              3
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              3              3              3              3              1
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              4              1              2              3              4
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              3              3              4              4              3
##
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
##
## Available components:
##
##  [1] "cluster"      "centers"      "totss"        "withinss"
##  [5] "tot.withinss" "betweenss"    "size"         "iter"
##  [9] "ifault"       "data"         "hclust"
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)

# Visualize the hkmeans final clusters
fviz_cluster(res.hk, frame.type = "norm", frame.level = 0.68)

# 6 Infos

This analysis has been performed using R software (ver. 3.2.4)