Overview of clustering methods in R
Clustering is a very popular technique in data science because it is unsupervised – we don’t need the true group labels of the data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic as well as real datasets.
What is clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It is a technique of unsupervised learning, so clustering is used when no a priori information about data is available. This makes clustering a very strong technique for gaining insights into data and making more accurate decisions.
What is it good for?
Clustering is used:
- To gain insight into data, generate hypotheses, detect anomalies, and identify salient features,
- To identify the degree of similarity among objects (e.g. organisms),
- As a method for organizing the data and summarising it through cluster prototypes (compression).
Classification to groups
The first use case is grouping data, i.e. classifying them into groups. For illustration purposes, I will generate synthetic data from three normal distributions plus three outliers (anomalies). Let’s load the needed packages, randomly generate some data, and show the first use case in a visualization:
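The original snippet is not shown here, so the following is a minimal sketch; the concrete means, standard deviations, and outlier positions are my assumptions, not the post’s exact values.

```r
set.seed(42)

# Three Gaussian groups (50 points each) plus three outliers -
# all generating parameters here are illustrative assumptions.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)
colnames(data) <- c("x", "y")

plot(data, pch = 19)
```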
We can see three nicely divided groups of data.
Anomaly detection
Clustering can also be used as an anomaly detection technique: some clustering methods can automatically detect outliers (anomalies). Let’s show visually what it looks like.
Data compression
In an era of large amounts of data (the often-used buzzword: big data), we have problems processing it in real time. Here, clustering can help reduce the size of the data through its compression feature. A created cluster that incorporates multiple points (data) can be replaced by its representative (prototype), i.e. a single point. In this way, points are replaced by their cluster representatives (“+”):
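The compression idea can be sketched as follows; the synthetic data below is an assumed stand-in for the post’s dataset, and K-means is used here only as one possible way to build the prototypes.

```r
set.seed(42)
# Assumed synthetic data: three Gaussian groups of 50 points each.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3))
)

# Compression: represent every point by its cluster prototype.
km <- kmeans(data, centers = 3)
compressed <- km$centers[km$cluster, ]  # 150 points -> 3 distinct prototypes

plot(data, pch = 19, col = "grey")
points(km$centers, pch = "+", cex = 3)
```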
Types of clustering methods
Since cluster analysis has been around for more than 50 years, a large number of methods are available. The basic classification of clustering methods is based on the objective at which they aim: hierarchical and non-hierarchical.
Hierarchical clustering is a multilevel partition of a dataset. It has two approaches to the data. The first one, divisive clustering, starts with one big cluster that is then divided into smaller clusters. The second one, agglomerative clustering, starts with individual objects as single-element clusters, which are then gradually merged. The whole process of hierarchical clustering can be expressed (visualized) as a dendrogram.
Non-hierarchical clustering divides a dataset into a system of disjoint subsets (clusters) such that the intersection of any two clusters is the empty set.
Clustering methods can be also divided in more detail based on the processes in the method (algorithm) itself:
Non-hierarchical:
- Centroid-based
- Model-based
- Density-based
- Grid-based

Hierarchical:
- Agglomerative
- Divisive
But which to choose in your use case? Let’s dive deeper into the most known methods and discuss their advantages and disadvantages.
Centroid-based
The most basic (maybe just for me) type of clustering method is centroid-based. This type of clustering creates prototypes of clusters: centroids or medoids.
The best-known methods are:
- K-means
- K-medians
- K-medoids
- K-modes
K-means
Steps:
1. Create K random clusters (and compute their centroids).
2. Assign points to the nearest centroids.
3. Update the centroids.
4. Repeat from step 2 while the centroids keep changing.
Pros and cons:
- [+] Fast to compute. Easy to understand.
- [-] Various initial clusters can lead to different final clusterings.
- [-] Scale-dependent.
- [-] Creates only convex (spherical) cluster shapes.
- [-] Sensitive to outliers.
K-means - example
It is very easy to try K-means in R (via the kmeans function); the only required parameter is the number of clusters.
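A minimal sketch (the post’s exact code is not preserved; the synthetic data below, with assumed parameters, mimics the three groups plus three outliers):

```r
set.seed(42)
# Assumed synthetic data: three Gaussian groups plus three outliers.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)

# Base R kmeans; only the number of clusters is required.
km <- kmeans(data, centers = 3)

# Colour points by cluster and mark the centroids.
plot(data, col = km$cluster, pch = 19)
points(km$centers, pch = "+", cex = 3)
```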
We can see an example of when K-means fails most often: when there are outliers in the dataset.
K-medoids
K-medoids solves the problem with outliers because its prototypes are medoids, i.e. actual members of the dataset rather than artificially created centroids, which helps to tackle outliers.
Pros and cons:
- [+] Easy to understand.
- [+] Less sensitive to outliers.
- [+] Possibility to use any distance measure.
- [-] Various initial clusters can lead to different final clusterings.
- [-] Scale-dependent.
- [-] Slower than K-means.
K-medoids - example
The K-medoids problem can be solved by the Partitioning Around Medoids (PAM) algorithm (the pam function in the cluster package).
We can see that medoids stayed nicely in the three main groups of data.
The determination of the number of clusters
The disadvantage of centroid-based methods is that the number of clusters must be known in advance (it is a parameter of these methods). However, we can determine the number of clusters by internal validation (indexes). The basic procedure is to compute some internal validation index for many values of \( K \) and choose the \( K \) with the best index value.
There are many such indexes:
- Silhouette
- Davies-Bouldin index
- Dunn index
- etc.
However, every index has a similar characteristic:
\[\frac{\text{within-cluster similarity}}{\text{between-cluster similarity}},\]
so it is the ratio of the average distances within clusters and between clusters.
Elbow diagram
The Elbow diagram is a simple method (rule) for determining the number of clusters: we compute the internal index for a set of values of \( K \) and choose the \( K \) where the positive change is largest.
So, for example, I chose the Davies-Bouldin index implemented in the clusterCrit package. For our simple dataset, I will generate clusterings with 2-6 clusters and compute the index.
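A sketch of this loop (the synthetic data with assumed parameters stands in for the post’s dataset; lower Davies-Bouldin values are better):

```r
library(clusterCrit)  # intCriteria()

set.seed(42)
# Assumed synthetic data: three Gaussian groups plus three outliers.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)

# Davies-Bouldin index for K = 2..6 (lower is better).
ks <- 2:6
db <- sapply(ks, function(k) {
  km <- kmeans(data, centers = k, nstart = 10)
  intCriteria(data, as.integer(km$cluster), "Davies_Bouldin")$davies_bouldin
})

plot(ks, db, type = "b", xlab = "Number of clusters K",
     ylab = "Davies-Bouldin index")
```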
We can see that the Elbow diagram rule chose 4 clusters, which actually makes sense to me…
We can also try it with PAM (K-medoids).
It is the same result.
Model-based
Model-based clustering methods are based on some probabilistic distribution. It can be:
- Gaussian (normal) distribution
- Gamma distribution
- Student’s t-distribution
- Poisson distribution
- etc.
Since we cluster multivariate data, model-based clustering uses multivariate distributions and so-called mixture models (mixture components correspond to clusters). When clustering with the Gaussian (normal) distribution, we are using the theory of Gaussian Mixture Models (GMM).
GMM
The target is to maximize the likelihood: \( L(\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_k \mid \boldsymbol{x}_1, \dots, \boldsymbol{x}_n). \) Here, a cluster is represented by its mean \( \boldsymbol{\mu} \) and covariance matrix \( \boldsymbol{\Sigma} \), so not just a centroid as in the case of K-means.
This optimization problem is typically solved by the EM (Expectation-Maximization) algorithm.
Pros and cons:
- [+] Ellipsoidal clusters,
- [+] Can be parameterized by the covariance matrix,
- [+] Scale-independent,
- [-] Very slow for high-dimensional data,
- [-] Can be difficult to understand.
The EM algorithm with GMM is implemented in the mclust package. You can optimize various shapes of mixtures (clusters) via the modelNames parameter (check ?mclustModelNames for more details).
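A minimal sketch of fitting a GMM with mclust (the synthetic data with assumed parameters stands in for the post’s dataset; fixing G = 3 is an illustrative choice):

```r
library(mclust)  # Mclust()

set.seed(42)
# Assumed synthetic data: three Gaussian groups plus three outliers.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)

# Fit a 3-component Gaussian mixture via EM.
gmm <- Mclust(data, G = 3)

plot(gmm, what = "classification")
```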
A pretty interesting red ellipse was created, but in general the clustering is OK.
BIC
The Bayesian Information Criterion (BIC) can be used with model-based clustering to choose the optimal number of clusters.
In the mclust package, you can simply pass multiple modelNames, and the best one is chosen by BIC.
We can also try to vary the parameterization of the covariance matrix \( \boldsymbol{\Sigma} \).
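A sketch of such a model search (the data-generating parameters and the candidate modelNames are my assumptions, not the post’s exact call):

```r
library(mclust)  # Mclust()

set.seed(42)
# Assumed synthetic data: three Gaussian groups plus three outliers.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)

# Let BIC pick both the number of clusters and the covariance
# parameterization; the candidate modelNames here are an assumption.
gmm <- Mclust(data, G = 1:9, modelNames = c("EII", "VII", "VVI", "VVV"))

summary(gmm)
plot(gmm, what = "BIC")
```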
The result:
So, the methodology chose 6 clusters: the 3 main groups of data, plus the 3 anomalies in separate clusters.
Density-based
Density-based clusters are maximally connected components of the set of points that lie within some defined distance of a core object.
Methods:
DBSCAN
In the well-known method DBSCAN, density is defined via neighborhoods: points have to be reachable within a defined distance (the \( \epsilon \) distance, the first parameter of the method), and clusters must contain at least some minimal number of points (the second parameter of the method). Points that are not connected to any cluster or do not pass the minimal-points criterion are marked as noise (outliers).
Pros and cons:
- [+] Extracts outliers automatically,
- [+] Fast to compute,
- [+] Can find clusters of arbitrary shapes,
- [+] The number of clusters is determined automatically from the data,
- [-] Parameters (\( \epsilon \), minPts) must be set by a practitioner,
- [-] Possible problems with neighborhoods: separate clusters can become connected.
DBSCAN is implemented in the function and package of the same name, so let’s try it.
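A minimal sketch (the synthetic data with assumed parameters stands in for the post’s dataset; the eps and minPts values are illustrative and must be tuned for your data):

```r
library(dbscan)  # dbscan()

set.seed(42)
# Assumed synthetic data: three Gaussian groups plus three outliers.
data <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 2, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 1, 0.3), rnorm(50, 2, 0.3)),
  matrix(c(-1.5, 3.5, 3.5, 3.0, 1.0, -1.5), ncol = 2, byrow = TRUE)
)

# eps and minPts are illustrative choices, not the post's values.
db <- dbscan(data, eps = 0.4, minPts = 5)

# Cluster 0 marks noise (outliers).
plot(data, col = db$cluster + 1, pch = 19)
```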
We can see that DBSCAN found 3 clusters and 3 outliers correctly when parameters are wisely chosen.
Bananas - DBSCAN result
To demonstrate the strength of DBSCAN, researchers have created many artificial datasets, which are often called bananas.
Bananas - K-means result
K-means is obviously not a good choice here… but these datasets are also far from real-world ones.
Spectral clustering
Spectral clustering methods are based on the spectral decomposition of the data, i.e. on eigenvectors and eigenvalues.
Steps:
- \( N \) = number of data points, \( d \) = dimension of the data,
- \( \mathbf{A} \) = affinity matrix, \( A_{ij} = \exp(-\| x_i - x_j \|^2 / (2\sigma^2)) \), an \( N \times N \) matrix,
- \( \mathbf{D} \) = diagonal matrix whose \( (i,i) \)-th element is the sum of the \( i \)-th row of \( \mathbf{A} \), an \( N \times N \) matrix,
- \( \mathbf{L} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} \), an \( N \times N \) matrix,
- \( \mathbf{X} \) = matrix of the \( k \) largest eigenvectors of \( \mathbf{L} \), an \( N \times k \) matrix,
- Renormalize each row of \( \mathbf{X} \) to have unit length, an \( N \times k \) matrix,
- Run the K-means algorithm on the rows of \( \mathbf{X} \).
Typical use case for spectral clustering
We will try spectral clustering on the Spirals artificial dataset.
Spectral clustering is implemented in the kernlab package, in the specc function.
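A minimal sketch on kernlab’s built-in spirals dataset (the centers value of 2 matches the two spirals; the call follows the kernlab documentation example):

```r
library(kernlab)  # specc()

data(spirals)  # two intertwined spirals shipped with kernlab

# Spectral clustering into 2 clusters.
sc <- specc(spirals, centers = 2)

plot(spirals, col = sc, pch = 19)
```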
Let’s try it on more advanced data: the compound dataset.
This is not a good result; let’s try DBSCAN.
Again, a nice result for DBSCAN on an artificial dataset.
Hierarchical clustering
The result of hierarchical clustering is a dendrogram. The dendrogram can be cut at any height to form a partition of the data into clusters. There are multiple possible ways (linkages) of connecting data points in the dendrogram, each with its own criterion:
- Single-linkage
- Complete-linkage
- Average-linkage
- Centroid-linkage
- Ward’s minimum variance method
- etc.
Criteria:
- single-linkage: \( \min \{ d(a,b) : a \in A, b \in B \} \)
- complete-linkage: \( \max \{ d(a,b) : a \in A, b \in B \} \)
- average-linkage: \( \frac{1}{|A| |B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \)
- centroid-linkage: \( \| c_s - c_t \| \), where \( c_s \) and \( c_t \) are the centroids of clusters \( s \) and \( t \),

where \( d(a,b) \) is the distance between points \( a \) and \( b \).
IRIS dataset use case
Let’s try hierarchical clustering on the IRIS dataset.
Single linkage:
Complete linkage:
Average linkage:
Let’s compute the precision of these three clusterings with the clusterCrit package:
The best result was obtained with average linkage, with a precision of 81.9%.
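The comparison above can be sketched as follows; cutting each dendrogram at k = 3 (matching the three iris species) is an assumption, since the post’s exact settings are not shown, so the resulting numbers may differ from those quoted.

```r
library(clusterCrit)  # extCriteria()

X <- as.matrix(iris[, 1:4])
truth <- as.integer(iris$Species)
d <- dist(X)

# Cut each dendrogram into 3 clusters and compute external
# precision against the true species labels.
prec <- sapply(c("single", "complete", "average"), function(link) {
  part <- as.integer(cutree(hclust(d, method = link), k = 3))
  extCriteria(part, truth, "Precision")$precision
})

round(prec, 3)
```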
Connected data
I have prepared the last use case for most of the shown methods: data (and clusters) that are closely connected, so the closest scenario to real data.
DBSCAN - result for connected data
The chosen parameters are \( \epsilon = 0.08 \) and \( minPts = 18 \).
The result is quite good: the three main core groups are identified. Let’s change minPts to 10.
We can see a use case where DBSCAN is very sensitive to its parameter settings; you have to be very careful with any automatic setting of these parameters (in your use cases).
K-means - result for connected data
A nice result, to be fair, for this simple method.
Gaussian model-based clustering result
An almost perfect result, but that is due to the normality of the sampled data.
Spectral clustering result
Very nice result again!
Other types of clustering methods
Other types of clustering methods that were not covered in this blog post are:
- Grid-based
- Subspace clustering
- Multi-view clustering
- Based on artificial neural networks (e.g. SOM)
- Consensus (ensemble) clustering
- Data stream(s) clustering
- etc.
Conclusions
- We have many types of clustering methods,
- Different datasets need different clustering methods,
- Automatic determination of the number of clusters can be tricky,
- Real datasets usually have connected clusters, where density-based methods can fail,
- Outliers (anomalies) can significantly influence clustering results; the solution is to preprocess the data or use density-based clustering,
- Some methods are not suited for large (or high-dimensional) datasets: model-based or spectral clustering,
- Some methods are not suited for non-convex clusters: K-means, basic hierarchical clustering.