Practical Guide to Cluster Analysis in R – Book
Introduction
Large amounts of data are collected every day from satellite imaging, biomedical, security, marketing, web search, geospatial and other automated sources. Mining knowledge from such big data far exceeds human abilities.
Clustering is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.
In the literature, it is referred to as “pattern recognition” or “unsupervised machine learning”: “unsupervised” because we are not guided by a priori ideas of which variables or samples belong in which clusters, and “learning” because the algorithm “learns” how to cluster.
Cluster analysis is popular in many fields, including:

In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.

In marketing, for market segmentation by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.

In city planning, for identifying groups of houses according to their type, value and location.
Preview of the first 38 pages of the book: Practical Guide to Cluster Analysis in R (preview).
Download the ebook through payhip:
Order a physical copy from amazon:
Key features of this book
Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are either too high-level and theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.
The main parts of the book include:
 distance measures,
 partitioning clustering,
 hierarchical clustering,
 cluster validation methods, as well as,
 advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.
The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.
Key features:
 Covers clustering algorithms and their implementation
 Key mathematical concepts are presented
 Short, self-contained chapters with practical examples. This means that you don’t need to read the chapters in sequence.
How this book is organized
This book contains 5 parts. Part I (Chapters 1 to 3) provides a quick introduction to R (Chapter 1) and presents the required R packages and data format (Chapter 2) for cluster analysis and visualization.
Classifying objects into clusters requires a method for measuring the distance or (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.
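As a rough illustration of what a distance measure looks like in practice, the sketch below computes two common ones with base R's dist(). The built-in USArrests data set and the choice of Euclidean and Manhattan distances are assumptions made for this example, not taken from the book's text.

```r
# Built-in data set chosen for illustration: 50 US states, 4 numeric variables
df <- scale(USArrests)   # standardize so that no variable dominates the distance

# Pairwise distances between observations (rows)
d_euc <- dist(df, method = "euclidean")
d_man <- dist(df, method = "manhattan")

# Inspect the Euclidean distances among the first three states
round(as.matrix(d_euc)[1:3, 1:3], 2)
```

Standardizing first matters here because the variables are on different scales; without it, the variable with the largest variance would drive the distances.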
Part II starts with partitioning clustering methods, which include:
 K-means clustering (Chapter 4),
 K-medoids or PAM (partitioning around medoids) algorithm (Chapter 5) and
 CLARA algorithm (Chapter 6).
Partitioning clustering approaches subdivide the data set into k groups, where k is the number of groups prespecified by the analyst.
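A minimal sketch of partitioning clustering with base R's kmeans() is shown here. The data set (USArrests) and the value k = 4 are assumptions picked for illustration only; in a real analysis k is the analyst's choice.

```r
set.seed(123)            # k-means starts from random centers; fix the seed
df <- scale(USArrests)   # illustrative data, standardized

k <- 4                   # number of groups, prespecified by the analyst
km <- kmeans(df, centers = k, nstart = 25)   # 25 random starts, best one kept

table(km$cluster)        # cluster sizes
km$centers               # cluster means, in standardized units
```

Using several random starts (nstart) reduces the chance of reporting a poor local optimum.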
In Part III, we consider agglomerative hierarchical clustering, which is an alternative approach to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be specified in advance. The result of hierarchical clustering is a tree-based representation of the objects, which is also known as a dendrogram (see the figure below).
In this part, we describe how to compute, visualize, interpret and compare dendrograms:
 Agglomerative clustering (Chapter 7)
   Algorithm and steps
   Verify the cluster tree
   Cut the dendrogram into different groups
 Compare dendrograms (Chapter 8)
   Visual comparison of two dendrograms
   Correlation matrix between a list of dendrograms
 Visualize dendrograms (Chapter 9)
   Case of small data sets
   Case of dendrogram with large data sets: zoom, subtree, PDF
   Customize dendrograms using dendextend
 Heatmap: static and interactive (Chapter 10)
   R base heat maps
   Pretty heat maps
   Interactive heat maps
   Complex heatmap
   Real application: gene expression data
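The first steps listed above (compute the tree, verify it, cut it into groups, plot it) can be sketched with base R alone. The data set, the Ward linkage and k = 4 are assumptions chosen for illustration.

```r
df <- scale(USArrests)                # illustrative data, standardized
d  <- dist(df)                        # Euclidean distances
hc <- hclust(d, method = "ward.D2")   # agglomerative clustering

# Verify the tree: correlation between cophenetic and original distances
coph_cor <- cor(cophenetic(hc), d)
coph_cor

# Cut the dendrogram into 4 groups
grp <- cutree(hc, k = 4)
table(grp)

# Plot the dendrogram and draw rectangles around the 4 clusters
plot(hc, cex = 0.6, hang = -1)
rect.hclust(hc, k = 4, border = 2:5)
```

A cophenetic correlation close to 1 suggests that the tree preserves the original pairwise distances reasonably well.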
In this section, you will learn how to generate and interpret the following plots.
 Standard dendrogram with filled rectangle around clusters:
 Compare two dendrograms:
 Heatmap:
Part IV describes clustering validation and evaluation strategies, which consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency, that is, whether the data are suitable for clustering at all. If they are, the next question is how many clusters there are. Next, you can perform hierarchical clustering or partitioning clustering (with a prespecified number of clusters). Finally, you can use a number of measures, described in this part, to evaluate the goodness of the clustering results.
The chapters included in Part IV are organized as follows:

Assessing clustering tendency (Chapter 11)

Determining the optimal number of clusters (Chapter 12)

Cluster validation statistics (Chapter 13)

Choosing the best clustering algorithms (Chapter 14)

Computing p-values for hierarchical clustering (Chapter 15)
In this section, you’ll learn how to create and interpret the following plots.
 Visual assessment of clustering tendency (left panel): Clustering tendency is detected visually by counting the number of square-shaped dark blocks along the diagonal of the image.
 Determine the optimal number of clusters (right panel) in a data set using the gap statistic.
 Cluster validation using the silhouette coefficient (Si): A value of Si close to 1 indicates that the object is well clustered. A value of Si close to -1 indicates that the object is poorly clustered. The figure below shows the silhouette plot of a k-means clustering.
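To make the silhouette coefficient concrete, the sketch below computes Si for each observation of a k-means result using silhouette() from the cluster package (a recommended package distributed with R). The data set and k = 4 are assumptions made for this example.

```r
library(cluster)         # provides silhouette()

set.seed(123)
df <- scale(USArrests)   # illustrative data, standardized
km <- kmeans(df, centers = 4, nstart = 25)

# One row per observation: cluster, nearest neighbor cluster, Si
sil <- silhouette(km$cluster, dist(df))
summary(sil)$avg.width   # average silhouette width over all observations

# Observations with negative Si are probably assigned to the wrong cluster
rownames(df)[sil[, "sil_width"] < 0]

plot(sil, border = NA)   # silhouette plot
```

Every Si lies between -1 and 1, so the average width gives a single number for comparing clusterings of the same data.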
Part V presents advanced clustering methods, including:
 Hierarchical k-means clustering (Chapter 16)
 Fuzzy clustering (Chapter 17)
 Model-based clustering (Chapter 18)
 DBSCAN: Density-Based Clustering (Chapter 19)
Hierarchical k-means clustering is a hybrid approach for improving k-means results.
In fuzzy clustering, items can be members of more than one cluster. Each item has a set of membership coefficients corresponding to its degree of belonging to each cluster.
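The membership coefficients can be inspected directly. The following is a minimal sketch using fanny() from the cluster package; the data set and k = 2 are assumptions chosen for illustration.

```r
library(cluster)         # provides fanny()

df <- scale(USArrests)   # illustrative data, standardized
fz <- fanny(df, k = 2)   # fuzzy clustering into 2 clusters

# One row per item, one column per cluster; each row sums to 1
head(round(fz$membership, 2))

# Crisp assignment: the cluster with the highest membership for each item
head(fz$clustering)
```

An item with coefficients near 0.5 and 0.5 sits between the two clusters, which a hard partition would not reveal.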
In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. The method finds the model that best fits the data and estimates the number of clusters.
Density-based clustering (DBSCAN) is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers.