# How to determine the number of Clusters for K-Means in R

**R – Predictive Hacks**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We will work with the Breast Cancer Wisconsin dataset, where we will apply the K-Means algorithm to the individual’s features ignoring the dependent variable `diagnosis`

. Notice that all the features are numeric.

library(tidyverse) # the column names of the dataset names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst') # get the data from the URL and assign the column names df<-read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"), col.names=names)

**Scale your Data**

Before we apply any cluster analysis, we should scale our data. We will remove the `id_number`

and the `diagnosis`

scaled_data<-as.data.frame(scale(df%>%select(-id_number, -diagnosis)))

**Elbow Method**

In a previous post, we explained how we can apply the Elbow Method in Python. Here, we will use the `map_dbl`

to run `kmeans`

using the `scaled_data`

for `k `

values ranging from 1 to 10 and extract the total **within-cluster sum of squares** value from each model. Then we can visualize the relationship using a line plot to create the elbow plot where we are looking for a sharp decline from one k to another followed by a more gradual decrease in slope. The last value of k before the slope of the plot levels off suggests a “good” value of k.

# Use map_dbl to run many models with varying value of k (centers) tot_withinss <- map_dbl(1:10, function(k){ model <- kmeans(x = scaled_data, centers = k) model$tot.withinss }) # Generate a data frame containing both k and tot_withinss elbow_df <- data.frame( k = 1:10, tot_withinss = tot_withinss ) # Plot the elbow plot ggplot(elbow_df, aes(x = k, y = tot_withinss)) + geom_line() + geom_point()+ scale_x_continuous(breaks = 1:10)

According to the Elbow Method, we can argue that the number of suggested K Clusters are **2**.

**Silhouette Analysis**

Silhouette analysis allows you to calculate how similar each observation is with the cluster it is assigned relative to other clusters. This metric ranges from **-1** to **1** for each observation in your data and can be interpreted as follows:

- Values close to
**1**suggest that the observation is well matched to the assigned cluster - Values close to
**0**suggest that the observation is borderline matched between two clusters - Values close to
**-1**suggest that the observations may be assigned to the wrong cluster

We can determine the number of clusters K using the average silhouette width. We pick the K which maximizes that score.

# Use map_dbl to run many models with varying value of k sil_width <- map_dbl(2:10, function(k){ model <- pam(x = scaled_data, k = k) model$silinfo$avg.width }) # Generate a data frame containing both k and sil_width sil_df <- data.frame( k = 2:10, sil_width = sil_width ) # Plot the relationship between k and sil_width ggplot(sil_df, aes(x = k, y = sil_width)) + geom_line() + geom_point() + scale_x_continuous(breaks = 2:10)

As we can see from the plot above, the** “Best” k is 2**

**Gap Statistic**

The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference dataset is generated using Monte Carlo simulations of the sampling process

library(factoextra) library(cluster) # compute gap statistic set.seed(123) gap_stat <- clusGap(scaled_data, FUN = kmeans, nstart = 25, K.max = 10, B = 50) fviz_gap_stat(gap_stat)

Again, according to the **Gap Statistic**, the optimum number of clusters is the **k=2**.

**All 3 methods in one package**

Let’s see how we can produce the same analysis for the three methods above with a few lines of coding!

library(factoextra) library(NbClust) # Elbow method fviz_nbclust(scaled_data, kmeans, method = "wss") + geom_vline(xintercept = 2, linetype = 2)+ labs(subtitle = "Elbow method") # Silhouette method fviz_nbclust(scaled_data, kmeans, method = "silhouette")+ labs(subtitle = "Silhouette method") # Gap statistic # nboot = 50 to keep the function speedy. # recommended value: nboot= 500 for your analysis. # Use verbose = FALSE to hide computing progression. set.seed(123) fviz_nbclust(scaled_data, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+ labs(subtitle = "Gap statistic method")

**Visualize the K-Means**

Since we determined that the number of clusters should be 2, then we can run the k-means algorithm with k=2. Let’s visualize our data into two dimensions.

fviz_cluster(kmeans(scaled_data, centers = 2), geom = "point", data = scaled_date)

**Clusters and Classes in the same plot**

Based on the analysis above, the suggested number of clusters in K-means was 2. Bear in mind that in our dataset we have also the dependent variable `diagnosis`

which takes values `B`

and `M`

. Let’s represent at the same plot the `Clusters (k=2)`

and the `Classes (B,M)`

.

We will apply **PCA **by keeping the first two PCs.

# get the PCA of the scaled data pca_res <- prcomp(scaled_data) # add the Cluster to the original data frame df$cluster<-as.factor(kmeans(scaled_date, centers = 2)$cluster) # add the PC1 and PC2 to the original data frame df$PC1<-pca_res$x[,1] df$PC2<-pca_res$x[,2] # do a scatter plot of PC1, PC2 with a color of cluster on a separate graph # for diagnosis is M and B ggplot(aes(x=PC1, y=PC2, col=cluster), data=df)+geom_point()+facet_grid(.~diagnosis)

As we can see the majority of patients with a “Benign” tumor were in the first cluster and the patients with a “Malignant” tumor at the second cluster.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Predictive Hacks**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.