Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

K-means is the most known and used clustering algorithm. It has however several drawbacks and does not behave nicely on some datasets.

In fact, every clustering algorithm has its own strenghts and drawbacks. Each relies on some assumptions on the dataset and leverages these properties to cluster the data into groups. The No Free Lunch Theorem states that “any two algorithms are equivalent when their performance is averaged across all possible problems”. It means that we cannot use one algorithm for every dataset. The role of the user is thus to choose the right algorithm for the dataset to handle. To do so, it is important to know how the algorithms behave on several datasets and understand how they will handle new datasets. The properties of the datasets and the assumptions on which the algorithm rely, along with its criterion, can inform us on this.

This blog post presents the k-means clustering algorithm by:

• applying it with the R language on the datasets generated in a previous post;
• looking at the assumptions it makes on the dataset;
• presenting its criterion and the family of algorithms it belongs to;

It is not an extensive overview on k-means: For instance, it does not present the algorithm nor its computational cost. We will also not look at some important problems we face when using it, such as falling into local optima and finding the appropriate number of clusters.

## K-means behavior on generated datasets

Let’s first see how it behaves on the some generated datasets. We can generate some interesting datasets with the following R code:

library(mvtnorm)
library(dplyr)
library(movMF)

generateGaussianData <- function(n, center, sigma, label) {
data = rmvnorm(n, mean = center, sigma = sigma)
data = data.frame(data)
names(data) = c("x", "y")
data = data %>% mutate(class=factor(label))
data
}

dataset1 <- {
# cluster 1
n = 500
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)

# cluster 2
n = 500
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)

# all data
data = bind_rows(data1, data2)

data$dataset = "1 - Mixture of Gaussians" data } dataset2 <- { # cluster 1 n = 2000 center = c(5, 5) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 50 center = c(1, 1) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "2 - Different sizes"

data
}

dataset3 <- {
# cluster 1
n = 100
center = c(5, 5)
sigma = matrix(c(3, 0, 0, 3), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)

# cluster 2
n = 100
center = c(1, 1)
sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)

# all data
data = bind_rows(data1, data2)

data$dataset = "3 - Different variances" data } dataset4 <- { # cluster 1 n = 500 center = c(4, 4) sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 500 center = c(5, 5) sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "4 - Non zero covariance"

data
}

generateMFData <- function(n, theta, cluster, scale=1) {
data = rmovMF(n, theta)
data = data.frame(data[,1], data[,2])
data = scale * data
names(data) = c("x", "y")
data = data %>% mutate(class=factor(cluster))
data
}

dataset6 <- {
# cluster 1
n = 100
data1 = generateMFData(n, 1 * c(-1, 1), 1, 5)

# cluster 2
n = 100
data2 = generateMFData(n, 1 * c(1, -1), 2, 1)

# all data
data = bind_rows(data1, data2)

data$dataset = "6 - Spherical classes" data } generateSpiralData <- function(n) { maxRadius = 7 xShift = 2.5 yShift = 2.5 angleStart = 2.5 * pi noiseVariance = 0.1 # first spiral firstSpiral <- function() { d1 = data.frame(0:(n-1)) colnames(d1) <- c("i") d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n), radius = maxRadius * (n + n/5 - i)/ (n + n/5), x = radius * sin(angle), y = radius * cos(angle), class=1) } d1 = firstSpiral() # second spiral d2 = d1 %>% mutate(x = -x, y = -y, class=2) # combine, add noise, and shift generateNoise <- function(n) { sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2) noise = rmvnorm(n, mean = c(0, 0), sigma = sigma) df = data.frame(noise) colnames(df) <- c("xNoise", "yNoise") df } d1 %>% bind_rows(d2) %>% bind_cols(generateNoise(2*n)) %>% transmute(x = x + xShift + xNoise, y = y + yShift + yNoise, dataset = "7 - Spirals", class = factor(class)) } dataset7 = generateSpiralData(500) dataset8 <- { generateUniformData <- function(cluster, minX, maxX) { x = runif(500, min = minX, max = maxX) y = runif(500, min = -4, max = 9) data.frame(x, y) %>% mutate(class=cluster) } generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>% mutate(dataset = "8 - Uniform data", class = factor(class)) } dataset9 <- { generateuniformdata <- function(cluster) { x = runif(500, min = -4, max = 9) y = runif(500, min = -4, max = 9) data.frame(x, y) %>% mutate(class=cluster) } generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>% mutate(dataset = "9 - unclustered data", class = factor(class)) } data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9)  One can then apply the k-means algorithm on these datasets with following code: dataKmeans = data %>% group_by(dataset) %>% do(data.frame(., cluster = factor(kmeans(cbind(.$x,.$y), centers=2)$cluster)))


Note that we used the do function of dplyr to apply the algorithm on each dataset. The kmeans algorithm comes built-in with R (other implementations are available through libraries).

Plotting the nine datasets and color the points given the clusters found by the algorithm can be done with:

library(ggplot2)

scatterPlotWithLabels <- function(data) {
scatterPlot = ggplot(data, aes(x=x, y=y, shape=class, color=cluster)) +
geom_point() +
coord_fixed() +
scale_shape_manual(values=c(2, 3)) +
facet_wrap(~dataset, nrow=3)
scatterPlot
}

scatterPlotWithLabels(dataKmeans)


It draws the following plot: It works pretty well for the first dataset, but unfortunately not for the others. This behavior is predictable when we know the criterion the algorithm is optimizing and how the data is generated.

## K-means assumptions and criterion

For a given number of cluster K, and a set of N vectors $$x_i , i \in [1, N]$$, K-means aims to minimize the following criterion:

\begin{equation*} W(z, \mu) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} ||x_i - \mu_k||^2 \end{equation*}

with:

• $$z_{ik} \in {0, 1}$$ indicates if the vector $$x_i$$ belongs to the cluster $$k$$;
• $$\mu_k \in \mathbb{R}^p$$, the center of the cluster $$k$$.

It minimizes the intra-cluster variance, i.e. the sum of the squared distances between the center of the cluster and the objects associated with it.

To understand why the algorithm works well on the first dataset and not on the others, it is helpfull to see that the criterion is a particular case of a broader one: the Gaussian mixture models try to maximize the following criterion:

\begin{equation*} L_C(\theta) = \sum_{i, k} z_{ik} \log (\pi_k \phi_k(x_i, \alpha_k)) \end{equation*}

where:

• $$\theta = (\pi, \alpha)$$ with $$\pi$$ is a vector of proportions and $$\alpha$$ is a vector of parameters of a component;
• $$\phi_k(x_i, \alpha_k)$$ is the density of an observation $$x_i$$ from the $$k^{th}$$ component.

If we consider a mixture of Gaussians, the density function is defined by:

\begin{equation*} \phi_k(x; \mu, \Sigma) = \frac{1}{(2 \pi)^{n/2} |\Sigma|^{1/2}} \exp (- \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x-\mu)) \end{equation*}

with:

• $$n$$ the number of dimensions of the vector $$x$$;
• $$\mu$$ the mean vector;
• $$\Sigma$$ the covariance matrix.

If we consider a mixture of Gaussians with an identity covariance matrix (same variance and no covariance) and equal proportions (all $$\pi_k$$ are equals), we obtain, up to a constant, the criterion of k-means.

Thus it is predictable that the k-means algorithm works well on the first dataset since it fits perfectly the assumptions on the probability distribution generating the objects to cluster.

## Take away

Here are a few things to remember from this blog post:

• There is no clustering algorithm to rule them all: the user should choose carefully the algorithm given the data and the criterion to optimize
• K-means assumes the data is generated by a mixture of Gaussians with the same variance, the same proportion (same number of objects per class) and no intra-class covariance.

The second point can be see two ways:

• If you have a prior on the data, you can choose the algorithm to cluster data according to the criterion it optimizes and the prior you have on the data.
• If you don’t have any prior on the data, you can still apply k-means but should not overinterpret the results. For instance, the size of the clusters is enforced by the algorithm, thus you cannot infer that the underlying classes have equal proportions.