# Generate datasets to understand some clustering algorithms behavior

**Stanislas Morbieu - R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In order to understand how a clustering algorithm works, good sample datasets are useful to highlight its behavior under certain circumstances. This post shows how to generate 9 datasets:

- a mixture of two Gaussians with same size, variance and no covariance,
- Gaussians which differ only from their means and
*sizes*, - Gaussians which differ only from their means and
*variances*, - Gaussians with a different direction,
- a general mixture of Gaussians,
- spherical data,
- a mixture of uniform data,
- and unclustered data.

## Mixture of two Gaussians

Our first dataset consists of a mixture of two Gaussians.

Let’s first define a function to generate a Gaussian. This function takes four arguments:

- a number of samples to generate,
`n`, - a center, corresponding to the mean of the Gaussian,
- a covariance matrix,
`sigma`, - and a cluster label.

library(mvtnorm) library(dplyr) generateGaussianData <- function(n, center, sigma, label) { data = rmvnorm(n, mean = center, sigma = sigma) data = data.frame(data) names(data) = c("x", "y") data = data %>% mutate(class=factor(label)) data }

We use the function `rmvnorm` from the package `mvtnorm` to generate random numbers following a multivariate normal distribution.

The package `dplyr` is used to manipulate dataframes, especially the `%>%` operator allows to pass the variable on the left of the operator as the first argument of the function on the right. It is convenient since we can read the processing pipeline from the left to the right, the transformation steps being the functions between the `%>%` operators.

For instance, the `mutate` function creates a new column named `class`.

We can now generate our first dataset:

dataset1 <- { # cluster 1 n = 500 center = c(5, 5) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 500 center = c(1, 1) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "1 - Mixture of Gaussians" data }

We defined two clusters of equal size (500 samples each), with an identity covariance matrix. The function `bind_rows` is defined by the `dplyr` package and stacks the two dataframes `data1` and `data2`. The variable `data` is thus a dataframe with \(500 + 500 = 1000\) rows, each row being an observation.

We can see how the dataframe looks like with:

str(dataset1)

It outputs:

'data.frame': 1000 obs. of 4 variables: $ x : num 3,79 5,09 5,50 4,50 4,94 ... $ y : num 5,35 5,95 6,52 4,82 5,52 ... $ class : chr "1" "1" "1" "1" ... $ dataset: chr "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" ...

We can also visualize this dataset:

library(ggplot2) dataset1 %>% ggplot(aes(x=x, y=y, shape=class)) + geom_point() + coord_fixed() + scale_shape_manual(values=c(2, 3))

We used the `ggplot2` package to have a scatter plot with `geam_point`. The function `coord_fixed()` is here to have the same scale for the two axis, and we define the symbols to use for the points with `scale_shape_manual(values=c(2, 3))`. We take the columns `x` and `y` as coordinates and the shape is given by the `class` column.

Here is the plot we get:

This example is very specific since the variances of the two classes are equal, there is no covariance between `x` and `y` inside a class, and the two classes have the same size (*i.e.* number of samples per class). It makes some algorithms (k-means for instance) behave well, *i.e.* able to recover the true classes.

## Mixture of two Gaussians with different sizes

For the second dataset, we generate two Gaussians as before, but this time, the two classes have distinct sizes:

dataset2 <- { # cluster 1 n = 2000 center = c(5, 5) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 50 center = c(1, 1) sigma = matrix(c(1, 0, 0, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "2 - Different sizes" data }

We can plot it as before:

dataset2 %>% ggplot(aes(x=x, y=y, shape=class)) + geom_point() + coord_fixed() + scale_shape_manual(values=c(2, 3))

It produces:

We will see in a following post that this simple relaxation of the assumption on the sizes of the classes can make some algorithms (k-means for instance) behave differently, so that they cannot recover the true classes.

## Mixture of two Gaussians with different variances

The third dataset is also generated by two Gaussians but with different variances:

dataset3 <- { # cluster 1 n = 100 center = c(5, 5) sigma = matrix(c(3, 0, 0, 3), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 100 center = c(1, 1) sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "3 - Different variances" data }

The corresponding plot is:

Again, a simple relaxation, this time on the variances, makes some algorithms unable to recover the true classes.

## Mixture of Gaussians with non zero covariance

The fourth dataset consists of two Gaussians, but this time with high covariance:

dataset4 <- { # cluster 1 n = 500 center = c(4, 4) sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 500 center = c(5, 5) sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "4 - Non zero covariance" data }

The corresponding plot is:

## Mixture of Gaussians with different sizes, variances and covariance

The Gaussian distribution has two parameters: the mean and the covariance matrix. When generating a mixture of Gaussians, a third parameter step in. Let's see how to combine the previous examples: changing all these parameters at the same time:

dataset5 <- { # cluster 1 n = 1000 center = c(1, 1) sigma = matrix(c(3, -0.9, -0.9, 3), nrow = 2) data1 = generateGaussianData(n, center, sigma, 1) # cluster 2 n = 50 center = c(5, 5) sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2) data2 = generateGaussianData(n, center, sigma, 2) # all data data = bind_rows(data1, data2) data$dataset = "5 - Disparate Gaussians" data }

The corresponding plot is:

We can now move on to another distribution.

## Spherical data

Some real world datasets are inherently spherical, *i.e.* the points are lying on the surface of a sphere, so generating a spherical dataset is helpful to understand how an algorithm behave on this kind of data, in a controlled environment (we know our dataset better when we generate it).

Our spherical dataset consists of a mixture of data following a

Von Mises-Fisher distribution.

Let's first define a function to generate one class, *i.e.* one Von Mises-Fisher distribution:

library(movMF) generateMFData <- function(n, theta, cluster, scale=1) { data = rmovMF(n, theta) data = data.frame(data[,1], data[,2]) data = scale * data names(data) = c("x", "y") data = data %>% mutate(class=factor(cluster)) data }

The function `rmovMF` from the `movMF` package generates samples given the two parameters of the Von Mises-Fisher distribution. We can now generate our mixture:

dataset6 <- { # cluster 1 n = 100 data1 = generateMFData(n, 1 * c(-1, 1), 1, 5) # cluster 2 n = 100 data2 = generateMFData(n, 1 * c(1, -1), 2, 1) # all data data = bind_rows(data1, data2) data$dataset = "6 - Spherical classes" data }

The corresponding plot is:

We can see that all the points are lying on the surface of the hypersphere.

## Spirals

We are going to generate the following dataset:

For this, we will create data for one spiral, by creating columns for:

- the angle which spreads evenly between
`angleStart`and`angleStart + 2.5 * pi`, - the radius which decrements for each sample,
- the
`x`and`y`coordinates, based on the radius and the angle.

The second spiral is the same as the first one, except we take the negative values of `x` and `y`.

We then combine the two dataframes and add some noise, and shift it so the coordinates stay in the same range as in the other datasets.

Here is the complete code to produce it:

generateSpiralData <- function(n) { maxRadius = 7 xShift = 2.5 yShift = 2.5 angleStart = 2.5 * pi noiseVariance = 0.1 # first spiral firstSpiral <- function() { d1 = data.frame(0:(n-1)) colnames(d1) <- c("i") d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n), radius = maxRadius * (n + n/5 - i)/ (n + n/5), x = radius * sin(angle), y = radius * cos(angle), class=1) } d1 = firstSpiral() # second spiral d2 = d1 %>% mutate(x = -x, y = -y, class=2) # combine, add noise, and shift generateNoise <- function(n) { sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2) noise = rmvnorm(n, mean = c(0, 0), sigma = sigma) df = data.frame(noise) colnames(df) <- c("xNoise", "yNoise") df } d1 %>% bind_rows(d2) %>% bind_cols(generateNoise(2*n)) %>% transmute(x = x + xShift + xNoise, y = y + yShift + yNoise, dataset = "7 - Spirals", class = factor(class)) } dataset7 = generateSpiralData(500)

We used the `transmute` function from `dplyr` to create new columns and remove the other ones (used only temporarily).

## Uniform data

We can generate data from the uniform distribution with the function `runif`:

dataset8 <- { generateUniformData <- function(cluster, minX, maxX) { x = runif(500, min = minX, max = maxX) y = runif(500, min = -4, max = 9) data.frame(x, y) %>% mutate(class=cluster) } generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>% mutate(dataset = "8 - Uniform data", class = factor(class)) }

The corresponding plot is:

## Unclustered data

The last dataset is almost the same as the previous one but the ranges for the uniform distributions are the same, so it is impossible to distinguish between the two classes:

dataset9 <- { generateuniformdata <- function(cluster) { x = runif(500, min = -4, max = 9) y = runif(500, min = -4, max = 9) data.frame(x, y) %>% mutate(class=cluster) } generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>% mutate(dataset = "9 - unclustered data", class = factor(class)) }

The corresponding plot is:

## All datasets in one plot

To get a big picture of all the datasets we generated, we can combine them all, in a single dataframe, and plot it in a single plot:

data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9) scatterPlot <- function(data) { scatterPlot = ggplot(data, aes(x=x, y=y, shape=class)) + geom_point() + coord_fixed() + scale_shape_manual(values=c(2, 3)) + facet_wrap(~dataset, nrow=3) scatterPlot } scatterPlot(data)

The `facet_wrap` function of the `ggplot2` package allows us to plot all the datasets in a single plot:

We will use these datasets in other posts to illustrate the behavior of some clustering algorithms.

**leave a comment**for the author, please follow the link and comment on their blog:

**Stanislas Morbieu - R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.