K-Means Clustering in R – Exercises

sindri

4 years ago

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

K-means is efficient, and perhaps, the most popular clustering method. It is a way for finding natural groups in otherwise unlabeled data. You specify the number of clusters you want defined and the algorithm minimizes the total within-cluster variance.

In this exercise, we will play around with the base R inbuilt k-means function on some labeled data.

Solutions are available here.

Exercise 1

Feed the columns with sepal measurements in the inbuilt iris data-set to the k-means; save the cluster vector of each observation. Use 3 centers and set the random seed to 1 before.

Exercise 2

Check the proportions of each species by cluster.

Exercise 3

Make a plot with sepal length on the horizontal axis and width on the vertical axis. Find a way to visualize both the actual species and the cluster the algorithm is categorized into.

Exercise 4

Repeat the clustering from step one, but include petal measurements also. Does the clustering reflect the actual species better now?

< aside class='stb-icon'>

Learn more about cluster analysis in the online course Applied Multivariate Analysis with R. In this course you will learn how to work with hiërarchical clustering, k-means clustering and much more.

Exercise 5

Create a new data-set identical to iris, but multiply the “Petal.Width” by 2. Are the results different now?

Exercise 6

Standardize your new data-set so that each variable has a mean of 0 and a variance of 1; check once more if multiplying by two makes a difference.

Exercise 7

Read in the Titanic train.csv data-set from Kaggle.com (you might have to sign up first). Turn the sex column into a dummy variable, =1, if it is male, “0” otherwise, and “Pclass” into a dummy variable for the most common class 3. Using four columns, Sex, SibSp, Parch and Fare, apply the k-means algorithm to get 4 clusters and use nstart=20. Remember to set the seed to 1 so the results are comparable. Note that these four variables have no special meaning to the problem and dummy data in k-means is probably not a good idea in general – we are just playing around.

Exercise 8

Now, how would you describe, in words, each of the clusters and what the survival rates were?

Exercise 9

Now, 4 clusters was an arbitrary choice. Apply k-means clustering using nstart=20, but using k from 2 to 20 and store the result.

Exercise 10

Plot the percent of variance explained by the clusters (between sum of squares over the total sum of squares). What seems like a reasonable number of clusters, according to the elbow method?

(Picture by giveawayboy)

Related exercise sets:

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.