Introduction to k-Means clustering in R

May 29, 2016
(This article was first published on R Language – the data science blog, and kindly contributed to R-bloggers)

k-means clustering partitions n observations into k clusters so that each observation belongs to the cluster whose mean (centroid) is nearest; that mean serves as the prototype of the cluster. The R code below is a starting point for running k-means clustering in R. The dataset can be downloaded from here.
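
To make the definition concrete, here is a minimal sketch of the basic k-means iteration (Lloyd's algorithm): assign every observation to its nearest centre, move each centre to the mean of its observations, and repeat until the assignments stop changing. The helper name simple_kmeans() is hypothetical, and the built-in iris measurements are assumed as example input. This is an illustration only; R's own kmeans() uses the Hartigan-Wong algorithm by default and should be preferred in practice.

```r
# Illustration only: simple_kmeans() is a hypothetical helper, not part of base R.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # start from k observations chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every observation to every centre (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # assignment step: index of the nearest centre for each observation
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments unchanged: stop
    cluster <- new_cluster
    # update step: move each non-empty cluster's centre to the mean of its members
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # the random start makes results seed-dependent
fit <- simple_kmeans(scale(iris[, 1:4]), k = 3)
table(fit$cluster)
```

Because the result depends on the random starting centres, a single run can settle in a poor local optimum, which is why the kmeans() calls in the script below pass nstart to try several starts.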

# Topics Covered
#
# 1. Reading data and Summary Statistics
# 2. Determining the Optimal Number of Clusters
# 3. Running Clustering Algorithm and Visualisations

##############################################################################
#Reading data and Summary Statistics

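#change the working directory
setwd("C:\\Users\\ujjwal.karn\\Desktop\\Classification & Clustering")

#load the data
#NOTE: the read step is missing from this listing. The column names used below
#match the built-in iris data, so its four numeric measurements are assumed here.
mydata <- iris[, 1:4]

str(mydata)
summary(mydata)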

plot(mydata[c("Sepal.Length", "Sepal.Width")], main="Raw Data")

#standardising the data
mydata <- scale(mydata)

##############################################################################
#Determining the Optimal Number of Clusters
#http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters/

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))

for(i in 1:25){wss[i] <- sum(kmeans(mydata, centers=i)$withinss)}

plot(1:25, wss, type="b", xlab="No. of Clusters", ylab="wss")

wss
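
# A more reproducible variant of the loop above (a sketch, not part of the
# original script). With a single random start per k the curve can change from
# run to run, so this version fixes the seed and uses several starts per k.
# It assumes the built-in iris measurements as input; x_chk and wss_chk are
# hypothetical names used only here.
set.seed(123)
x_chk <- scale(iris[, 1:4])
# tot.withinss is the total within-cluster sum of squares reported by kmeans()
wss_chk <- sapply(1:10, function(k) kmeans(x_chk, centers=k, nstart=20)$tot.withinss)
plot(1:10, wss_chk, type="b", xlab="No. of Clusters", ylab="wss")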

##############################################################################
#Running Clustering Algorithm

# trying with 4 clusters
clus4 <- kmeans(mydata, centers=4, nstart=30)

#check between_SS / total_SS
clus4
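
# The between_SS / total_SS figure in the printout can also be extracted as a
# number. explained_ratio() is a hypothetical helper, not part of the original
# script; betweenss and totss are components of the object returned by kmeans().
# Values closer to 1 mean the clusters account for more of the total variation.
explained_ratio <- function(fit) fit$betweenss / fit$totss
# usage: explained_ratio(clus4)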

# get cluster means
aggregate(mydata ,by=list(clus4$cluster), FUN=mean)

# append cluster assignment
# (stored in a new data frame so that mydata keeps only the standardised
# features and can be reused for the 3-cluster run below)
mydata4 <- data.frame(mydata, cluster=clus4$cluster)

#summary
groups <- data.frame(clus4$cluster)
table(groups)

plot(mydata4[c("Sepal.Length", "Sepal.Width")], col=clus4$cluster)
points(clus4$centers[,c("Sepal.Length", "Sepal.Width")], col=1:4, pch=8, cex=2)

# trying with 3 clusters
clus3 <- kmeans(mydata, centers=3, nstart=20)
clus3

# get cluster means
aggregate(mydata ,by=list(clus3$cluster), FUN=mean)

# append cluster assignment
mydata3 <- data.frame(mydata, cluster=clus3$cluster)

#summary
groups <- data.frame(clus3$cluster)
table(groups)

plot(mydata3[c("Sepal.Length", "Sepal.Width")], col=clus3$cluster)
points(clus3$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)
