Principal Component Analysis Cluster Plots with Plotly

[This article was first published on R – Modern Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Problem

When clustering data using principal component analysis, it is often of interest to visually inspect how well the data points separate in 2-D space based on principal component scores. While this is fairly straightforward to visualize with a scatterplot, the plot can become cluttered quickly with annotations as shown in the following figure:

Solution using ggrepel

The ggrepel package by Kamil Slowikowski implements functions to repel overlapping text labels away from each other and away from the data points that they label. It’s an easy to use package that works well in this example as shown in the following figure:

Solution using plotly

An alternative solution is to use interactive plots that are usable from the R console, in the RStudio viewer pane, in R Markdown documents, and in Shiny apps. Annotations can be viewed by hovering the mouse pointer over a point or dragging a rectangle around the relevant area to zoom in. Interactive plots using plotly allow you to de-clutter the plotting area, include extra annotation information and create interactive web-based visualizations directly from R. Once uploaded to a plotly account, plotly graphs (and the data behind them) can be viewed and modified in a web browser.

The resulting plot is clean and not cluttered with text annotations. While the ggrepel package provides a nice solution in this example, the plotly solution will be even more useful with a larger number of data points.

The Code

Principal Component Analysis and Hierarchical Clustering

# cor = TRUE indicates that PCA is performed on 
# standardized data (mean = 0, variance = 1)
pcaCars <- princomp(mtcars, cor = TRUE)

# view objects stored in pcaCars
names(pcaCars)

# proportion of variance explained
summary(pcaCars)

# scree plot
plot(pcaCars, type = "l")

# cluster cars
carsHC <- hclust(dist(pcaCars$scores), method = "ward.D2")

# dendrogram
plot(carsHC)

# cut the dendrogram into 3 clusters
carsClusters <- cutree(carsHC, k = 3)

# add cluster to data frame of scores
carsDf <- data.frame(pcaCars$scores, "cluster" = factor(carsClusters))
carsDf <- transform(carsDf, cluster_name = paste("Cluster",carsClusters))

First figure using ggplot2

library(ggplot2)
p1 <- ggplot(carsDf,aes(x=Comp.1, y=Comp.2)) +
      theme_classic() +
      geom_hline(yintercept = 0, color = "gray70") +
      geom_vline(xintercept = 0, color = "gray70") +
      geom_point(aes(color = cluster), alpha = 0.55, size = 3) +
      xlab("PC1") +
      ylab("PC2") + 
      xlim(-5, 6) + 
      ggtitle("PCA Clusters from Hierarchical Clustering of Cars Data") 

p1 + geom_text(aes(y = Comp.2 + 0.25, label = rownames(carsDf)))

Second figure using ggplot2 with ggrepel

library(ggplot2)
library(ggrepel)

p1 + geom_text_repel(aes(y = Comp.2 + 0.25, label = rownames(carsDf)))

Interactive plot using plotly

library(plotly)
p <- plot_ly(carsDf, x = Comp.1 , y = Comp.2, text = rownames(carsDf),
             mode = "markers", color = cluster_name, marker = list(size = 11)) 

p <- layout(p, title = "PCA Clusters from Hierarchical Clustering of Cars Data", 
       xaxis = list(title = "PC 1"),
       yaxis = list(title = "PC 2"))

p

References

PCA with R by Gaston Sanchez

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)