PCA – hierarchical tree – partition: Why do we need to choose when visualizing data?


Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering.

But principal component methods also provide a framework for visualizing data. Clustering results can therefore be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is drawn in 3D on the principal component map (using the first two components obtained with PCA). A partition has then been made, and the individuals are coloured according to the cluster they belong to.

[Figure: hierarchical tree drawn in 3D on the PCA map of the temperature data]

Thus, the graph simultaneously shows the information given by the principal component map, the hierarchical tree and the clusters (see the function HCPC in the FactoMineR package).

library(FactoMineR)

# Read the temperature data (cities in rows, monthly temperatures in columns)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
      header=TRUE, sep=";", dec=".", row.names=1)

# PCA on the first 23 cities (the active individuals), keeping all components
# (ncp=Inf) so that the clustering is done on the full set of dimensions;
# columns 13 to 16 are supplementary quantitative variables, column 17 is a
# supplementary categorical variable
res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf,
     graph=FALSE, quanti.sup=13:16, quali.sup=17)

# Hierarchical clustering on the principal components, followed by a partition
res.hcpc <- HCPC(res.pca)
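
To explore the result, a few examples based on the components documented in FactoMineR's ?HCPC help page (the plot choices "3D.map" and "map", the data.clust data frame and the desc.var description) might look like this; output details may vary across package versions:

plot(res.hcpc, choice="3D.map")   # hierarchical tree drawn in 3D on the principal component map
plot(res.hcpc, choice="map")      # factor map with the individuals coloured by cluster
head(res.hcpc$data.clust)         # original data with a supplementary 'clust' column
res.hcpc$desc.var                 # description of the clusters by the variables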

The approaches complement one another in two ways:

  • firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a unique framework;
  • secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, which are built from more dimensions, offer some information “outside of the map”: two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (because they are far from one another along the other dimensions), as the short check sketched below illustrates.
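
A minimal check of this second point, assuming the output names documented in FactoMineR (res.pca$ind$coord for the individuals' coordinates and res.hcpc$data.clust for the data with its supplementary 'clust' column):

coord2d <- res.pca$ind$coord[, 1:2]                  # positions on the 2D map
clust   <- as.character(res.hcpc$data.clust[rownames(coord2d), "clust"])

d2d  <- as.matrix(dist(coord2d))       # distances on the first plane only
same <- outer(clust, clust, "==")      # TRUE when two cities share a cluster

# Pairs that are close on the map but belong to different clusters are
# separated along the remaining dimensions
summary(d2d[!same & upper.tri(d2d)])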

So why do we need to choose when we want to better visualize the data?

The example shows the joint use of PCA and clustering methods, but instead of PCA we can use correspondence analysis on a contingency table, or multiple correspondence analysis on categorical variables.
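
For the categorical case, a minimal sketch with MCA, assuming the 'tea' survey dataset shipped with FactoMineR; the column roles (19 as supplementary quantitative, 20 to 36 as supplementary categorical) follow the example in the ?MCA help page:

data(tea)
res.mca <- MCA(tea, ncp=20, quanti.sup=19, quali.sup=20:36, graph=FALSE)
res.hcpc.mca <- HCPC(res.mca, nb.clust=-1, graph=FALSE)  # nb.clust=-1: automatic cut of the tree
plot(res.hcpc.mca, choice="3D.map")                      # tree drawn on the MCA map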

If you want to learn more, you can watch this video, enroll in this (free) MOOC, or read this unpublished paper.

