Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering.
But principal component methods also provide a framework to visualize data. Thus, the clustering results can be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D on the principal component map (using the first two components obtained with PCA). A partition is then made, and individuals are coloured according to the cluster they belong to.
Thus, the graph simultaneously gives the information provided by the principal component map, the hierarchical tree and the clusters (see the function HCPC in the FactoMineR package).
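library(FactoMineR)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
                          header=TRUE, sep=";", dec=".", row.names=1)
# PCA on the first 23 rows; columns 13 to 16 are supplementary quantitative
# variables and column 17 is a supplementary categorical variable
res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf,
               graph=FALSE, quanti.sup=13:16, quali.sup=17)
# Hierarchical clustering on the principal components
res.hcpc <- HCPC(res.pca)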
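By default, HCPC waits for a click on the tree to choose where to cut it. The sketch below is one possible non-interactive variant, not the analysis shown in the figure: it assumes the decathlon data set shipped with FactoMineR as a stand-in for the temperature file (so nothing has to be downloaded), lets HCPC pick the cut with nb.clust = -1, and then draws the three views separately.

```r
library(FactoMineR)

# Assumption: the decathlon data shipped with FactoMineR stands in for the
# temperature data, so the sketch runs without a download.
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13,
               ncp = Inf, graph = FALSE)

# nb.clust = -1 cuts the tree at the level suggested by HCPC instead of
# waiting for a click; graph = FALSE postpones the plots.
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)

plot(res.hcpc, choice = "3D.map")  # tree drawn in 3D above the factor map
plot(res.hcpc, choice = "tree")    # hierarchical tree alone
plot(res.hcpc, choice = "map")     # factor map, individuals coloured by cluster
```

Setting nb.clust to a positive integer instead would impose that number of clusters.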
The approaches complement one another in two ways:
- firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a single framework;
- secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).
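The second point can be checked numerically. The sketch below, again on the decathlon stand-in data (an assumption, not the temperature example), looks for the two individuals that are closest on the two-dimensional map while belonging to different clusters. The variable names are illustrative only.

```r
library(FactoMineR)

# Assumption: decathlon is used as example data, with an automatic tree cut.
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13,
               ncp = Inf, graph = FALSE)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)

# Coordinates on the map (first two components) and cluster of each individual
coord <- res.pca$ind$coord[, 1:2]
clust <- as.character(res.hcpc$data.clust[rownames(coord), "clust"])

# Distances on the map, kept only for pairs lying in different clusters
d.map <- as.matrix(dist(coord))
diff.clust <- outer(clust, clust, "!=")
d.map[!diff.clust] <- Inf

# Closest pair on the map that the clustering nevertheless separates
closest <- which(d.map == min(d.map), arr.ind = TRUE)[1, ]
print(rownames(coord)[closest])
print(clust[closest])
```

Such a pair looks similar on the map, but the clustering, which is built from all the retained components, indicates that the two individuals differ along the other dimensions.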
So why choose between the two approaches when the aim is to visualize the data better?
The example shows the joint use of PCA and clustering methods, but instead of PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables.
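As an illustration of the categorical case, here is a minimal sketch that chains MCA and HCPC. It assumes the tea data set shipped with FactoMineR (300 individuals, with column 19 quantitative and columns 20 to 36 treated as supplementary); the choice of ncp = 20 is an assumption made here to keep only the first dimensions before clustering.

```r
library(FactoMineR)

# Assumption: the tea survey data shipped with FactoMineR is used as example.
data(tea)

# MCA on the active categorical variables; keep the first 20 dimensions
res.mca <- MCA(tea, ncp = 20, quanti.sup = 19, quali.sup = 20:36,
               graph = FALSE)

# Clustering on the MCA coordinates, with an automatic cut of the tree
res.hcpc.mca <- HCPC(res.mca, nb.clust = -1, graph = FALSE)

# Number of individuals in each cluster
cluster.sizes <- table(res.hcpc.mca$data.clust$clust)
print(cluster.sizes)
```

The same plot calls as for PCA, for example plot(res.hcpc.mca, choice = "map"), then show the clusters on the MCA map.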