How to perform PCA with R?

[This article was first published on François Husson, and kindly contributed to R-bloggers.]

This post shows how to perform PCA with R and the package FactoMineR.

If you want to learn more about methods such as PCA, you can enroll in this MOOC (everything is free): MOOC on Exploratory Multivariate Data Analysis

Dataset

Here is a wine dataset, with 10 wines and 27 sensory attributes (like sweetness, bitterness, fruity odor, and so on), 2 preference variables, and a qualitative variable corresponding to the wine labels (there are 2 labels, Sauvignon and Vouvray). The values in the data table correspond to the average score given by several judges for the same wine and descriptive variable. The aim of doing PCA here is to characterize the wines according to their sensory characteristics.

Performing PCA … with additional information

Here are the lines of code used. Note that we use the information given by the qualitative variable.

### Read data
wine

### Loading FactoMineR
library(FactoMineR)

### PCA with supplementary variables
res

### Print the main results
summary(res)
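The exact arguments of the call are not shown above, so here is a minimal sketch of what a PCA with supplementary variables can look like in FactoMineR. It assumes a stand-in table (`toy`, filled with random numbers) that only mimics the layout described in the Dataset section: the label in column 1, the 27 sensory attributes in columns 2 to 28, and the 2 preference variables in columns 29 and 30. The column positions, the variable names and the object `res_toy` are assumptions made for illustration, not the real wine data.

```r
library(FactoMineR)

set.seed(1)
n_wines <- 10

# Stand-in table with the layout described in the post.
# All names and values are made up; this is NOT the real wine data.
sensory <- as.data.frame(matrix(rnorm(n_wines * 27), nrow = n_wines))
names(sensory) <- paste0("attr", 1:27)
preference <- data.frame(pref1 = rnorm(n_wines), pref2 = rnorm(n_wines))

toy <- data.frame(
  Label = factor(rep(c("Sauvignon", "Vouvray"), each = n_wines / 2)),
  sensory,
  preference
)
rownames(toy) <- paste0("wine", seq_len(n_wines))

# quali.sup / quanti.sup take the column indices of the supplementary
# variables (assumed positions: label = 1, preferences = 29 and 30).
# Supplementary variables do not take part in building the dimensions.
res_toy <- PCA(toy, quali.sup = 1, quanti.sup = 29:30, graph = FALSE)

# Eigenvalues and percentages of variance for the first dimensions
round(head(res_toy$eig, 3), 2)

# Only the 27 sensory attributes are active variables
dim(res_toy$var$coord)

# The supplementary variables get coordinates of their own
rownames(res_toy$quanti.sup$coord)
rownames(res_toy$quali.sup$coord)
```

With the real table, the same call would use the actual positions of the label and preference columns. Setting `graph = FALSE` only suppresses the two default graphs.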

Two graphs are given by default, one for the individuals, one for the quantitative variables.

But it is interesting to consider the qualitative variable to better understand the differences between wines. Wines are colored according to their label.

### Drawing wines according to the label
plot(res, habillage = "Label")

[Figure: pca_wine, graph of the individuals with wines colored according to their label]

Interpretation

The graph of the individuals shows, for instance, that S Michaud and S Trotignon are very “close”. It means that the scores for S Michaud and S Trotignon are approximately the same, whatever the variable. In the same way, Aub Marigny and Font Coteaux are wines with similar sensory scores for the 27 attributes. On the other hand, Font Brulés and S Trotignon have very different sensory profiles, because the first principal component, representing the main axis of variability between wines, separates them strongly.
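To make the notion of "close" concrete, here is a small sketch using base R's `prcomp` as a stand-in for FactoMineR, on made-up data in which `wine2` is built as a near copy of `wine1`. The data, the names and the choice of `prcomp` are assumptions for illustration only. The sketch compares the distance between two wines on the first plane with their distance over all standardized variables.

```r
set.seed(42)
n <- 10

# Made-up scores: 10 wines, 6 attributes (not the real data)
X <- matrix(rnorm(n * 6), nrow = n,
            dimnames = list(paste0("wine", 1:n), paste0("attr", 1:6)))

# wine2 is built to have almost the same scores as wine1
X["wine2", ] <- X["wine1", ] + rnorm(6, sd = 0.05)

# PCA on standardized variables
pca <- prcomp(X, scale. = TRUE)

# Distances on the first plane (dimensions 1 and 2) ...
d_plane <- as.matrix(dist(pca$x[, 1:2]))
# ... and distances computed on all standardized variables
d_full <- as.matrix(dist(scale(X)))

# Wines with nearly identical scores are also close on the plane
round(c(plane = d_plane["wine1", "wine2"],
        full  = d_full["wine1", "wine2"]), 3)

# For comparison, the typical distance between two wines
round(c(plane = median(d_plane[upper.tri(d_plane)]),
        full  = median(d_full[upper.tri(d_full)])), 3)
```

Since the plane is a projection, a distance on the plane can never exceed the distance over all variables. This is why wines with similar scores always appear close on the graph.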

The variables astringency, visual intensity, mushroom odor and candied fruit odor, found to the right, have correlations close to 1 with the first dimension. Because these correlations are close to 1, the values of these variables move in the same direction as the coordinates on the 1st dimension: wines with a small coordinate on the 1st dimension have low values for these variables, and wines with a large coordinate have high values. Thus, the wines to the right of the plot, which have large (positive) coordinates on the 1st dimension, have high values for these variables, while the wines to the left have low values for them.

For the variables passionfruit odor, citrus odor and freshness, everything is the other way around. Their correlation with the 1st dimension is close to -1, so their values move in the opposite direction: wines with a small coordinate on the 1st dimension have high values for these variables, and wines with a large coordinate have low values.
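The link between a variable's correlation with a dimension and the values taken by the wines can be checked on a toy example. The sketch below again uses base R's `prcomp` as a stand-in for FactoMineR, on made-up data in which two variables follow a hypothetical gradient and two others go against it. The variable names are borrowed from the text for readability, but the values are simulated.

```r
set.seed(7)
n <- 10

# Hypothetical underlying gradient between wines (simulated)
latent <- as.numeric(scale(seq_len(n)))

# Two variables follow the gradient, two go against it
X <- cbind(
  astringency = latent + rnorm(n, sd = 0.2),
  mushroom    = latent + rnorm(n, sd = 0.2),
  citrus      = -latent + rnorm(n, sd = 0.2),
  freshness   = -latent + rnorm(n, sd = 0.2)
)
rownames(X) <- paste0("wine", 1:n)

pca <- prcomp(X, scale. = TRUE)
dim1 <- pca$x[, 1]          # coordinates of the wines on the 1st dimension

# Correlation of each variable with the 1st dimension.
# The sign of a principal component is arbitrary, so only the
# relative signs between variables are meaningful.
r <- cor(X, dim1)[, 1]
round(r, 2)

# Compare the wines at the two ends of the 1st dimension
left_most  <- names(which.min(dim1))
right_most <- names(which.max(dim1))
round(X[c(left_most, right_most), ], 2)
```

The wine with the largest coordinate has the highest values for the variables whose correlation is close to 1, and the lowest values for those whose correlation is close to -1. The wine at the other end shows the reverse pattern.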

Overall, we see that the first dimension separates wines that are considered fruity and flowery (on the left) from wines that are woody or have vegetal odors (on the right). This is the main source of variability.

So then, how can we interpret the 2nd dimension, the vertical axis? At the top, wines have large values on the vertical axis. Since the correlation coefficients between the 2nd dimension and variables such as acidity or bitterness are close to 1, wines at the top take large values for these variables. Wines at the bottom have small values on the 2nd dimension, and thus small values for these variables. For sweetness, the correlation coefficient is close to -1, so wines with a small value on the 2nd dimension are sweet, while wines with large values are not.

Overall, the 2nd dimension separates the wines at the top, acidic and bitter, from sweet wines at the bottom.
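Reading an axis in this way amounts to sorting the variables by their correlation with that dimension. Here is a minimal sketch with FactoMineR on simulated data built from two orthogonal gradients. The table `toy`, its values and the object `res_toy` are assumptions for illustration, and the variable names are borrowed from the text. The correlations are taken from `res_toy$var$cor`, and `dimdesc()` gives an automatic description of the dimensions.

```r
library(FactoMineR)

set.seed(123)
n <- 10

# Two orthogonal hypothetical gradients (simulated, not the real data)
lat <- poly(seq_len(n), 2)
g1 <- as.numeric(scale(lat[, 1]))
g2 <- as.numeric(scale(lat[, 2]))
noise <- function() rnorm(n, sd = 0.2)

toy <- data.frame(
  astringency      = g1 + noise(),
  mushroom         = g1 + noise(),
  visual_intensity = g1 + noise(),
  citrus           = -g1 + noise(),
  freshness        = -g1 + noise(),
  acidity          = g2 + noise(),
  bitterness       = g2 + noise(),
  sweetness        = -g2 + noise()
)

res_toy <- PCA(toy, graph = FALSE)

# Correlations between the variables and the 2nd dimension, sorted.
# The sign of a dimension is arbitrary: what matters is which
# variables sit at opposite ends.
cor_dim2 <- sort(res_toy$var$cor[, "Dim.2"])
round(cor_dim2, 2)

# dimdesc() lists the variables most correlated with each dimension
dimdesc(res_toy, axes = 1:2)$Dim.2
```

In this toy example, acidity and bitterness end up at one end of the 2nd dimension and sweetness at the other, which is the same kind of opposition as the one described above.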

