Principal Component Analysis (PCA) from Scratch
How to perform PCA step by step using R and basic linear algebra functions and operations.
What is PCA?
PCA is an exploratory data analysis technique based on dimension reduction. The general idea is to reduce the data set so that it has fewer dimensions while preserving as much information as possible.
PCA allows us to make visual representations of the data in two dimensions and check for groups or differences related to different states, treatments, etc. It also gives us clues about which variables in the data are responsible for the visual differences.
It is important to highlight that PCA is not used exclusively for the above and, since it is an exploratory analysis, similarities or differences in the data must be considered in the context the data come from.
A simple case with data in two dimensions
Let’s start simple, with data containing just two variables, to show some core concepts behind PCA. We will then be able to generalize to data with more dimensions.
Data
I have simulated the following data:
set.seed(1)

# Variable 1
var_1 <- rnorm(50, 50, sd = 3)

# Variable 2
var_2 <- .5*var_1 + rnorm(50, sd = sqrt(3))

# Both variables in a data frame
data_set_1 <- data.frame(var_1, var_2)
head(data_set_1)
##      var_1    var_2
## 1 48.12064 24.74986
## 2 50.55093 24.21540
## 3 47.49311 24.33739
## 4 54.78584 25.43681
## 5 50.98852 27.97633
## 6 47.53859 27.19945
A scatter plot shows the spread and the possible relationship between both variables:
library(ggplot2)

# A scatter plot with the two simulated variables
ggplot(data_set_1, aes(x = var_1, y = var_2)) +
  geom_point(color = "blue", size = 2) +
  xlab("Variable 1") +
  ylab("Variable 2") +
  theme_classic()
Center each variable
The first step in PCA is to center both variables:
library(dplyr)

# Subtract the mean from each variable
data_set_1 <- data_set_1 %>%
  mutate(varc_1 = var_1 - mean(var_1),
         varc_2 = var_2 - mean(var_2))
head(data_set_1)
##      var_1    var_2     varc_1      varc_2
## 1 48.12064 24.74986 -2.1807063 -0.60402890
## 2 50.55093 24.21540  0.2495851 -1.13848362
## 3 47.49311 24.33739 -2.8082307 -1.01649408
## 4 54.78584 25.43681  4.4844976  0.08291914
## 5 50.98852 27.97633  0.6871785  2.62244372
## 6 47.53859 27.19945 -2.7627500  1.84556287
Note that the above does not modify the relative positions of the points, so the data look the same:
# Scatter plot for the centered data
ggplot(data_set_1, aes(x = varc_1, y = varc_2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  theme_classic()
Calculate the covariance matrix
You can calculate the covariance matrix for a given set of centered variables with a matrix multiplication, as follows:
# Select just the centered variables
data_set_2 <- data_set_1 %>%
  select(varc_1, varc_2) %>%
  as.matrix()

# Calculate the covariance matrix
cov_m <- (t(data_set_2) %*% data_set_2) / (nrow(data_set_2) - 1)
cov_m
##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523
In this case, the diagonal contains the variance of each variable, while the values off the diagonal are the covariances between them.
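In matrix notation, with X the n × p matrix of centered variables (here p = 2) and n the number of observations, what the code above computes is

\[
C = \frac{X^{\top} X}{n - 1}
\]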
The same result can be obtained with the cov function:

# Covariance matrix using cov()
cov(data_set_2)
##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523
Or with the crossprod function:

# Covariance matrix using crossprod()
crossprod(data_set_2) / (nrow(data_set_2) - 1)
##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523
Obtain the eigenvalues and eigenvectors from the covariance matrix
Principal components represent the directions of the data that explain the maximal amount of variance; they are “lines” that capture most of the information in the data. These directions can be obtained by calculating the eigenvalues and eigenvectors of the covariance matrix:
# Use eigen() to obtain eigenvectors and eigenvalues
cov_e <- eigen(cov_m)

# Eigenvectors
e_vec <- cov_e$vectors

# Eigenvalues
e_val <- cov_e$values
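As a quick sanity check (a minimal sketch using the objects just created), each eigenpair should satisfy the defining equation, cov_m %*% v = lambda * v:

# The first eigenpair satisfies cov_m %*% v = lambda * v
all.equal(
  as.vector(cov_m %*% e_vec[, 1]),
  e_val[1] * e_vec[, 1]
)
# Expected to return TRUE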
The span of each eigenvector can be seen as one of those “lines” that capture variation:
# First eigenvector
ev_1 <- e_vec[, 1]

# First eigenvector slope
ev1_m <- ev_1[2] / ev_1[1]

# Second eigenvector
ev_2 <- e_vec[, 2]

# Second eigenvector slope
ev2_m <- ev_2[2] / ev_2[1]

# Scatter plot showing the span of both eigenvectors
ggplot(data.frame(data_set_2), aes(x = varc_1, y = varc_2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  geom_abline(slope = ev1_m, color = "blue", size = 0.7) +
  geom_abline(slope = ev2_m, color = "red", size = 0.7) +
  theme_classic()
As you can see, there is one eigenvector for each variable in the data set, in this case two. Note that the eigenvectors are perpendicular:
# Multiply both eigenvectors
ev_1 %*% ev_2
##      [,1]
## [1,]    0
Regarding the eigenvalues, each one equals the variance of the data projected onto the corresponding eigenvector, that is, the sum of squared distances of the projected points from the origin divided by n − 1. This variance is maximized in the first principal component.
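You can verify this numerically with a minimal sketch using the objects defined above; the variance of the data projected onto the first eigenvector should match the first eigenvalue:

# Project the centered data onto the first eigenvector
proj_1 <- as.vector(data_set_2 %*% ev_1)

# The variance of the projections (their sum of squares over n - 1,
# since they have mean zero) should match the first eigenvalue
var(proj_1)
# Expected to be approximately equal to e_val[1]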
Make a Scree Plot
Each eigenvalue of the covariance matrix already estimates the variance accounted for by the corresponding principal component; the division by n − 1 (n being the number of rows in the original data) happened when we built the covariance matrix, so no further scaling is needed. The sum of all the variances (the total variance) can be used to calculate the percentage of variation of each component and visualize the percentages with a scree plot:
# Each eigenvalue is already the estimated variance
# of the corresponding principal component
e_var <- e_val

# Data frame with variance percentages
var_per <- data.frame(
  PC  = c("PC1", "PC2"),
  PER = c(e_var) * 100 / sum(e_var)  # Calculate the percentage
)

# Scree plot
ggplot(var_per, aes(x = PC, y = PER)) +
  geom_col(width = 0.5, color = "black") +
  xlab("Principal component") +
  ylab("Percentage of variation (%)") +
  theme_classic()
Loading scores
The eigenvectors retrieved by the function eigen are normalized, meaning that their length (norm) is 1:

# Norm of the first eigenvector
norm(as.matrix(ev_1), "F")
## [1] 1

# Norm of the second eigenvector
norm(as.matrix(ev_2), "F")
## [1] 1
The elements of each eigenvector are also called loadings. They can be interpreted as the contribution of each variable in the data set to the corresponding principal component or, more strictly, as the coefficients of the linear combination of the original variables from which the principal components are constructed.
You can make a table with these values and see the contributions of each variable to each principal component:
# Data frame with both eigenvectors
loads <- data.frame(
  VAR = c("var_1", "var_2"),
  PC1 = ev_1,  # First eigenvector
  PC2 = ev_2   # Second eigenvector
)
loads
##     VAR        PC1        PC2
## 1 var_1 -0.8134113  0.5816890
## 2 var_2 -0.5816890 -0.8134113
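To see the “linear combination” interpretation in action, here is a minimal sketch: the PC1 coordinate of an observation is just the dot product of its centered values with the PC1 loadings.

# PC1 coordinate of the first observation as a linear combination
# of its centered variables, weighted by the loadings
sum(data_set_2[1, ] * loads$PC1)

# The same for all observations at once
head(data_set_2 %*% ev_1)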
The above can be useful with high-dimensional data to establish which variables drive the clusters or differences seen in a PCA plot.
Represent the data in lower dimensions
If we change the basis of the original data to the one given by the eigenvectors, we are essentially rotating the data. Note that, for a single point stored as a column vector, the new coordinates are obtained by multiplying with the inverse of the eigenvector matrix; since our observations are stored as rows, we multiply the data matrix by the transpose of that inverse:
# Inverse of the eigenvector matrix
inv_evec <- solve(e_vec)

# Change the basis of the original data
# (observations are rows, so we multiply by the transpose of the inverse,
# which equals e_vec itself because the eigenvector matrix is orthogonal)
data_set_3 <- data_set_2 %*% t(inv_evec)

# Scatter plot showing the rotation
ggplot(data.frame(data_set_3), aes(X1, X2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  xlab("PC1 (78.8%)") +
  ylab("PC2 (21.2%)") +
  theme_classic()
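A note on the inverse: the eigenvector matrix of a symmetric matrix is orthogonal, so its inverse is simply its transpose. This is also the convention used by prcomp, which computes the scores as the centered data multiplied by the rotation (eigenvector) matrix. A minimal check:

# For an orthogonal matrix, the inverse equals the transpose
all.equal(solve(e_vec), t(e_vec))
# Expected to return TRUE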
You can compare both graphs to get an idea of how the data have been rotated by the change of basis:
library(ggpubr)

# Scatter plot with the centered data
plot_data <- ggplot(data.frame(data_set_2), aes(x = varc_1, y = varc_2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  ylim(c(-8, 8.5)) +
  ggtitle("Original Data") +
  theme_classic()

# Scatter plot with the rotated data
plot_rotation <- ggplot(data.frame(data_set_3), aes(X1, X2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  xlab("PC1 (78.8%)") +
  ylab("PC2 (21.2%)") +
  ylim(c(-8, 8.5)) +
  ggtitle("Change of Basis to Eigenvectors") +
  theme_classic()

# Both graphs side by side
ggarrange(plot_data, plot_rotation)
Since PC1 explains most of the variance in the data, we can omit PC2 and represent each point in just one dimension, shown here as red points:
# Data points just from PC1
data_pc1 <- data.frame(v1 = data_set_3[, 1],
                       v2 = rep(0, nrow(data_set_3)))

# Scatter plot showing the projected points from PC1 (red points)
ggplot(data.frame(data_set_3), aes(X1, X2)) +
  geom_point(color = "blue", size = 2) +
  geom_point(data = data_pc1, aes(v1, v2), color = "red", size = 2) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  xlab("PC1 (78.8%)") +
  ylab("PC2 (21.2%)") +
  ylim(c(-8, 8.5)) +
  theme_classic()
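If you also want that one-dimensional approximation expressed back in the space of the centered variables, a minimal sketch is to project onto the first eigenvector and then map back with its transpose:

# Coordinates along PC1
scores_pc1 <- data_set_2 %*% ev_1

# Rank-one reconstruction of the centered data using only PC1
recon_pc1 <- scores_pc1 %*% t(ev_1)
head(recon_pc1)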
The ideas above can be used in data with many variables to reduce dimensions and represent the data with 2D plots.
Generalization to more dimensions
Now let’s apply the same procedure to a data set with more variables, specifically data from wine samples on which fourteen responses have been measured (the columns of the data set shown below).
The wine samples come from Argentina, Chile, Australia, and South Africa.
The first six rows of the data set look like this:
# Variable names
var_names <- read.csv("data/Label_Pred_values_IR.csv")
var_names <- names(var_names)

# Wine labels
wine_label <- read.csv("data/Label_Wine_samples.csv", header = FALSE)
wine_label <- unname(unlist(wine_label))

# Wine data set
data_set_wine <- read.csv(
  "data/Pred_values.csv",
  header = FALSE,
  row.names = wine_label,
  col.names = var_names
)
head(data_set_wine)
##          Ethanol TotalAcid VolatileA MalicAcid   pH LacticAc ReSugar CitricAcid
## ARG-BNS1   13.62      3.54      0.29      0.89 3.71     0.78    1.46       0.31
## ARG-DDA1   14.06      3.74      0.59      0.24 3.73     1.25    2.42       0.18
## ARG-FFL1   13.74      3.27      0.47     -0.07 3.87     1.13    1.52       0.39
## ARG-FLM1   13.95      3.66      0.47      0.09 3.79     1.00    4.17       0.41
## ARG-ICR1   14.47      3.66      0.38      0.61 3.70     0.81    1.25       0.14
## ARG-SAL1   14.61      3.45      0.52      0.16 3.92     1.76    1.40       0.10
##             CO2 Density FolinC Glycerol Methanol TartaricA
## ARG-BNS1  85.61    0.99  60.92     9.72     0.16      1.74
## ARG-DDA1 175.20    1.00  70.64    10.05     0.20      1.58
## ARG-FFL1 513.74    0.99  63.59    10.92     0.18      1.24
## ARG-FLM1 379.40    1.00  73.30     9.69     0.23      2.26
## ARG-ICR1 154.88    0.99  71.69    10.81     0.20      1.22
## ARG-SAL1 156.30    0.99  71.79    10.19     0.19      0.90
Note that the wine samples are the row names of the data frame, and each sample has a value for each response (the data columns). Pay attention to this orientation when calculating covariance matrices or using functions that perform PCA: do you want to reduce the dimensions with respect to the rows or with respect to the columns? Here we are interested in reducing the number of dimensions with respect to the responses, to try to spot similarities or differences between wine samples.
Center each variable and divide by the standard deviation
Subtract from each variable (column) its mean, and divide by its standard deviation:
library(purrr)

# Means for each variable
var_means <- unlist(map(data_set_wine, mean))

# Standard deviation for each variable
var_sd <- unlist(map(data_set_wine, sd))

# Center each variable
data_set_wine_2 <- map2(
  data_set_wine, var_means,
  .f = function(x, mean) x - mean
)

# Divide by the standard deviation of each variable
data_set_wine_2 <- map2(
  data_set_wine_2, var_sd,
  .f = function(x, sd) x / sd
)

# Make a matrix from the previous list
data_set_wine_2 <- as.matrix(data.frame(data_set_wine_2))
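The map2 calls make each step explicit, but note, as a quick equivalence check (assuming the same column order), that base R’s scale does the centering and the scaling in one call:

# scale() centers each column and divides it by its standard deviation
data_set_wine_scaled <- scale(as.matrix(data_set_wine))
all.equal(data_set_wine_2, data_set_wine_scaled, check.attributes = FALSE)
# Expected to return TRUE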
Each row in the transformed data corresponds to the same wine sample as in the original data.
The first six rows of the transformed data look like this:
head(data_set_wine_2)
##          Ethanol   TotalAcid  VolatileA  MalicAcid         pH    LacticAc
## [1,] -0.63678490 -0.40870999 -0.7811641  2.2627949 0.18823560 -0.88226788
## [2,]  0.29224039  0.01096889  1.3231961 -0.5694123 0.40060358  0.38388752
## [3,] -0.38341492 -0.97527630  0.4814521 -1.9201573 1.88717937  0.06061381
## [4,]  0.05998306 -0.15690246  0.4814521 -1.2229986 1.03770749 -0.28959935
## [5,]  1.15792166 -0.15690246 -0.1498560  1.0427673 0.08205162 -0.80144937
## [6,]  1.45351897 -0.59756526  0.8321787 -0.9179917 2.41810183  1.75780072
##         ReSugar CitricAcid        CO2    Density     FolinC     Glycerol
## [1,] -0.5311188  1.1594920 -1.6049333 -0.7844233 -0.1371321 -0.525885112
## [2,]  0.1996454  0.1084153 -0.9482843  1.2458487  1.2376582 -0.160707984
## [3,] -0.4854461  1.8063082  1.5330409 -0.7844233  0.2405111  0.802031811
## [4,]  1.5317676  1.9680124  0.5483973  1.2458487  1.6138874 -0.559083800
## [5,] -0.6909735 -0.2149929 -1.0972195 -0.7844233  1.3861700  0.680306454
## [6,] -0.5767916 -0.5384011 -1.0868116 -0.7844233  1.4003137 -0.005784994
##        Methanol   TartaricA
## [1,] -1.1081037  0.43877622
## [2,] -0.0870238 -0.03865555
## [3,] -0.5975635 -1.05319837
## [4,]  0.6787860  1.99042974
## [5,] -0.0870238 -1.11287729
## [6,] -0.3422939 -2.06774119
Dividing by the standard deviations is a way to give each variable the same importance regardless of its range, magnitude, and/or measurement scale. Other transformations besides dividing by the standard deviation are possible and may be applied depending on your data; see the resources at the end of this post if you are interested.
Calculate the covariance matrix
Multiply the transpose of the (standardized) data matrix by the data matrix itself, and divide by n − 1:
# Calculate the covariance matrix
cov_wine <- (t(data_set_wine_2) %*% data_set_wine_2) / (nrow(data_set_wine_2) - 1)
cov_wine[1:5, 1:5]
##              Ethanol  TotalAcid VolatileA   MalicAcid         pH
## Ethanol   1.00000000  0.3262321 0.2028382  0.04166778  0.1697347
## TotalAcid 0.32623209  1.0000000 0.4660575 -0.26067574 -0.3518303
## VolatileA 0.20283816  0.4660575 1.0000000 -0.74628880  0.3011611
## MalicAcid 0.04166778 -0.2606757 -0.7462888  1.00000000 -0.2929542
## pH        0.16973470 -0.3518303 0.3011611 -0.29295421  1.0000000
Here I am showing just the first five rows and columns of the covariance matrix. Remember, if you want to see the whole matrix you can use the function View, as View(cov_wine).
Each value in the cov_wine matrix has the same interpretation as in the example with two variables: diagonal values are the variances of each variable, and off-diagonal values are covariances between variables. As you can see, all variances are equal to 1. This is precisely the effect of dividing each variable by its standard deviation; together with the centering, it makes this covariance matrix the correlation matrix of the original variables.
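In fact, as a quick check, the covariance matrix of the standardized data should coincide with the correlation matrix of the original variables:

# Covariance of the standardized data equals the correlation
# matrix of the original data
all.equal(cov_wine, cor(as.matrix(data_set_wine)), check.attributes = FALSE)
# Expected to return TRUE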
Obtain the eigenvalues and eigenvectors from the covariance matrix
Use the function eigen to obtain the eigenvectors and their eigenvalues:
# eigen() to obtain eigenvalues and eigenvectors
eg_wine <- eigen(cov_wine)

# Eigenvalues
eg_vals <- eg_wine$values

# Eigenvectors
eg_vecs <- eg_wine$vectors
The number of eigenvectors and eigenvalues is the same as the number of variables in the original data set:
# Number of eigenvalues
length(eg_vals)
## [1] 14

# Number of eigenvectors
ncol(eg_vecs)
## [1] 14
Make a Scree Plot
Calculate the percentage of variation for each component and make a scree plot:
# Each eigenvalue is already the estimated variance
# of the corresponding principal component
eg_vars <- eg_vals

# Data frame with variance percentages
vars_perc <- data.frame(
  PC  = unlist(map(1:14, function(x) paste0("PC", x))),
  PER = round((eg_vars * 100) / sum(eg_vars), 4)
)

# Scree plot
ggplot(
  vars_perc,
  aes(x = reorder(PC, order(PER, decreasing = TRUE)), y = PER)
) +
  geom_col(width = 0.5, color = "black") +
  xlab("Principal component") +
  ylab("Percentage of variation (%)") +
  theme_classic()
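Beyond the scree plot, the cumulative percentages make it easy to see how many components are needed to reach a given share of the variation; a one-line sketch using the vars_perc data frame from above:

# Cumulative percentage of variation across the ordered components
cumsum(vars_perc$PER)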
Represent the data in lower dimensions
Ideally, if PC1 and PC2 gathered most of the variation, say more than 90%, it would be possible to make a good representation of the data in a 2D scatter plot. Real data is rarely that ideal; in this case PC1, PC2, PC3, and PC4 account for 73% of the variation, so I am going to try to spot clusters using two scatter plots, the first with PC1 and PC2 and the second with PC3 and PC4.
First, change the basis of the transformed data to the one given by the eigenvectors, in the same way as in the two-variable example:
# Change the basis of the data
# (observations are rows, so we multiply by the transpose of the inverse,
# which equals eg_vecs itself because the eigenvector matrix is orthogonal)
data_set_wine_eb <- data_set_wine_2 %*% t(solve(eg_vecs))

# Transform to a data frame
data_set_wine_eb <- data.frame(data_set_wine_eb)
colnames(data_set_wine_eb) <- vars_perc$PC

# Add a column with the origin of each wine sample
data_set_wine_eb <- data_set_wine_eb %>%
  mutate(
    WineSample = unlist(map(wine_label, function(x) substr(x, 1, 3)))
  ) %>%
  relocate(WineSample)
head(data_set_wine_eb)
##   WineSample        PC1        PC2        PC3        PC4         PC5
## 1        ARG -0.8340408  0.8560257  1.2477283 -0.2946354 -2.64962606
## 2        ARG  0.9182495 -0.7520145  0.8955902 -0.3967229  0.06460487
## 3        ARG  2.0109822  0.2126599 -0.4739304 -1.3446676  0.96328060
## 4        ARG  2.8454813 -1.9323804  0.5129401 -0.2805325 -0.56882676
## 5        ARG -0.4027013  0.7424389  0.3544802 -1.3925516 -0.86093254
## 6        ARG  0.4921766  0.6976268  0.9133225 -1.3858705  0.58195738
##           PC6        PC7         PC8         PC9       PC10        PC11
## 1 -0.09972511  0.6839127 -0.64470453  1.30296999  0.08192591  0.90864847
## 2 -1.15201976 -0.6156587  0.09652472 -0.66508130  0.22293709 -0.44745146
## 3 -0.97714541  0.7454403  1.13873701 -0.36070456 -1.80323468  1.57277920
## 4  0.04685162  0.9391345  1.05427365  0.06617118 -0.72767986 -1.40159736
## 5 -0.43249796 -1.2152728  0.19062688  1.02200106  0.80693595 -0.03070849
## 6 -1.12762870 -0.7447312  2.12377275 -1.45181400  0.27437060  1.29332051
##          PC12       PC13       PC14
## 1  0.09448821  0.2045349  0.2889872
## 2  0.60237715 -1.1576675 -0.2344966
## 3 -0.74179131  0.9287338 -0.2968222
## 4  0.97819193 -0.1854973  0.5890199
## 5 -0.89310589 -0.3806547 -1.1129547
## 6 -1.49226046 -1.9552497 -1.2846205
Now, let’s make both scatter plots, taking the values from PC1, PC2, PC3, and PC4.
# Scatter plot for PC1 and PC2
pc12 <- ggplot(
  data_set_wine_eb,
  aes(PC1, PC2, color = WineSample, shape = WineSample)
) +
  geom_point(size = 3) +
  ggtitle("PC1 and PC2") +
  xlab("PC1 (24.4%)") +
  ylab("PC2 (21.3%)") +
  theme_classic() +
  theme(legend.position = "none")

# Scatter plot for PC3 and PC4
pc34 <- ggplot(
  data_set_wine_eb,
  aes(PC3, PC4, color = WineSample, shape = WineSample)
) +
  geom_point(size = 3) +
  scale_color_discrete(
    name = "Wine origin",
    labels = c("Argentina", "Australia", "Chile", "South Africa")
  ) +
  scale_shape_discrete(
    name = "Wine origin",
    labels = c("Argentina", "Australia", "Chile", "South Africa")
  ) +
  ggtitle("PC3 and PC4") +
  xlab("PC3 (17.5%)") +
  ylab("PC4 (10.0%)") +
  theme_classic()

# Both graphs side by side
ggarrange(pc12, pc34, widths = c(1.5, 2))
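As a cross-check of the whole pipeline (a sketch; the signs of the components may be flipped relative to the manual calculation), base R’s prcomp performs the centering, scaling, and eigendecomposition in one call:

# The whole analysis with prcomp()
pca_wine <- prcomp(data_set_wine, center = TRUE, scale. = TRUE)

# Scores for the first four principal components
head(pca_wine$x[, 1:4])

# Standard deviations and proportion of variance per component
summary(pca_wine)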
Loading scores
Elements of each eigenvector represent the loading for each variable on the corresponding principal component:
# Data frame with loading scores
loads_wine <- data.frame(eg_vecs)
colnames(loads_wine) <- vars_perc$PC
rownames(loads_wine) <- var_names
head(loads_wine)
##                  PC1        PC2         PC3         PC4          PC5
## Ethanol   -0.2050720 -0.3452884  0.28833198 -0.33833697  0.001102612
## TotalAcid -0.1457210 -0.4545803 -0.06374909  0.42139270  0.041508880
## VolatileA  0.2959952 -0.4418538  0.03338975  0.08430388 -0.010891943
## MalicAcid -0.3401714  0.3173079  0.17907611 -0.04002812 -0.252870228
## pH         0.2413635 -0.1030772 -0.04459655 -0.66472873 -0.170592676
## LacticAc   0.3462627 -0.3745378 -0.16738658  0.11850141 -0.108643313
##                   PC6          PC7         PC8        PC9       PC10
## Ethanol   -0.02992588  0.106645513  0.40565761  0.2220668 -0.0736428
## TotalAcid -0.10721872  0.109080109 -0.15801516  0.1647662 -0.1264076
## VolatileA  0.12676353 -0.117134956 -0.01066998 -0.2323371  0.1969852
## MalicAcid  0.09495473  0.247290148 -0.35870967  0.1792700 -0.2720317
## pH         0.01717167 -0.191605072 -0.30040129  0.3352702  0.2882746
## LacticAc   0.05496474  0.008009813 -0.10169448  0.1377493 -0.2860564
##                  PC11        PC12        PC13        PC14
## Ethanol    0.52247393 -0.22486081  0.29025391  0.04195694
## TotalAcid -0.09209975 -0.25488777 -0.01537897 -0.65083548
## VolatileA  0.19114512 -0.33800897 -0.59864003  0.27743084
## MalicAcid -0.07338526 -0.53559762 -0.18854505  0.23066288
## pH        -0.21200578 -0.09998927 -0.02838824 -0.28223548
## LacticAc  -0.32247769 -0.11583440  0.49548494  0.45696212
Making scatter plots with these values can help reveal patterns of variation and/or explain why certain groupings are observed in the principal component scatter plots (previous figure):
# Scatter plot with loadings of PC1 and PC2
ld_pc12 <- ggplot(loads_wine, aes(PC1, PC2)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(aes(label = rownames(loads_wine)), hjust = -.2) +
  ggtitle("Loadings for PC1 and PC2") +
  xlim(c(-.7, .7)) +
  ylim(c(-.7, .7)) +
  xlab("PC1 (24.4%)") +
  ylab("PC2 (21.3%)") +
  theme_classic()

# Scatter plot with loadings of PC3 and PC4
ld_pc34 <- ggplot(loads_wine, aes(PC3, PC4)) +
  geom_point(color = "blue", size = 2) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(aes(label = rownames(loads_wine)), hjust = -.2) +
  ggtitle("Loadings for PC3 and PC4") +
  xlim(c(-.7, .7)) +
  ylim(c(-.7, .7)) +
  xlab("PC3 (17.5%)") +
  ylab("PC4 (10.0%)") +
  theme_classic()

# Both graphs side by side
ggarrange(ld_pc12, ld_pc34)
Wonderful resources
- As a rookie, I had to learn linear algebra to understand and visualize the core of PCA. The video series on YouTube by 3Blue1Brown is a terrific place to start with linear algebra:
- Intuitive explanations of PCA can be found in other YouTube channels:
Principal Component Analysis (PCA)
StatQuest: Principal Component Analysis (PCA), Step-by-Step
- If you want to go deeper, here are two papers aimed at chemometrics and metabolomics:
GitHub repository with all the code of this post
You can find all the code in this post in the following repository:
Principal Component Analysis from Scratch
Thanks a lot for visiting this site 🤓
This work is licensed under a Creative Commons Attribution 4.0 International License.