P is for Principal Components Analysis (PCA)

This month, I've talked about some approaches for testing the underlying latent variables (or factors) in a set of measurement data. Confirmatory factor analysis is one method of examining latent variables when you know ahead of time which observed variables are associated with which factor. But it isn't the only way to do this. Sometimes you need to test for the presence of latent variables. Exploratory factor analysis is one way, but if I'm doing exploratory latent variable analysis, I tend to prefer using an approach called principal components analysis instead.

Both principal components analysis and exploratory factor analysis have the goal of reducing the amount of data you have to work with, by looking for how individual observed variables hang together. While exploratory factor analysis has the goal of explaining why observed variables relate to each other (because of the influence of the latent variable on the observed variables), principal components analysis has the goal of simply describing the observed variables that relate to each other and appear to measure the same thing (together they result in a score on the latent variable). So a factor analysis model would be drawn with arrows going from the factor to the observed variables, while a principal components analysis would be drawn with arrows going from the observed variables to the component.

More mathematically speaking, principal components analysis explains variance while factor analysis explains covariance (between items). In Winsteps, the Rasch program I use most frequently, the default dimension reduction technique is principal components analysis (also called PCA), and its goal is mainly to ensure that you aren't violating a key assumption of Rasch: that each item measures only one underlying latent construct. We refer to this as the assumption of unidimensionality. If the PCA shows the data violate this assumption, the PCA results can be used to divide the items into subscales, and Rasch analysis can then be performed on each subscale separately.

PCA output includes many important components, which I'll go through once we get to the analysis. But two I want to highlight first are the eigenvalues and factor loadings. Eigenvalues are decompositions of variance that help uncover exactly how many factors are present in the data. Your analysis will produce multiple eigenvalues, which correspond to contrasts: splits of the items into different components based on patterns of correlation. If the first eigenvalue is high and the rest are small (less than 1 or 2, depending on who you ask), that means there is only one component. You want the number of high eigenvalues to be less than the number of items in your measure. If the number of high eigenvalues is equal to the number of items, that means the items don't hang together and aren't measuring the same thing. Basically, this information tells you how many subscales appear to be in your data.
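
To make the eigenvalue idea concrete, here's a minimal sketch of where those values come from (mydata is just a placeholder for any data frame of item responses, not a dataset from this post):

ev <- eigen(cor(mydata, use = "pairwise.complete.obs"))$values # eigenvalues of the items' correlation matrix
ev                   # one eigenvalue per item
sum(ev > 1)          # rough count of components by the "greater than 1" rule
plot(ev, type = "b") # scree plot: look for where the values level off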

Next are factor loadings, which range from -1 to 1 and tell you how strongly an item relates to a specific factor or component. Loadings greater than 0.3 or 0.4 in absolute value (again, depending on who you ask) mean an item should be associated with that factor. For most measures, you only want an item to have a high loading on one factor. So you want each item to have a high loading (closer to 1 is better) on one factor and loadings close to 0 on the rest of the factors.

You can easily conduct PCA using the psych package. I was originally going to write today’s post on the psych package, but it’s huge and there’s too much to cover. (Yes, I did just include “huge” and “package” in the same sentence on a stats blog.) There’s a great website, maintained by the package developer, William Revelle, where you can learn about the many capabilities offered by psych. I frequently use the psych package to run quick descriptive statistics on a dataset (with the describe function). And I also used psych to demonstrate Cronbach’s alpha way back at the beginning of this month (which feels like it was years ago instead of just a couple weeks). psych is a great general purpose package for psychological researchers with a heavy focus on psychometrics, so I’d definitely add it to your R packages if you haven’t already done so.
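
If you want to try a couple of the psych functions mentioned above, here's a quick sketch (mydata is again a placeholder for any data frame of item responses):

library(psych)
describe(mydata) # quick descriptive statistics for every column
alpha(mydata)    # Cronbach's alpha for the item set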

So for the sake of simplicity and brevity, I’ll focus today on the principal function in the psych package. To demonstrate, I’ll use the Facebook dataset, since it contains multiple scales I can use with the principal function. (This isn’t how you’d use PCA in practice – since these are developed scales with a known factor structure, confirmatory factor analysis is probably the best approach. But let’s pretend for now that these are simply items created toward new scales and that we’re looking for ways they might hang together.)

We’ll start by installing (if necessary)/loading the psych package, as well as reading in our Facebook dataset.

install.packages("psych")

## Installing package into '/R/win-library/3.4'
## (as 'lib' is unspecified)

library(psych)

## Warning: package 'psych' was built under R version 3.4.4

Facebook<-read.delim("small_facebook_set.txt", header=TRUE)

Now let's dig into the principal function. You can quickly view what arguments a function allows by requesting args:

args(principal)

## function (r, nfactors = 1, residuals = FALSE, rotate = "varimax", 
##     n.obs = NA, covar = FALSE, scores = TRUE, missing = FALSE, 
##     impute = "median", oblique.scores = TRUE, method = "regression", 
##     ...) 
## NULL

r refers to the object you want to analyze - in our case, it will be a data frame, but you could also use a correlation matrix, as long as you also specify n.obs. You can request that the program extract a certain number of factors by setting nfactors. The residuals argument simply asks whether you want the program to report residuals in your output.
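
To illustrate, here's a hedged sketch of those two ways of supplying r (mydata is a placeholder, and nfactors = 2 is purely illustrative):

pca_from_raw <- principal(mydata, nfactors = 2)                  # pass the raw data frame
R <- cor(mydata, use = "pairwise.complete.obs")
pca_from_cor <- principal(R, nfactors = 2, n.obs = nrow(mydata)) # pass a correlation matrix plus n.obs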

The next argument, rotate, is a bit more complicated. There are multiple rotation methods available: "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", and "cluster". But you're probably asking - what exactly is "rotation"? Rotation adds another step into the analysis to help understand and clarify the pattern of loadings. Basically, you can make one of two assumptions about the factors/components in your analysis: they are correlated (oblique) or they are uncorrelated (orthogonal). There are different rotation methods to use depending on which of these assumptions you make. You can learn more about rotation here. If you assume your factors are orthogonal, most would recommend using varimax, which is the default rotation method in the principal function. If you assume your factors are correlated, promax is often recommended. Some discourage rotation completely, which is why none is an option. We'll try all three for demonstration purposes.

Setting scores to TRUE asks the program to generate scale/subscale scores on the factor(s) identified. And missing plus impute is used when you have missing values and want the program to score results; it tells the program to impute missing values, with either "median" or "mean", when generating scores. method also relates to the component scores; with "regression", the default, scale/subscale scores are generated with a linear equation. Finally, oblique.scores refers to the matrix (structure versus pattern) used in conducting the PCA; this is a more advanced concept that we won't get into now. You can leave this argument out at the moment. But maybe I should dig into this topic more in a future post or two.
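
Here's a hedged sketch of a call that uses those scoring arguments together (again, mydata is a placeholder and nfactors = 2 is purely illustrative):

pca_scored <- principal(mydata,
                        nfactors = 2,
                        rotate = "varimax",
                        scores = TRUE,      # return component scores per respondent
                        missing = TRUE,     # impute missing responses before scoring
                        impute = "median")  # or "mean"
head(pca_scored$scores)                     # one column of scores per component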

We're now ready to try running a PCA with some of the Facebook data. Let's turn once again to our Ruminative Response Scale data. Remember that this measure can generate a total score, as well as scores on 3 subscales: Depression-Related Rumination, Brooding, and Reflecting. To make it easy to access just the RRS items, I'm going to create another data frame that copies over responses on these items. Then I'll run a PCA, using the default rotation method, on this data frame.

RRS<-Facebook[,3:24]
RRSpca<-principal(RRS)

Now I have an object called RRSpca that holds the results of this analysis. First, let's take a look at our eigenvalues:

RRSpca$values

##  [1] 8.8723501 1.6920860 1.1413928 1.1209452 1.0072739 0.8236229 0.7535618
##  [8] 0.6838558 0.6530697 0.6427205 0.5703730 0.5616363 0.4937120 0.4860600
## [15] 0.4218528 0.3830052 0.3388684 0.3185341 0.2863681 0.2675727 0.2468134
## [22] 0.2343256
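
Before interpreting these, a quick scree plot makes the pattern easier to see; a minimal sketch:

plot(RRSpca$values, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2) # conventional eigenvalue-greater-than-1 reference line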

Most of the variance is accounted for by the first component, which keeps all of the items together on a single dimension. This supports reporting a total score for the RRS. It doesn't support the presence of the 3 subscales, though. Let's take a look at the rest of our results:

RRSpca

## Principal Components Analysis
## Call: principal(r = RRS)
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   h2   u2 com
## Rum1  0.62 0.38 0.62   1
## Rum2  0.53 0.28 0.72   1
## Rum3  0.50 0.25 0.75   1
## Rum4  0.57 0.32 0.68   1
## Rum5  0.57 0.32 0.68   1
## Rum6  0.64 0.41 0.59   1
## Rum7  0.64 0.41 0.59   1
## Rum8  0.64 0.41 0.59   1
## Rum9  0.62 0.38 0.62   1
## Rum10 0.66 0.44 0.56   1
## Rum11 0.61 0.37 0.63   1
## Rum12 0.35 0.12 0.88   1
## Rum13 0.55 0.30 0.70   1
## Rum14 0.70 0.49 0.51   1
## Rum15 0.69 0.47 0.53   1
## Rum16 0.76 0.58 0.42   1
## Rum17 0.74 0.54 0.46   1
## Rum18 0.71 0.50 0.50   1
## Rum19 0.71 0.50 0.50   1
## Rum20 0.75 0.56 0.44   1
## Rum21 0.57 0.33 0.67   1
## Rum22 0.71 0.51 0.49   1
## 
##                 PC1
## SS loadings    8.87
## Proportion Var 0.40
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  725.56  with prob <  3.2e-58 
## 
## Fit based upon off diagonal values = 0.96

All of the factor loadings, shown in the column PC1, are greater than 0.3. The lowest loading is 0.35 for item 12. I tend to use a cut-off of 0.4, so I might raise an eyebrow at this item, and if I were developing a scale, I might consider dropping it. The h2 column is each item's communality, and u2 is its uniqueness; you'll also notice the two add up to 1. Communality refers to the variance an item shares with the other items (the variance explained by the component), while uniqueness is the variance the component doesn't explain - a mix of variance specific to that item and measurement error. The last column is complexity, which we'll ignore for now.
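
Since there's only one component here, h2 is just the squared loading and u2 is 1 minus that value; a quick check from the loadings themselves:

h2 <- RRSpca$loadings[, 1]^2
u2 <- 1 - h2
round(head(cbind(loading = RRSpca$loadings[, 1], h2, u2)), 2)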

Near the bottom of the output, you'll see SS (sum of squares) loadings, which is equal to your first eigenvalue, as well as proportion var - this tells us that 40% of the variance in responses can be explained by the first latent variable. Finally, you'll see some familiar fit indices - the RMSR (root mean square of the residuals), an empirical chi-square, and fit based upon off diagonal values (a goodness of fit statistic, with values closer to 1 indicating better fit).
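
You can verify those first two numbers directly: SS loadings is the sum of the squared loadings (which equals the first eigenvalue), and the proportion of variance is that eigenvalue divided by the number of items:

sum(RRSpca$loadings[, 1]^2)  # about 8.87, matching the first eigenvalue
RRSpca$values[1] / ncol(RRS) # about 0.40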

Overall, these values support moderate to good fit. Since there's only 1 factor/component, the rotation method doesn't really matter; remember, rotation deals with how multiple factors or components relate to each other, and we only have one. But let's try running these same data again, using the varimax rotation, and forcing the program to extract 3 components.

RRSpca_3<-principal(RRS, nfactors = 3)
RRSpca_3$values

##  [1] 8.8723501 1.6920860 1.1413928 1.1209452 1.0072739 0.8236229 0.7535618
##  [8] 0.6838558 0.6530697 0.6427205 0.5703730 0.5616363 0.4937120 0.4860600
## [15] 0.4218528 0.3830052 0.3388684 0.3185341 0.2863681 0.2675727 0.2468134
## [22] 0.2343256

RRSpca_3

## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3)
## Standardized loadings (pattern matrix) based upon correlation matrix
##         RC1  RC3   RC2   h2   u2 com
## Rum1   0.59 0.19  0.24 0.45 0.55 1.5
## Rum2   0.05 0.65  0.23 0.48 0.52 1.3
## Rum3   0.43 0.43 -0.09 0.38 0.62 2.1
## Rum4   0.24 0.73 -0.04 0.58 0.42 1.2
## Rum5   0.60 0.21  0.10 0.42 0.58 1.3
## Rum6   0.37 0.63  0.06 0.54 0.46 1.6
## Rum7   0.41 0.17  0.57 0.53 0.47 2.0
## Rum8   0.32 0.44  0.36 0.42 0.58 2.8
## Rum9   0.19 0.69  0.18 0.54 0.46 1.3
## Rum10  0.24 0.54  0.38 0.50 0.50 2.2
## Rum11  0.23 0.18  0.74 0.63 0.37 1.3
## Rum12 -0.02 0.04  0.72 0.52 0.48 1.0
## Rum13  0.58 0.17  0.14 0.39 0.61 1.3
## Rum14  0.29 0.69  0.21 0.61 0.39 1.5
## Rum15  0.70 0.24  0.18 0.58 0.42 1.4
## Rum16  0.54 0.47  0.27 0.59 0.41 2.5
## Rum17  0.70 0.15  0.40 0.67 0.33 1.7
## Rum18  0.69 0.25  0.23 0.59 0.41 1.5
## Rum19  0.57 0.47  0.13 0.56 0.44 2.1
## Rum20  0.44 0.33  0.56 0.61 0.39 2.6
## Rum21  0.27 0.10  0.71 0.59 0.41 1.3
## Rum22  0.43 0.41  0.40 0.51 0.49 3.0
## 
##                        RC1  RC3  RC2
## SS loadings           4.45 4.03 3.22
## Proportion Var        0.20 0.18 0.15
## Cumulative Var        0.20 0.39 0.53
## Proportion Explained  0.38 0.34 0.27
## Cumulative Proportion 0.38 0.73 1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

I showed the eigenvalues once again, so you can see that they're the same. These are based on the actual variance in the data, not on the solution you're forcing on the data. Taking a look at the factor loadings, we see that certain items loaded highly on more than one component. This is where the complexity column comes in: it tells us roughly how many factors an item loads on. Values close to 1 mean an item loads cleanly on one and only one factor. Values of 2 or greater suggest an item is measuring two or more latent constructs simultaneously. Depending on the scale you're trying to create and how you plan to use it, this could be very problematic.
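
One way to make those cross-loadings easier to spot is to print the loadings sorted, hiding anything below a chosen cutoff (0.4 here, though that threshold is a judgment call):

print(RRSpca_3$loadings, cutoff = 0.4, sort = TRUE)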

However, we do know that these subscales are correlated with each other. Let's see what happens when we use an oblique rotation instead.

RRSpca_3ob<-principal(RRS, nfactors = 3, rotate="promax")
RRSpca_3ob

## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3, rotate = "promax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##         RC1   RC3   RC2   h2   u2 com
## Rum1   0.67 -0.06  0.08 0.45 0.55 1.0
## Rum2  -0.29  0.79  0.15 0.48 0.52 1.3
## Rum3   0.41  0.39 -0.30 0.38 0.62 2.8
## Rum4   0.00  0.84 -0.23 0.58 0.42 1.1
## Rum5   0.70 -0.01 -0.08 0.42 0.58 1.0
## Rum6   0.20  0.64 -0.13 0.54 0.46 1.3
## Rum7   0.36 -0.05  0.51 0.53 0.47 1.8
## Rum8   0.15  0.37  0.25 0.42 0.58 2.1
## Rum9  -0.09  0.78  0.04 0.54 0.46 1.0
## Rum10 -0.01  0.54  0.28 0.50 0.50 1.5
## Rum11  0.06  0.02  0.75 0.63 0.37 1.0
## Rum12 -0.21 -0.05  0.82 0.52 0.48 1.1
## Rum13  0.67 -0.06 -0.03 0.39 0.61 1.0
## Rum14  0.03  0.74  0.05 0.61 0.39 1.0
## Rum15  0.80 -0.03 -0.03 0.58 0.42 1.0
## Rum16  0.45  0.33  0.08 0.59 0.41 1.9
## Rum17  0.79 -0.18  0.23 0.67 0.33 1.3
## Rum18  0.77 -0.02  0.03 0.59 0.41 1.0
## Rum19  0.52  0.34 -0.09 0.56 0.44 1.8
## Rum20  0.31  0.15  0.46 0.61 0.39 2.0
## Rum21  0.16 -0.11  0.73 0.59 0.41 1.1
## Rum22  0.31  0.28  0.27 0.51 0.49 3.0
## 
##                        RC1  RC3  RC2
## SS loadings           4.76 4.02 2.93
## Proportion Var        0.22 0.18 0.13
## Cumulative Var        0.22 0.40 0.53
## Proportion Explained  0.41 0.34 0.25
## Cumulative Proportion 0.41 0.75 1.00
## 
##  With component correlations of 
##      RC1  RC3  RC2
## RC1 1.00 0.68 0.52
## RC3 0.68 1.00 0.46
## RC2 0.52 0.46 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

In some cases, this improved our solution, bringing complexity below 2. In others, such as for item 3, it made the problem worse. But using this rotation, we only have a few items we'd need to drop if we wanted to ensure a simple solution - one in which each item loads significantly onto only 1 factor. Just for fun, let's see what happens if we use no rotation.

RRSpca_3no<-principal(RRS, nfactors = 3, rotate="none")
RRSpca_3no

## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   PC2   PC3   h2   u2 com
## Rum1  0.62  0.06 -0.26 0.45 0.55 1.4
## Rum2  0.53 -0.20  0.41 0.48 0.52 2.2
## Rum3  0.50 -0.35 -0.12 0.38 0.62 1.9
## Rum4  0.57 -0.47  0.21 0.58 0.42 2.2
## Rum5  0.57 -0.07 -0.30 0.42 0.58 1.6
## Rum6  0.64 -0.34  0.09 0.54 0.46 1.6
## Rum7  0.64  0.34 -0.01 0.53 0.47 1.5
## Rum8  0.64  0.02  0.13 0.42 0.58 1.1
## Rum9  0.62 -0.27  0.29 0.54 0.46 1.9
## Rum10 0.66 -0.02  0.25 0.50 0.50 1.3
## Rum11 0.61  0.48  0.19 0.63 0.37 2.1
## Rum12 0.35  0.55  0.30 0.52 0.48 2.3
## Rum13 0.55 -0.01 -0.29 0.39 0.61 1.5
## Rum14 0.70 -0.25  0.24 0.61 0.39 1.5
## Rum15 0.69 -0.03 -0.33 0.58 0.42 1.4
## Rum16 0.76 -0.09 -0.05 0.59 0.41 1.0
## Rum17 0.74  0.20 -0.30 0.67 0.33 1.5
## Rum18 0.71  0.01 -0.30 0.59 0.41 1.3
## Rum19 0.71 -0.20 -0.12 0.56 0.44 1.2
## Rum20 0.75  0.23  0.05 0.61 0.39 1.2
## Rum21 0.57  0.50  0.11 0.59 0.41 2.0
## Rum22 0.71  0.06  0.04 0.51 0.49 1.0
## 
##                        PC1  PC2  PC3
## SS loadings           8.87 1.69 1.14
## Proportion Var        0.40 0.08 0.05
## Cumulative Var        0.40 0.48 0.53
## Proportion Explained  0.76 0.14 0.10
## Cumulative Proportion 0.76 0.90 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

We still have items loading onto more than 1 factor, but not nearly as many as when we use varimax rotation. The promax rotation seems to be superior here, but using no rotation produces really similar results. Also, notice that our RMSR and goodness of fit have improved for the 3 factor solution. But if this were my brand new measure, and these were my pilot data, based on the eigenvalues, I'd probably opt for the 1 factor solution.
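
If you want to quantify that impression, here's a small sketch that counts how many items load at or above a cutoff (0.40 here, a judgment call) on more than one component in each solution:

cross_loaded <- function(pca_obj, cutoff = 0.40) {
  L <- unclass(pca_obj$loadings)     # loadings as a plain matrix
  sum(rowSums(abs(L) >= cutoff) > 1) # items exceeding the cutoff on 2+ components
}
sapply(list(varimax = RRSpca_3, promax = RRSpca_3ob, none = RRSpca_3no),
       cross_loaded)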

Hope you enjoyed this post! I still prefer CFA for this kind of analysis, and in my current job, rarely even use the PCA results provided by the Rasch program - we tend to use our content validation study results as support for dimensionality - but I have used PCA to analyze some survey data we collected. Each analysis technique can be useful in certain ways and for certain situations.