How to perform PCA on R

[This article was first published on R – CoolStatsBlog, and kindly contributed to R-bloggers.]

This is a practical tutorial on performing PCA in R. If you would like to understand how PCA works, please see my plain English explainer here.

Reminder: Principal Component Analysis (PCA) is a method used to reduce the number of variables in a dataset.

We are using R’s USArrests dataset, which records the following 1973 figures for each US state:

  1. arrests per 100,000 residents for murder
  2. arrests per 100,000 residents for rape
  3. arrests per 100,000 residents for assault
  4. % of the population that is urban

[Image: the USArrests data]
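
To get a feel for the data before running PCA, it can help to inspect it first. A minimal sketch, assuming nothing beyond base R (USArrests ships with the datasets package, which is loaded by default):

```r
# USArrests ships with base R (package 'datasets'), so no download is needed.
data(USArrests)

# One row per state, one column per variable.
dim(USArrests)      # number of rows and columns
names(USArrests)    # column names: Murder, Assault, UrbanPop, Rape
head(USArrests)     # first six states
```

Note that the columns are stored in the order Murder, Assault, UrbanPop, Rape, which differs from the order of the list above.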

Now, we will simplify the data into two-variable data. This does not mean that we are eliminating two variables and keeping two; it means that we are replacing the four variables with two brand new ones called “principal components”.

This time we will use R’s princomp function to perform PCA.

Preamble: princomp lives in the stats package, which ships with R and is loaded by default, so there is nothing extra to install.

Step 1: Standardize the data. You may skip this step if you would rather use princomp’s inbuilt standardization tool*.

Step 2: Run pca = princomp(USArrests, cor = TRUE) if your data needs standardizing, or pca = princomp(USArrests) if you have already standardized it.
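
As a rough sketch of Step 2 on the unstandardized data, the call and two quick ways of inspecting the result might look like this (the object name pca follows the text above; everything else is base R):

```r
# Correlation-based PCA: equivalent to running PCA on z-scored columns.
pca <- princomp(USArrests, cor = TRUE)

print(pca)        # standard deviations of the four components
pca$loadings      # weights that build each component from the original variables
```

The loadings are worth a glance at this stage: they show how much each original variable contributes to each new component.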

Step 3: Now that R has computed 4 new variables (“principal components”), you can choose the two (or one, or three) principal components with the highest variances.

Run summary(pca) to see how much variance each component accounts for. The output will look like this:
[Image: output of summary(pca)]

As you can see, the components are listed in order of decreasing variance, so principal components 1 and 2 have the highest standard deviation (and hence variance); those are the two we should use.
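
The proportions that summary(pca) reports can also be computed by hand from the standard deviations. A small sketch, where variances and prop_var are illustrative names rather than anything princomp defines:

```r
pca <- princomp(USArrests, cor = TRUE)

# The variance of each component is the square of its standard deviation.
variances <- pca$sdev^2
prop_var  <- variances / sum(variances)

round(prop_var, 3)          # share of total variance per component
round(cumsum(prop_var), 3)  # running total; useful for deciding how many to keep

# A scree plot shows the same information graphically.
screeplot(pca, type = "lines")
```

The cumulative share is usually the more useful number when deciding whether two components are enough or whether a third is worth keeping.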

Step 4: Finally, to obtain the actual principal component coordinates (“scores”) for each state, run pca$scores:
[Image: output of pca$scores]
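
Since we only want two components, one option is to keep just the first two columns of the scores. A minimal sketch (scores2 is an illustrative name):

```r
pca <- princomp(USArrests, cor = TRUE)

# Keep only the first two components: one row per state, two columns.
scores2 <- pca$scores[, 1:2]
head(scores2)

# Plot the states in the reduced two-dimensional space, labelled by name.
plot(scores2, type = "n", xlab = "Comp.1", ylab = "Comp.2")
text(scores2, labels = rownames(scores2), cex = 0.7)
```

These two columns are the “two brand new variables” that replace the original four.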

Step 5: To produce the biplot, a visualization of the principal components against the original variables, run biplot(pca):
[Image: biplot of the first two principal components]

The closeness of the Murder, Assault and Rape arrows indicates that these three types of crime are correlated, as one would intuitively expect. There is also some correlation between urbanization and the rate of rape; the urbanization-murder correlation is weaker.
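
One way to check this reading of the biplot is to compare it with the plain correlation matrix. A short sketch, using only base R functions (cors is an illustrative name):

```r
pca <- princomp(USArrests, cor = TRUE)

# Biplot: states as labelled points (scores), original variables as arrows.
biplot(pca, cex = 0.7)

# Cross-check the visual impression against the actual correlations.
cors <- cor(USArrests)
round(cors, 2)
```

Arrows that point in similar directions should correspond to large positive entries in the correlation matrix, keeping in mind that the biplot only shows the first two components.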

*With cor = TRUE, princomp works from the correlation matrix, which is equivalent to turning your data into z-scores (i.e. subtract the mean, then divide by the standard deviation). But in doing so, one is not just putting the variables on a common footing, but also rescaling each of them to unit variance. I do not see the need for that, so I choose to manually translate the data onto a standard range of [0,1] using the equation:

\frac{x_{i}-x_{min}}{x_{max}-x_{min}}
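
A minimal sketch of this manual route, assuming a hypothetical helper called minmax (not part of base R) applied to each column before calling princomp without cor = TRUE:

```r
# Hypothetical helper: rescale a numeric vector onto [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply it column by column; assigning into scaled[] keeps the state names.
scaled <- USArrests
scaled[] <- lapply(USArrests, minmax)

sapply(scaled, range)   # every column now runs from 0 to 1

# The data is already on a common scale, so the default cor = FALSE is used.
pca_minmax <- princomp(scaled)
summary(pca_minmax)
```

The results will not match the cor = TRUE output exactly, because the columns keep different variances after min-max scaling.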

Abbas Keshvani

