Visualizing principal components with R and Sochi Olympic Athletes

March 27, 2014
(This article was first published on Heuristic Andrew, and kindly contributed to R-bloggers)

Principal component analysis (PCA) is a dimensionality reduction method. Here we explain PCA step by step using data about curlers at the Sochi Olympics.

It is hard to visualize a high-dimensional space. When I took linear algebra, the book and teachers spoke as if it were easy to visualize a hyperspace, but later, when I took the Coursera course Neural Networks for Machine Learning, Geoffrey Hinton gave this wise advice: "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it." In other words, people cannot visualize a high-dimensional space, so we use a simpler problem, two dimensions of Olympic athlete data, to explain PCA.

First, we have one-dimensional data where the only dimension is the curler's height.

Next, we add a second dimension: the curler's weight. Notice there is a strong correlation between height and weight. Because of this redundancy, two dimensions are not necessary to represent most of the information.
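As a rough check of that redundancy, the correlation can be computed directly. This is a minimal sketch that rebuilds the curler data inline (the same values as in the R code at the end of this post) rather than reading the CSV. The variable name `r` is only illustrative.

```r
# Curler heights (m) and weights (kg), copied from the inline data in this post
ath <- data.frame(
  height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93, 1.7, 1.69, 1.84, 1.75,
             1.83, 1.8, 1.8, 1.64),
  weight = c(66, 84, 74, 66, 73, 80, 58, 60, 88, 85, 80, 71, 85, 69)
)

# Pearson correlation between the two dimensions
r <- cor(ath$height, ath$weight)
print(r)
```

A value near 1 would mean weight is almost fully predictable from height. The closer the value is to 0, the less a single dimension can stand in for both.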

By the way, if you look carefully at the first two images, you will see that the horizontal placement of the curlers is identical: adding the second axis moves the curlers only vertically.

After performing PCA, there are two principal components. Because we want to reduce two dimensions to one, we ignore the second principal component and project the data onto the first component, shown as red squares. The black lines join each original point (green) to its projection (red) onto the one-dimensional line.

The blue line illustrates the first principal component. It is onto this one-dimensional line that the two-dimensional space is projected.
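To make the projection concrete, here is a hedged sketch that computes it by hand: center the data, take the leading eigenvector of the covariance matrix as the direction of the first component, and project each point onto it. It then compares the result with a reconstruction from prcomp. The variable names (`v1`, `scores`, `projected`, `from_prcomp`) are illustrative and not part of the original code.

```r
# Curler heights (m) and weights (kg), copied from the inline data in this post
ath <- data.frame(
  height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93, 1.7, 1.69, 1.84, 1.75,
             1.83, 1.8, 1.8, 1.64),
  weight = c(66, 84, 74, 66, 73, 80, 58, 60, 88, 85, 80, 71, 85, 69)
)
x <- as.matrix(ath)

# Center each column at its mean
centered <- sweep(x, 2, colMeans(x))

# Direction of the first principal component: leading eigenvector of the covariance
v1 <- eigen(cov(x))$vectors[, 1]

# Coordinate of each point along that direction (one number per curler)
scores <- centered %*% v1

# Map the one-dimensional scores back into height/weight space, then un-center
projected <- sweep(scores %*% t(v1), 2, colMeans(x), "+")

# The same reconstruction from prcomp: scores times the first column of rotation
pc <- prcomp(x, center = TRUE, scale. = FALSE)
from_prcomp <- sweep(pc$x[, 1] %*% t(pc$rotation[, 1]), 2, pc$center, "+")

# Both routes should agree up to floating-point error
print(all.equal(unname(projected), unname(from_prcomp)))
```

The sign of an eigenvector is arbitrary, but the projected points do not depend on it, because the sign appears twice in `scores %*% t(v1)` and cancels.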

Now we can show the same projections from the previous graph on their own one-dimensional strip chart, which captures most of the variation of the two-dimensional space in one dimension.
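How much is "most"? One way to check, sketched here under the same settings as the prcomp call in this post (centered, not rescaled), is to compare the variance of each component with the total. The name `var_share` is illustrative.

```r
# Curler heights (m) and weights (kg), copied from the inline data in this post
ath <- data.frame(
  height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93, 1.7, 1.69, 1.84, 1.75,
             1.83, 1.8, 1.8, 1.64),
  weight = c(66, 84, 74, 66, 73, 80, 58, 60, 88, 85, 80, 71, 85, 69)
)

# PCA on the centered, unscaled data
pc <- prcomp(as.matrix(ath), center = TRUE, scale. = FALSE)

# sdev holds the standard deviation of each component; square it to get variances
var_share <- pc$sdev^2 / sum(pc$sdev^2)
print(var_share)
```

Because the data are not rescaled, weight in kilograms varies numerically far more than height in meters, so the first component is largely driven by weight here. Standardizing the columns first would give a different split.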

In general, PCA reduces the number of dimensions by projecting high-dimensional data onto a lower-dimensional space. With higher-dimensional data, it is often useful to keep more of the principal components. For graphing, two or three principal components are retained. For other purposes, the number of components may be chosen using a scree plot or as the minimum number of components that captures some percentage of the variation, say 90%.
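With only two dimensions the choice is trivial, so the following sketch uses USArrests, a four-variable dataset that ships with R, purely as a stand-in. It is not part of the original analysis, and standardizing the variables (scale. = TRUE) is an assumption made for this example. It shows the cumulative-variance rule with a 90% threshold and draws a scree plot.

```r
# USArrests ships with R; it is used here only as a stand-in with four variables
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total
var_share <- pc$sdev^2 / sum(pc$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share reaches 90%
k <- which(cum_share >= 0.90)[1]
print(round(cum_share, 3))
print(k)

# Scree plot of the component variances (drawn on the current graphics device)
screeplot(pc, type = "lines", main = "Scree plot (USArrests, illustrative)")
```

The threshold of 0.90 is only the example figure mentioned above. In a scree plot, a common heuristic is to look for the "elbow" where the variances level off.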

Here is the R code.



# Read data from CSV
# Download from http://www.danasilver.org/static/assets/sochi-2014-athletes/athletes.csv
# See below for faster option.
athletes <- read.csv('athletes.csv')

# Subset data
ath <- athletes[athletes$sport=='Curling',c('height','weight')]
ath <- ath[complete.cases(ath),]

# ALTERNATIVELY instead of downloading
ath <- structure(list(height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93,
1.7, 1.69, 1.84, 1.75, 1.83, 1.8, 1.8, 1.64), weight = c(66L,
84L, 74L, 66L, 73L, 80L, 58L, 60L, 88L, 85L, 80L, 71L, 85L, 69L
)), .Names = c("height", "weight"), row.names = c(536L, 624L,
640L, 820L, 930L, 949L, 1191L, 1632L, 1818L, 2349L, 2583L, 2609L,
2641L, 2696L), class = "data.frame")

# Plot 1 Dimension (just height)
png('pca1-stripchart.png')
stripchart(ath$height, col="green", pch=19, cex=2,
           xlab="Height (m)",
           main="Curlers at Sochi 2014 Winter Olympics")
dev.off()

# Plot 2 Dimensions
x <- as.matrix(ath)
plot2d <- function(col=3)
{
  plot (x, asp = 0, col = col, pch = 19, cex = 2,
        xlab="Height (m)",
        ylab="Weight (kg)",
        main="Curlers at Sochi 2014 Winter Olympics")
}
png('pca2-scatterplot.png')
plot2d()
dev.off()

# Perform PCA
pcX <- prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE)

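# Transform points: project onto the first principal component
# (scores times the first column of the rotation matrix), then add the center back
transformed <- pcX$x [,1] %*% t (pcX$rotation [,1])
transformed <- scale (transformed, center = -pcX$center, scale = FALSE)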

# Plot PCA projection
plot_pca <- function()
{
  plot2d()
  points (transformed, col = 2, pch = 15, cex = 2)
  segments (x [,1],x [,2], transformed [,1], transformed [,2])
}
png('pca3-pca-projection.png')
plot_pca()
dev.off()

# Draw first principal component over scatterplot
png('pca4-first-component-on-scatterplot.png')
plot_pca()
# Fit a line through the projected points; they all lie on the first component
pc1.line <- lm(transformed[,2] ~ transformed[,1])
abline(pc1.line, col="blue", lwd=1.5)
dev.off()

# Plot first principal component by itself
png('pca5-first-component-stripchart.png')
stripchart(pcX$x[,1], col="red", cex=2, pch=15,
           xlab="First principal component")
dev.off()

This was tested on R 3.0.2 (64-bit). Thank you to Dana Silver for the Sochi athlete data and to cbeleites for explaining how to plot PCA projections with line segments.
