Visualizing Principal Components

December 22, 2012
By

(This article was first published on Systematic Investor » R, and kindly contributed to R-bloggers)

Principal Component Analysis (PCA) is a procedure that converts observations into linearly uncorrelated variables called principal components (Wikipedia). The PCA is a useful descriptive tool to examine your data. Today I will show how to find and visualize Principal Components.

Let’s look at the components of the Dow Jones Industrial Average index over 2012. First, I will download the historical prices and sector infromation for all components of the Dow Jones Industrial Average index.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Find Sectors for each company in DOW 30
	#****************************************************************** 
	tickers = spl('XLY,XLP,XLE,XLF,XLV,XLI,XLB,XLK,XLU')
	tickers.desc = spl('ConsumerCyclicals,ConsumerStaples,Energy,Financials,HealthCare,Industrials,Materials,Technology,Utilities')
	
	sector.map = c()
	for(i in 1:len(tickers)) {
		sector.map = rbind(sector.map, 
				cbind(sector.spdr.components(tickers[i]), tickers.desc[i])
			)
	}
	colnames(sector.map) = spl('ticker,sector')

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')	
	tickers = dow.jones.components()
	
	sectors = factor(sector.map[ match(tickers, sector.map[,'ticker']), 'sector'])
		names(sectors) = tickers
	
	data <- new.env()
	getSymbols(tickers, src = 'yahoo', from = '2000-01-01', env = data, auto.assign = T)
		for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)	
	
	bt.prep(data, align='keep.all', dates='2012')
	
	# re-order sectors, because bt.prep can change the order of tickers
	sectors = sectors[data$symbolnames]
	
	# save data for later examples
	save(data, tickers, sectors, file='bt.pca.test.Rdata')

Next, let’s run the Principal Component Analysis (PCA) on the companies returns during 2012 and plot percentage of variance explained for each principal component.

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = data$prices	
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret))
	
	loadings = p$loadings[]
	p.variance.explained = p$sdev^2 / sum(p$sdev^2)

	# plot percentage of variance explained for each principal component	
	barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')

plot1.png.small

The first principal component, usually it is market returns, explains around 45% of variance during 2012.

Next let’s plot all companies loadings on the first and second principal components and highlight points according to the sector they belong.

	#*****************************************************************
	# 2-D Plot
	#****************************************************************** 		
	x = loadings[,1]
	y = loadings[,2]
	z = loadings[,3]
	cols = as.double(sectors)
	
	# plot all companies loadings on the first and second principal components and highlight points according to the sector they belong
	plot(x, y, type='p', pch=20, col=cols, xlab='Comp.1', ylab='Comp.2')
	text(x, y, data$symbolnames, col=cols, cex=.8, pos=4)
	
	legend('topright', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n') 

plot2.png.small

Please notice that the companies in the same sector tend to group together on the plot.

Next, let’s go one step further and create a 3D plot using first, second, and third principal components

	#*****************************************************************
	# 3-D Plot, for good examples of 3D plots
	# http://statmethods.wordpress.com/2012/01/30/getting-fancy-with-3-d-scatterplots/
	#****************************************************************** 				
	load.packages('scatterplot3d') 
	
	# plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong
	s3d = scatterplot3d(x, y, z, xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', color=cols, pch = 20)
		
	s3d.coords = s3d$xyz.convert(x, y, z)
	text(s3d.coords$x, s3d.coords$y, labels=data$symbolnames, col=cols, cex=.8, pos=4)
		
	legend('topleft', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n') 

plot3.png.small

The 3D chart does add a bit of extra info, but I find the 2D chart easier to look at.

In the next post, I will demonstrate clustering based on the selected Principal components and after that I want to explore the interesting discussion presented in the using PCA for spread trading post.

Happy Holidays

To view the complete source code for this example, please have a look at the bt.pca.test() function in bt.test.r at github.


To leave a comment for the author, please follow the link and comment on his blog: Systematic Investor » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.