**Systematic Investor » R**, and kindly contributed to R-bloggers)

Principal Component Analysis (PCA) is a procedure that converts observations into linearly uncorrelated variables called principal components (Wikipedia). The PCA is a useful descriptive tool to examine your data. Today I will show how to find and visualize Principal Components.

Let’s look at the components of the Dow Jones Industrial Average index over 2012. First, I will download the historical prices and sector infromation for all components of the Dow Jones Industrial Average index.

############################################################################### # Load Systematic Investor Toolbox (SIT) # http://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Find Sectors for each company in DOW 30 #****************************************************************** tickers = spl('XLY,XLP,XLE,XLF,XLV,XLI,XLB,XLK,XLU') tickers.desc = spl('ConsumerCyclicals,ConsumerStaples,Energy,Financials,HealthCare,Industrials,Materials,Technology,Utilities') sector.map = c() for(i in 1:len(tickers)) { sector.map = rbind(sector.map, cbind(sector.spdr.components(tickers[i]), tickers.desc[i]) ) } colnames(sector.map) = spl('ticker,sector') #***************************************************************** # Load historical data #****************************************************************** load.packages('quantmod') tickers = dow.jones.components() sectors = factor(sector.map[ match(tickers, sector.map[,'ticker']), 'sector']) names(sectors) = tickers data <- new.env() getSymbols(tickers, src = 'yahoo', from = '2000-01-01', env = data, auto.assign = T) for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T) bt.prep(data, align='keep.all', dates='2012') # re-order sectors, because bt.prep can change the order of tickers sectors = sectors[data$symbolnames] # save data for later examples save(data, tickers, sectors, file='bt.pca.test.Rdata')

Next, let’s run the Principal Component Analysis (PCA) on the companies returns during 2012 and plot percentage of variance explained for each principal component.

#***************************************************************** # Principal component analysis (PCA), for interesting discussion # http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html #****************************************************************** prices = data$prices ret = prices / mlag(prices) - 1 p = princomp(na.omit(ret)) loadings = p$loadings[] p.variance.explained = p$sdev^2 / sum(p$sdev^2) # plot percentage of variance explained for each principal component barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')

The first principal component, usually it is market returns, explains around 45% of variance during 2012.

Next let’s plot all companies loadings on the first and second principal components and highlight points according to the sector they belong.

#***************************************************************** # 2-D Plot #****************************************************************** x = loadings[,1] y = loadings[,2] z = loadings[,3] cols = as.double(sectors) # plot all companies loadings on the first and second principal components and highlight points according to the sector they belong plot(x, y, type='p', pch=20, col=cols, xlab='Comp.1', ylab='Comp.2') text(x, y, data$symbolnames, col=cols, cex=.8, pos=4) legend('topright', cex=.8, legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

Please notice that the companies in the same sector tend to group together on the plot.

Next, let’s go one step further and create a 3D plot using first, second, and third principal components

#***************************************************************** # 3-D Plot, for good examples of 3D plots # http://statmethods.wordpress.com/2012/01/30/getting-fancy-with-3-d-scatterplots/ #****************************************************************** load.packages('scatterplot3d') # plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong s3d = scatterplot3d(x, y, z, xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', color=cols, pch = 20) s3d.coords = s3d$xyz.convert(x, y, z) text(s3d.coords$x, s3d.coords$y, labels=data$symbolnames, col=cols, cex=.8, pos=4) legend('topleft', cex=.8, legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

The 3D chart does add a bit of extra info, but I find the 2D chart easier to look at.

In the next post, I will demonstrate clustering based on the selected Principal components and after that I want to explore the interesting discussion presented in the using PCA for spread trading post.

Happy Holidays

To view the complete source code for this example, please have a look at the bt.pca.test() function in bt.test.r at github.

**leave a comment**for the author, please follow the link and comment on their blog:

**Systematic Investor » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...