Big-Data PCA: 50 years of stock data

June 17, 2011

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In this post, Revolution engineer Sherry LaMonica shows us how to use the RevoScaleR big-data package in Revolution R Enterprise to do principal components analysis on 50 years of stock market data — ed.

Principal components analysis, or PCA, seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible and subsequent axes are chosen to maximize variance while maintaining orthogonality with previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use the RevoScaleR function rxCovCor with the standard R function princomp.
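To make the equivalence of the two routes concrete, here is a minimal base-R sketch on synthetic data. The data set and every variable name in it are assumptions made for illustration; nothing here comes from the stock data or from RevoScaleR.

```r
# Illustrative only: synthetic data standing in for a small numeric data set.
set.seed(42)
n <- 200
base <- rnorm(n)
x <- cbind(a = base + rnorm(n, sd = 0.1),
           b = base + rnorm(n, sd = 0.2),
           c = rnorm(n))

# Route 1: eigenvalue decomposition of the correlation matrix.
eig <- eigen(cor(x), symmetric = TRUE)
sdevEigen <- sqrt(eig$values)

# Route 2: singular value decomposition of the standardized data matrix.
z <- scale(x)                   # center and scale each column (divisor n - 1)
sv <- svd(z)
sdevSvd <- sv$d / sqrt(n - 1)   # singular values -> component standard deviations

# The two routes agree up to floating-point error and the sign of each axis.
all.equal(sdevEigen, sdevSvd)
all.equal(abs(eig$vectors), abs(sv$v), check.attributes = FALSE)
```

The eigenvalue route needs only the correlation matrix, not the raw rows, which is what makes it attractive when the data do not fit in memory.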

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 is available from InfoChimps. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data (a set of 26 comma-separated text files, one for each letter of the alphabet) into a single .xdf file, NYSE_daily_prices.xdf:

nyseDataDir <- "C:/Users/Sherry/Downloads/NYSE"
dataSourceName <- file.path(nyseDataDir, "NYSE_daily_prices")
dataFileName <- "NYSE_daily_prices.xdf"
# The first file creates the .xdf file; every later file appends rows to it.
append <- "none"
for (i in LETTERS)
{
       importFile <- paste(dataSourceName, "_", i, ".csv", sep="")
       rxTextToXdf(importFile, dataFileName, stringsAsFactors=TRUE,
                   append=append)
       append <- "rows"
}
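The loop above follows a common pattern: the first file creates the output and each later file appends to it. For readers without RevoScaleR, the same pattern can be sketched in base R with plain CSV files. The temporary directory, file names, and columns below are made up for the example, and a CSV file is only a stand-in for an .xdf file.

```r
# Illustrative only: three tiny CSV files in a temporary directory stand in
# for the 26 NYSE files; all file names here are hypothetical.
tmpDir <- tempfile("nyse_sketch_")
dir.create(tmpDir)
for (i in LETTERS[1:3]) {
  part <- data.frame(stock_symbol = paste0(i, "AA"),
                     stock_price_close = c(10, 11))
  write.csv(part, file.path(tmpDir, paste0("prices_", i, ".csv")),
            row.names = FALSE)
}

combinedFile <- file.path(tmpDir, "prices_all.csv")
appendRows <- FALSE                  # first pass creates the file
for (i in LETTERS[1:3]) {
  part <- read.csv(file.path(tmpDir, paste0("prices_", i, ".csv")))
  write.table(part, combinedFile, sep = ",", row.names = FALSE,
              col.names = !appendRows, append = appendRows)
  appendRows <- TRUE                 # later passes append rows
}

combined <- read.csv(combinedFile)
nrow(combined)
```

Only one input file is held in memory at a time, so the combined file can grow well beyond what a single read would allow.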

The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks:

> rxGetInfoXdf(dataFileName)
File name: NYSE_daily_prices.xdf
Number of observations: 9211031
Number of variables: 9
Number of blocks: 34 

We will use the rxCor function to calculate the Pearson correlation matrix for the specified variables, and pass the result to the princomp function:

stockCor <- rxCor(~ stock_price_open + stock_price_high +
                    stock_price_low + stock_price_close +
                    stock_price_adj_close, data="NYSE_daily_prices.xdf")
stockPca <- princomp(covmat=stockCor)

This yields the following output:

> summary(stockPca)
Importance of components:
                          Comp.1    Comp.2      Comp.3       Comp.4
Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
                             Comp.5
Standard deviation     1.838470e-02
Proportion of Variance 6.759946e-05
Cumulative Proportion  1.000000e+00
> loadings(stockPca)
                      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
stock_price_open      -0.470 -0.166  0.867
stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
stock_price_close     -0.477 -0.149 -0.305 -0.811
stock_price_adj_close -0.309  0.951

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0
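The correlation-matrix route scales to large files because a correlation matrix can be accumulated one block of rows at a time. The sketch below shows one way to do that in base R. It is not a description of how rxCor is implemented, which the post does not cover; chunkedCor is a hypothetical helper, and the synthetic matrix stands in for the blocks of an .xdf file.

```r
# Illustrative only: chunkedCor() is a hypothetical helper, not a RevoScaleR function.
# It accumulates the row count, column sums, and cross-products block by block.
# Note: this simple formula can lose precision when means are large relative
# to the spread of the data.
chunkedCor <- function(chunks) {
  n <- 0
  sums <- 0
  crossprods <- 0
  for (chunk in chunks) {
    chunk <- as.matrix(chunk)
    n <- n + nrow(chunk)
    sums <- sums + colSums(chunk)
    crossprods <- crossprods + crossprod(chunk)   # t(chunk) %*% chunk
  }
  # Covariance from the accumulated statistics, then rescale to correlation.
  covMat <- (crossprods - tcrossprod(sums) / n) / (n - 1)
  cov2cor(covMat)
}

set.seed(1)
prices <- matrix(rnorm(5000), ncol = 5,
                 dimnames = list(NULL, paste0("v", 1:5)))
prices[, 2] <- prices[, 1] + 0.1 * prices[, 2]   # two strongly correlated columns

# Split the rows into 4 blocks and process them one at a time.
blockId <- rep(1:4, length.out = nrow(prices))
chunks <- lapply(split(seq_len(nrow(prices)), blockId),
                 function(rows) prices[rows, , drop = FALSE])

corChunked <- chunkedCor(chunks)
pcaChunked <- princomp(covmat = corChunked)

# Same result as computing the correlation matrix in a single pass.
all.equal(corChunked, cor(prices))
```

Because princomp accepts a ready-made matrix through covmat, the PCA step itself never touches the raw rows.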

The default plot method for objects of class princomp is a screeplot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

> plot(stockPca)

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.
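The 99% figure can be checked directly from the standard deviations reported by summary(stockPca). The short sketch below copies those values by hand; the 0.99 threshold and the variable names are assumptions chosen for illustration.

```r
# Standard deviations copied from the summary(stockPca) output in this post.
sdev <- c(Comp.1 = 2.0756631, Comp.2 = 0.8063270, Comp.3 = 0.197632281,
          Comp.4 = 0.0454173922, Comp.5 = 0.01838470)

variances <- sdev^2                     # component variances (eigenvalues)
propVar <- variances / sum(variances)   # share of total variance per component
cumVar <- cumsum(propVar)               # running total of those shares

round(propVar, 4)
round(cumVar, 4)

# Smallest number of components whose cumulative share reaches the threshold.
threshold <- 0.99                       # assumed cut-off for this example
nKeep <- which(cumVar >= threshold)[1]
nKeep
```

Since the analysis uses a correlation matrix of five variables, the variances sum to about 5, and each proportion is simply the variance divided by that total.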
