R for Quants, Part II (A)

Posted on February 16, 2012 by Brian Lee Yung Rowe in R bloggers | 0 Comments

[This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the second part in a three part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on basic statistical methods in R.

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

A. Dispatching Systems
B. Real World Development

Distributions

As a statistical programming language, R provides many functions for describing distributions. Standard distributions include normal, binomial, uniform, chi-squared, etc. Others are available from add-on packages.

Distribution functions follow a standard pattern in R, which makes it easy to use any of the available variants. The stats package, which is automatically loaded, provides functions for many common distributions. To see what is available, try ?Distributions.

Prefix	Description	Example
d*	density function (PDF)	`dnorm`
p*	distribution function (CDF)	`pnorm`
q*	quantile function (inverse CDF)	`qnorm`
r*	random generation	`rnorm`

Let’s look at a plot for each of these functions. Notice the usage of par and plot to generate each chart. When adding multiple charts to a plot, special parameters must be set to configure the layout. When you are done, it’s good practice to reset the configuration by calling par again with the old par object (as returned by the function previously).

opar <- par(mfrow=c(2,2), mar=c(3,5,1,1)+0.1)
x <- seq(-5,5, by=0.1)
y <- seq(0,1, by=0.02)

plot(x, dnorm(x), t='l')
plot(x, pnorm(x), t='l')
plot(y, qnorm(y), t='l')
plot(x, rnorm(x))
par(opar)

The graphics capabilities in R are quite extensive [1] but can be quite cumbersome to work with. There are numerous packages that attempt to improve the way graphics are handled in R including ggplot, lattice [3].

Exercise: Write a function plot_dists(name) that generates the above plot for the given distribution.

Analyzing Distributions

It is easy to spot check whether a given set of data fits a distribution by using a QQ plot. The qqnorm and qqline functions use a normal distribution as a default. For other distributions, use qqplot to explicitly specify the x data series.

h <- getPortfolioReturns(c('AAPL','XOM','KO','F','GS'), 100)
qqplot(h$XOM)
qqline(h$XOM)

As expected, the tails do not fit a normal distribution. To align the theoretical quantiles with the sample quantiles, you have two approaches: normalize the sample data or use a different set of theoretical quantiles. [2]

z <- (h$XOM - mean(h$XOM)) / sd(as.vector(h$XOM))
qqnorm(z)
qqline(z)

Exercise: Generate a QQ plot using a better fit for the theoretical distribution. Hint: Try using rnorm.

Parameter Estimation

To fit data to a particular distribution, you can use any of the numerous optimization functions (optim, nlm, optimize, etc) to minimize your log likelihood function. Here is a contrived example of estimating the mean and variance of one of our stocks.

get_estimator <- function(x)
{
  n <- length(x)
  function(theta)
  {
    n/2*log(2*pi) + n/2*log(theta[2]) + 1/(2*theta[2]) * sum((x-theta[1])^2)
  }
}

Let's execute the function and compare with the built-in functions.

> f <- get_estimator(h$XOM)
> optim(c(0,1), f)$par
[1] 0.0020516656 0.0002079123
> mean(h$XOM)
[1] 0.002052150
> sd(h$XOM)^2
         XOM
0.0002100090

Note the use of a closure in this example. Once we have the closure, it is easy to pass the closure to other optimization functions for experimentation. We will discuss optimization in more detail in the next section.

Sampling Data

R provides many methods for sampling data from an existing data set, depending on what you want to accomplish. The simplest is sample, which selects entries from the specified vector. By default, the probability of selecting a specific element is equally-weighted over the elements. For example, this is appropriate when drawing cards from a deck.

> cards <- sapply(c('H','S','C','D'), function(x) paste(x, 1:13, sep=''))
> sample(cards, 5)
[1] "C12" "S6" "H10" "S12" "D7"

To bootstrap a data set, set replace=TRUE, which will reuse all elements for each sample selection. This option is appropriate when flipping a coin multiple times.

> sample(c('H','T'), 10, replace=TRUE)
 [1] "T" "H" "H" "H" "H" "T" "T" "T" "H" "H"

For more sophisticated bootstrapping, see the boot package [4].

Covariance and Correlation

Not surprisingly, there are built-in functions for calculating the covariance (cov) and correlation (cor) between data. You can even convert between them with cov2cor.

> cov(h$XOM, h$F)

Exercise: For portfolio h, write a function to calculate the covariance matrix without using cov.

Regression

Linear regression is performed by using the lm function, which fits data to linear models. We can use it to estimate the parameters of the CAPM for our toy portfolio. Note we assume the risk free rate is 0 to simplify the example.

> ws <- t(sapply(1:100, function(x) rdirichlet(rep(1,5)) ))
> p <- apply(h * ws, 1, sum)
> m <- getPortfolioReturns('^GSPC',100)
> m$portfolio <- p
> colnames(m) <- c('market','portfolio')
> out <- lm(portfolio ~ market, m)
> summary(out)

In this example we are generating random weights for each day to calculate our portfolio returns. Afterward, we download our returns for our market and run the regression.

> plot(as.vector(m$market), as.vector(m$portfolio))
> abline(out)

A good discussion of doing regression analysis in R is in [5]. Other forms of regression analysis are supported as well but is out of scope for this document.