Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Yuri Fonseca

The idea of this post is to give an empirical example of how Principal Component Analysis (PCA) can be applied in Finance, especially in the Fixed Income Market. Principal components are very useful to reduce data dimensionality and give a joint interpretation to a group of variables. For example, one could use it to try to extract a common trend from several variables.

This text is divided in two parts, the first is a fast review about the math behind PCA, and the second is an empirical example with American bonds.

## What is PCA?

The PCA is based on the eigenvectors and eigenvalues of the matrix  $V_{kxk} = X'X/T$, which is the covariance matrix of $X_{Txk}$. The idea is to apply a linear transformation on $V$ in such a way that in the new space, the first component is responsible for most of the variance of the data, the second component should be orthogonal to the first component and explain the maximum of the remaining variance of the data, and so on until all variance is explained. Therefore PCA is an orthogonal transformation and it is also commonly named as Singular Value Decomposition (SVD). The columns of the matrix responsible for this transformation are called factor loadings, and their eigenvalues are the variance of each principal component.

One of the most important applications of PCA is in dimensionality reduction.Given that the first components explain almost all the variance of the data, one could discard dimensions after some criteria being achieved, commonly a threshold in the explained variance. We will see in the empirical example that only three dimensions are enough to deal with an original thirteen dimension problem (thirteen American bonds). It’s important to state that PCA is highly sensitive to data scale, and normally scaling and centring the variables is important.

## Application in Fixed Income Market

The data used in this example can be downloaded  here. It consists of annualized zero-coupon from US using monthly data from 1944 to 1992. This is exactly the same data exposed in Alexander (2001) in the book Market Models. The maturities are 1, 3, 6, 9, 12, 18 months and 2, 3, 4, 5, 7, 10, 15 years.

yields = as.matrix(read.csv('yields.csv', dec = ',', sep = ';'))
colnames(yields) <- c(paste(c(1, 3, 6, 9, 12, 18), 'mth'), paste(c(2, 3, 4, 5, 7, 10, 15), 'yr'))

matplot(yields, type='l', ylim = c(0,18),ylab = 'rate', xlab = 'Time')


Although yields have a strong correlation structure, they are not good for modelling because they are not stationary, so we need to proceed looking for what is called an invariant of the market. Roughly speaking, we need some sort of iid properties, and yields don’t have it. Following Meucci (2005), taking the first difference is good enough. For simplicity, we will call $R_{t,k} = Yield_{i,k} - Yield_{i-1,k}$ as the yield return of maturity $k$ at time $t$.

returns = diff(yields, 1)

plot(returns[,'1 mth'], main = 'yield return 1 month and 2 years', type = 'l', ylab = 'Return')
lines(returns[,'2 yr'], col = 2)


Now we can check for some sort of correlation in the yield return matrix. This correlation is quite intuitive, if the bond with one month maturity starts to pay less, you expect some kind of impact in the bond with two months maturity. The bigger the difference between maturities, the lower the correlation, also it’s a good check, because if we can’t see correlation in the data, PCA won’t help us to reduce dimensionality.

options(digits = 2) # just for better visualisation
cor(returns) # correlation matrix of yield returns

##        1 mth 3 mth 6 mth 9 mth 12 mth 18 mth 2 yr 3 yr 4 yr 5 yr 7 yr
## 1 mth   1.00  0.79  0.73  0.69   0.66   0.63 0.60 0.54 0.49 0.48 0.44
## 3 mth   0.79  1.00  0.93  0.89   0.84   0.81 0.77 0.71 0.66 0.63 0.58
## 6 mth   0.73  0.93  1.00  0.97   0.93   0.91 0.88 0.82 0.77 0.75 0.69
## 9 mth   0.69  0.89  0.97  1.00   0.99   0.97 0.94 0.89 0.85 0.82 0.77
## 12 mth  0.66  0.84  0.93  0.99   1.00   0.98 0.94 0.91 0.86 0.84 0.78
## 18 mth  0.63  0.81  0.91  0.97   0.98   1.00 0.99 0.96 0.92 0.90 0.85
## 2 yr    0.60  0.77  0.88  0.94   0.94   0.99 1.00 0.97 0.93 0.92 0.87
## 3 yr    0.54  0.71  0.82  0.89   0.91   0.96 0.97 1.00 0.99 0.98 0.93
## 4 yr    0.49  0.66  0.77  0.85   0.86   0.92 0.93 0.99 1.00 0.99 0.94
## 5 yr    0.48  0.63  0.75  0.82   0.84   0.90 0.92 0.98 0.99 1.00 0.98
## 7 yr    0.44  0.58  0.69  0.77   0.78   0.85 0.87 0.93 0.94 0.98 1.00
## 10 yr   0.39  0.53  0.65  0.72   0.74   0.81 0.83 0.88 0.90 0.94 0.97
## 15 yr   0.31  0.45  0.56  0.62   0.64   0.70 0.72 0.77 0.77 0.81 0.84
##        10 yr 15 yr
## 1 mth   0.39  0.31
## 3 mth   0.53  0.45
## 6 mth   0.65  0.56
## 9 mth   0.72  0.62
## 12 mth  0.74  0.64
## 18 mth  0.81  0.70
## 2 yr    0.83  0.72
## 3 yr    0.88  0.77
## 4 yr    0.90  0.77
## 5 yr    0.94  0.81
## 7 yr    0.97  0.84
## 10 yr   1.00  0.94
## 15 yr   0.94  1.00


Using the function prcomp from package stats we can perform the PCA on the returns. Although scale and center are set equals TRUE by default, it’s important to remember that PCA is sensitive to the scale of the data, so it’s necessary to standardize the variables.

Once we have components in a decreasing other of explained variance, we can project the new matrix into a lower dimensional space. It is important to observe the eigenvalues of the new matrix.

model = prcomp(returns, scale = TRUE, center = TRUE)
summary(model)

## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5     PC6     PC7
## Standard deviation     3.257 1.204 0.635 0.5091 0.3684 0.25940 0.20749
## Proportion of Variance 0.816 0.111 0.031 0.0199 0.0104 0.00518 0.00331
## Cumulative Proportion  0.816 0.927 0.958 0.9783 0.9887 0.99391 0.99722
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.18868 0.01381 0.01014 0.00967 0.00862 0.00729
## Proportion of Variance 0.00274 0.00001 0.00001 0.00001 0.00001 0.00000
## Cumulative Proportion  0.99996 0.99998 0.99998 0.99999 1.00000 1.00000

par(mfrow = c(1,2))
barplot(model$sdev^2, main = 'Eigenvalues of each component') barplot(cumsum(model$sdev^2)/sum(model$sdev^2), main = 'Cumulative Explained Variance', ylab = 'Variance Explained')  It’s straight forward to see that the projection over just three principal components can explain almost 96% percent of the variance over thirteen contracts! Looking at the factor loadings, is possible to see how each one of them affect the returns of the yields. The figure above shows that the first factor loading is responsible for a level change, also called parallel shift, in all bonds. The second factor loading is called slope change or steepening/flattening of the curve, and the third factor loading is responsible for curvature effect, or convexity. To recover the yield return with maturity (k) at time (t) we just need to calculate $R_{t,k} = W_{1,*}P_{t,1} + W_{2,*}P_{t,2} + W_{3,*}P_{t,3}$, and since the principal components are orthogonal, their covariance matrix has only elements in the diagonal, which are precisely the eigenvalues calculated before. For one standard-deviation, since principal components are orthogonal, the impact on yield returns for the first, second and third component are given by:$$R_{t+1,*} = R_{t,*} \pm \sqrt{\lambda_1}W'_{*,1}$$R_{t+1,*} = R_{t,*} \pm \sqrt{\lambda_2}W'_{*,2}$$R_{t+1,*} = R_{t,*} \pm \sqrt{\lambda_3}W'_{*,3}$$Where $W_{*,k}$ represents the $k$-column of the factor loading matrix and $\sqrt\lambda_k$ represents the standard deviation of the $k$-principal component. Once the invariant was built with the difference between yields, to recover the yield we just use the following expression: $Yield_{t+1,*} = Yield_{t,*} + R_{t+1,*}$ par(mfrow = c(1,3)) hist(model$x[,1], breaks = 20, main = 'Distribution 1 component', xlab = paste('SD', round(model$sdev[1],2))) hist(model$x[,2], breaks = 20, main = 'Distribution 2 component', xlab = paste('SD', round(model$sdev[2],2))) hist(model$x[,3], breaks = 20, main = 'Distribution 3 component', xlab = paste('SD', round(model$sdev[3],2)))  To predict and proceed with risk analysis of the yield curve, one can now model the joint distribution of the factors and get the simulated yield returns as $R_{t+\tau,*} = P_{t+\tau,*}W'$ Where $W$ is the factor loading matrix with the three eigenvalues chosen. The simulated yield can now be constructed just adding the returns to the last yield observation. require(mvtnorm) n.sim = 10000 n.ahead = 5 simulated.returns = array(dim = c(5, 13, 10000)) cumulative.returns = array(dim = c(5, 13, 10000)) for (i in 1:n.sim) { simulated.factors = rmvnorm(5, mean = rep(0,3), sigma = diag(model$sdev[1:3]))
simulated.returns[,,i] = simulated.factors%*%t(model\$rotation[,1:3])

cumulative.returns[,,i] = apply(simulated.returns[,,i], 2, cumsum)
}

simulated.yields = array(dim = c(5, 13, 10000))
for (i in 1:n.sim) {
simulated.yields[u,,i] = yields[nrow(yields), ] + cumulative.returns[u,,i]
}
}

par(mfrow = c(1,2))
plot(yields[570:587,7], type = 'l', main = 'Yield returns 1 year', xlim = c(1,25), ylim = c(0, 12), ylab = 'Yield')
for (i in 1:100) lines(c(rep(NA,17), yields[587,7], simulated.yields[,7,i]), col = i)

plot(yields[570:587,12], type = 'l', main = 'Yield returns 10 years', xlim = c(1,25), ylim = c(0, 12), ylab = 'Yield')
for (i in 1:100) lines(c(rep(NA,17), yields[587,12], simulated.yields[,12,i]), col = i)


And for last, the confidence intervals:

lower = matrix(ncol = ncol(yields), nrow = n.ahead); upper = matrix(ncol = ncol(yields), nrow = n.ahead);
for (i in 1:ncol(yields)) {
lower[u,i] = quantile(simulated.yields[u,i,], probs = 0.025)
upper[u,i] = quantile(simulated.yields[u,i,], probs = 0.975)
}
}

plot(yields[570:587,7], type = 'l', main = 'Yield returns 1 year', xlim = c(1,25), ylim = c(2, 10), ylab = 'Yield', xlab = '')
lines(c(rep(NA,17), yields[587,7], upper[,7]), col = 2)
lines(c(rep(NA,17), yields[587,7], lower[,7]), col = 2)


plot(yields[570:587,12], type = 'l', main = 'Yield returns 10 years', xlim = c(1,25), ylim = c(2, 10), ylab = 'Yield', xlab = '')
lines(c(rep(NA,17), yields[587,12], upper[,12]), col = 2)
lines(c(rep(NA,17), yields[587,12], lower[,12]), col = 2)