Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post will be about replicating the Marcos Lopez de Prado algorithm from his paper building diversified portfolios that outperform out of sample. This algorithm is one that attempts to make a tradeoff between the classic mean-variance optimization algorithm that takes into account a covariance structure, but is unstable, and an inverse volatility algorithm that ignores covariance, but is more stable.

This is a paper that I struggled with until I ran the code in Python (I have anaconda installed but have trouble installing some packages such as keras because I’m on windows…would love to have someone walk me through setting up a Linux dual-boot), as I assumed that the clustering algorithm actually was able to concretely group every asset into a particular cluster (I.E. ETF 1 would be in cluster 1, ETF 2 in cluster 3, etc.). Turns out, that isn’t at all the case.

Here’s how the algorithm actually works.

First off, it computes a covariance and correlation matrix (created from simulated data in Marcos’s paper). Next, it uses a hierarchical clustering algorithm on a distance-transformed correlation matrix, with the “single” method (I.E. friend of friends–do ?hclust in R to read up more on this). The key output here is the order of the assets from the clustering algorithm. Note well: this is the only relevant artifact of the entire clustering algorithm.

Using this order, it then uses an algorithm that does the following:

Initialize a vector of weighs equal to 1 for each asset.

Then, run the following recursive algorithm:

1) Break the order vector up into two equal-length (or as close to equal length) lists as possible.

2) For each half of the list, compute the inverse variance weights (that is, just the diagonal) of the covariance matrix slice containing the assets of interest, and then compute the variance of the cluster when multiplied by the weights (I.E. w’ * S^2 * w).

3) Then, do a basic inverse-variance weight for the two clusters. Call the weight of cluster 0 alpha = 1-cluster_variance_0/(cluster_variance_0 + cluster_variance_1), and the weight of cluster 1 its complement. (1 – alpha).

4) Multiply all assets in the original vector of weights containing assets in cluster 0 with the weight of cluster 0, and all weights containing assets in cluster 1 with the weight of cluster 1. That is, weights[index_assets_cluster_0] *= alpha, weights[index_assets_cluster_1] *= 1-alpha.

5) Lastly, if the list isn’t of length 1 (that is, not a single asset), repeat this entire process until every asset is its own cluster.

Here is the implementation in R code.

First off, the correlation matrix and the covariance matrix for use in this code, obtained from Marcos Lopez De Prado’s code in the appendix in his paper.

> covMat
V1           V2           V3           V4           V5          V6           V7           V8           V9          V10
1   1.000647799 -0.003050479  0.010033224 -0.010759689 -0.005036503 0.008762563  0.998201625 -0.001393196 -0.001254522 -0.009365991
2  -0.003050479  1.009021349  0.008613817  0.007334478 -0.009492688 0.013031817 -0.009420720 -0.015346223  1.010520047  1.013334849
3   0.010033224  0.008613817  1.000739363 -0.000637885  0.001783293 1.001574768  0.006385368  0.001922316  0.012902050  0.007997935
4  -0.010759689  0.007334478 -0.000637885  1.011854725  0.005759976 0.000905812 -0.011912269  0.000461894  0.012572661  0.009621670
5  -0.005036503 -0.009492688  0.001783293  0.005759976  1.005835878 0.005606343 -0.009643250  1.008567427 -0.006183035 -0.007942770
6   0.008762563  0.013031817  1.001574768  0.000905812  0.005606343 1.064309825  0.004413960  0.005780148  0.017185396  0.011601336
7   0.998201625 -0.009420720  0.006385368 -0.011912269 -0.009643250 0.004413960  1.058172027 -0.006755374 -0.008099181 -0.016240271
8  -0.001393196 -0.015346223  0.001922316  0.000461894  1.008567427 0.005780148 -0.006755374  1.074833155 -0.011903469 -0.013738378
9  -0.001254522  1.010520047  0.012902050  0.012572661 -0.006183035 0.017185396 -0.008099181 -0.011903469  1.075346677  1.015220126
10 -0.009365991  1.013334849  0.007997935  0.009621670 -0.007942770 0.011601336 -0.016240271 -0.013738378  1.015220126  1.078586686
> corMat
V1           V2           V3           V4           V5          V6           V7           V8           V9          V10
1   1.000000000 -0.003035829  0.010026270 -0.010693011 -0.005020245 0.008490954  0.970062043 -0.001343386 -0.001209382 -0.009015412
2  -0.003035829  1.000000000  0.008572055  0.007258718 -0.009422702 0.012575370 -0.009117080 -0.014736040  0.970108941  0.971348946
3   0.010026270  0.008572055  1.000000000 -0.000633903  0.001777455 0.970485047  0.006205079  0.001853505  0.012437239  0.007698212
4  -0.010693011  0.007258718 -0.000633903  1.000000000  0.005709500 0.000872861 -0.011512172  0.000442908  0.012052964  0.009210090
5  -0.005020245 -0.009422702  0.001777455  0.005709500  1.000000000 0.005418538 -0.009347204  0.969998023 -0.005945165 -0.007625721
6   0.008490954  0.012575370  0.970485047  0.000872861  0.005418538 1.000000000  0.004159261  0.005404237  0.016063910  0.010827955
7   0.970062043 -0.009117080  0.006205079 -0.011512172 -0.009347204 0.004159261  1.000000000 -0.006334331 -0.007592568 -0.015201540
8  -0.001343386 -0.014736040  0.001853505  0.000442908  0.969998023 0.005404237 -0.006334331  1.000000000 -0.011072068 -0.012759610
9  -0.001209382  0.970108941  0.012437239  0.012052964 -0.005945165 0.016063910 -0.007592568 -0.011072068  1.000000000  0.942667300
10 -0.009015412  0.971348946  0.007698212  0.009210090 -0.007625721 0.010827955 -0.015201540 -0.012759610  0.942667300  1.000000000


Now, for the implementation.

This reads in the two matrices above and gets the clustering order.

covMat <- read.csv('cov.csv', header = FALSE)

clustOrder <- hclust(dist(corMat), method = 'single')\$order


This is the clustering order:

> clustOrder
[1]  9  2 10  1  7  3  6  4  5  8


Next, the getIVP (get Inverse Variance Portfolio) and getClusterVar functions (note: I’m trying to keep the naming conventions identical to Dr. Lopez’s paper)

getIVP <- function(covMat) {
invDiag <- 1/diag(as.matrix(covMat))
weights <- invDiag/sum(invDiag)
return(weights)
}

getClusterVar <- function(covMat, cItems) {
covMatSlice <- covMat[cItems, cItems]
weights <- getIVP(covMatSlice)
cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
return(cVar)
}


Next, my code diverges from the code in the paper, because I do not use the list comprehension structure, but instead opt for a recursive algorithm, as I find that style to be more readable.

One wrinkle to note is the use of the double arrow dash operator, to assign to a variable outside the scope of the recurFun function. I assign the initial weights vector w in the global environment, and update it from within the recurFun function. I am aware that it is a faux pas to create variables in the global environment, but my attempts at creating a temporary environment in which to update the weight vector did not produce the updating mechanism I had hoped to, so a little bit of assistance with refactoring this code would be appreciated.

getRecBipart <- function(covMat, sortIx) {
# keeping track of w in the global environment
assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv)
recurFun(covMat, sortIx)
return(w)
}

recurFun <- function(covMat, sortIx) {
subIdx <- 1:trunc(length(sortIx)/2)
cItems0 <- sortIx[subIdx]
cItems1 <- sortIx[-subIdx]
cVar0 <- getClusterVar(covMat, cItems0)
cVar1 <- getClusterVar(covMat, cItems1)
alpha <- 1 - cVar0/(cVar0 + cVar1)

# scoping mechanics using w as a free parameter
w[cItems0] <<- w[cItems0] * alpha
w[cItems1] <<- w[cItems1] * (1-alpha)

if(length(cItems0) > 1) {
recurFun(covMat, cItems0)
}
if(length(cItems1) > 1) {
recurFun(covMat, cItems1)
}
}


Lastly, let’s run the function.

out <- getRecBipart(covMat, clustOrder)


With the result (which matches the paper):

> out
[1] 0.06999366 0.07592151 0.10838948 0.19029104 0.09719887 0.10191545 0.06618868 0.09095933 0.07123881 0.12790318


So, hopefully this democratizes the use of this technology in R. While I have seen a raw Rcpp implementation and one from the Systematic Investor Toolbox, neither of those implementations satisfied me from a “plug and play” perspective. This implementation solves that issue. Anyone here can copy and paste these functions into their environment and immediately make use of one of the algorithms devised by one of the top minds in quantitative finance.

A demonstration in a backtest using this methodology will be forthcoming.