Hypothesis-Driven Development Part V: Stop-Loss, Deflating Sharpes, and Out-of-Sample
This post will demonstrate a stop-loss rule inspired by Andrew Lo’s paper “When Do Stop-Loss Rules Stop Losses?” Furthermore, it will demonstrate how to deflate a Sharpe ratio to account for the total number of trials conducted, using the method presented in a paper by David H. Bailey and Marcos Lopez De Prado. Lastly, the strategy will be tested out-of-sample on ETFs, rather than the mutual funds that have been used up until now (which actually cannot be traded more than once every two months, and have been used simply for the purpose of demonstration).
First, however, I’d like to fix some code from the last post and append some results.
A reader asked about displaying the max drawdown for each of the previous volatility-controlled rule-testing variants, and Brian Peterson also recommended displaying max leverage, both of which this post provides.
Here’s the updated rule backtest code:
ruleBacktest <- function(returns, nMonths, dailyReturns, nSD = 126, volTarget = .1) {
  nMonthAverage <- apply(returns, 2, runSum, n = nMonths)
  nMonthAverage <- na.omit(xts(nMonthAverage, order.by = index(returns)))
  nMonthAvgRank <- t(apply(nMonthAverage, 1, rank))
  nMonthAvgRank <- xts(nMonthAvgRank, order.by = index(nMonthAverage))
  selection <- (nMonthAvgRank == 5) * 1 # select highest average performance
  dailyBacktest <- Return.portfolio(R = dailyReturns, weights = selection)
  constantVol <- volTarget / (runSD(dailyBacktest, n = nSD) * sqrt(252))
  monthlyLeverage <- na.omit(constantVol[endpoints(constantVol, on = "months")])
  wts <- cbind(monthlyLeverage, 1 - monthlyLeverage)
  constantVolComponents <- cbind(dailyBacktest, 0)
  out <- Return.portfolio(R = constantVolComponents, weights = wts)
  out <- apply.monthly(out, Return.cumulative)
  maxLeverage <- max(monthlyLeverage, na.rm = TRUE)
  return(list(out, maxLeverage))
}

t1 <- Sys.time()
allPermutations <- list()
allDDs <- list()
leverages <- list()
for(i in seq(21, 252, by = 21)) {
  monthVariants <- list()
  ddVariants <- list()
  leverageVariants <- list()
  for(j in 1:12) {
    trial <- ruleBacktest(returns = monthRets, nMonths = j, dailyReturns = sample, nSD = i)
    sharpe <- table.AnnualizedReturns(trial[[1]])[3,]
    dd <- maxDrawdown(trial[[1]])
    monthVariants[[j]] <- sharpe
    ddVariants[[j]] <- dd
    leverageVariants[[j]] <- trial[[2]]
  }
  allPermutations[[i]] <- do.call(c, monthVariants)
  allDDs[[i]] <- do.call(c, ddVariants)
  leverages[[i]] <- do.call(c, leverageVariants)
}
allPermutations <- do.call(rbind, allPermutations)
allDDs <- do.call(rbind, allDDs)
leverages <- do.call(rbind, leverages)
t2 <- Sys.time()
print(t2 - t1)
Here are the two new plots.
meltedDDs <- melt(allDDs)
colnames(meltedDDs) <- c("volFormation", "momentumFormation", "MaxDD")
ggplot(meltedDDs, aes(x = momentumFormation, y = volFormation, fill = MaxDD)) +
  geom_tile() + scale_fill_gradient2(high = "red", mid = "yellow", low = "green", midpoint = 0.2)

meltedLeverage <- melt(leverages)
colnames(meltedLeverage) <- c("volFormation", "momentumFormation", "MaxLeverage")
ggplot(meltedLeverage, aes(x = momentumFormation, y = volFormation, fill = MaxLeverage)) +
  geom_tile() + scale_fill_gradient2(high = "red", mid = "yellow", low = "green", midpoint = 2.5)
Drawdowns:
Leverage:
Here are the results presented as a hypothesis test: a linear regression of drawdowns and leverage against the momentum formation period and the volatility calculation period:
ddLM <- lm(meltedDDs$MaxDD ~ meltedDDs$volFormation + meltedDDs$momentumFormation)
summary(ddLM)

Call:
lm(formula = meltedDDs$MaxDD ~ meltedDDs$volFormation + meltedDDs$momentumFormation)

Residuals:
     Min       1Q   Median       3Q      Max
-0.08022 -0.03434 -0.00135  0.02911  0.20077

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                  0.240146   0.010922   21.99  < 2e-16 ***
meltedDDs$volFormation      -0.000484   0.000053   -9.13  6.5e-16 ***
meltedDDs$momentumFormation  0.001533   0.001112    1.38     0.17
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0461 on 141 degrees of freedom
Multiple R-squared:  0.377,  Adjusted R-squared:  0.368
F-statistic: 42.6 on 2 and 141 DF,  p-value: 3.32e-15

levLM <- lm(meltedLeverage$MaxLeverage ~ meltedLeverage$volFormation + meltedDDs$momentumFormation)
summary(levLM)

Call:
lm(formula = meltedLeverage$MaxLeverage ~ meltedLeverage$volFormation + meltedDDs$momentumFormation)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9592 -0.5179 -0.0908  0.3679  3.1022

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                  4.076870   0.164243   24.82   <2e-16 ***
meltedLeverage$volFormation -0.009916   0.000797  -12.45   <2e-16 ***
meltedDDs$momentumFormation  0.009869   0.016727    0.59     0.56
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.693 on 141 degrees of freedom
Multiple R-squared:  0.524,  Adjusted R-squared:  0.517
F-statistic: 77.7 on 2 and 141 DF,  p-value: <2e-16
The interpretation here is straightforward: the shorter-term volatility estimates are unstable due to the one-asset-rotation nature of the system. Particularly silly is the one-month volatility estimate. Imagine the system just switched from the lowest-volatility instrument to the highest. It would then take on excessive leverage and get blown up that month for no particularly good reason. A longer-term volatility estimate does much better for this system. So, while the Sharpe is generally improved, the results become far more palatable when using a more stable volatility calculation, which caps maximum leverage at about 2 when targeting an annualized volatility of 10%. Also of note, the volatility computation period matters far more than the momentum formation period when it comes to volatility targeting, which lends credence (at least in this case) to the many people who say “the individual signal rules matter far less than the position-sizing rules!” According to some, position sizing is often a way to mask marginally effective (read: bad) strategies with a separate layer that creates a better-looking result. I’m not sure which side of that debate (assuming there even is one) I fall on, but for what it’s worth, there it is.
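To make the leverage mechanics concrete, here is a minimal sketch of the volatility-targeting calculation that ruleBacktest performs each month. The numbers are hypothetical, purely for illustration, not backtest output:

# Hypothetical numbers, purely for illustration of the leverage calculation.
volTarget   <- 0.10                     # annualized volatility target
realizedVol <- 0.05                     # trailing annualized volatility estimate
leverage    <- volTarget / realizedVol  # 2x leverage when realized vol is half the target
leverage
# [1] 2
volTarget / 0.02                        # an unstable short-window estimate of 2% would imply 5x leverage
# [1] 5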
Moving on, I want to test one more rule, inspired by Andrew Lo’s stop-loss rule. Essentially, the way it works (to my interpretation) is this: the rule tracks a running standard deviation, and if the drawdown exceeds some multiple of that running standard deviation, the strategy sits out for a fixed period of time and then re-enters. According to Andrew Lo, stop-losses help momentum strategies, so it seems as good a rule to test as any.
However, rather than test different permutations of the stop rule on all 144 prior combinations of volatility-adjusted configurations, I’m going to take an ensemble approach, inspired by a conversation I had with Adam Butler, the CEO of ReSolve Asset Management, who stated that “we know momentum exists, but we don’t know the perfect way to measure it”. From the section I just finished, I’ll take all 12 momentum formation periods, each with a 252-day rolling annualized volatility calculation, and equal-weight them every month.
Here are the base case results from that trial (bringing our total to 169).
strat <- list()
for(i in 1:12) {
  strat[[i]] <- ruleBacktest(returns = monthRets, nMonths = i, dailyReturns = sample, nSD = 252)[[1]]
}
strat <- do.call(cbind, strat)
strat <- Return.portfolio(R = na.omit(strat), rebalance_on = "months")

rbind(table.AnnualizedReturns(strat), maxDrawdown(strat), CalmarRatio(strat))
With the following result:
                          portfolio.returns
Annualized Return                   0.12230
Annualized Std Dev                  0.10420
Annualized Sharpe (Rf=0%)           1.17340
Worst Drawdown                      0.09616
Calmar Ratio                        1.27167
Of course, also worth noting is that the annualized standard deviation is indeed very close to 10%, even with the ensemble. And it’s nice that the Sharpe is above 1. That said, given that these are mutual funds being backtested, the results are optimistic due to the unrealistic execution assumptions (you can’t trade sooner than once every *two* months).
Anyway, let’s introduce our stop-loss rule, inspired by Andrew Lo’s paper.
loStopLoss <- function(returns, sdPeriod = 12, sdScaling = 1, sdThresh = 1.5, cooldown = 3) {
  stratRets <- list()
  count <- 1
  stratComplete <- FALSE
  originalRets <- returns
  ddThresh <- -runSD(returns, n = sdPeriod) * sdThresh * sdScaling
  while(!stratComplete) {
    retDD <- PerformanceAnalytics:::Drawdowns(returns)
    DDbreakthrough <- retDD < ddThresh & lag(retDD) > ddThresh
    firstBreak <- which.max(DDbreakthrough) # first threshold breakthrough; if 1, we have no breakthrough
    # the above line is unintuitive since this is a boolean vector, so it returns the first value of TRUE
    if(firstBreak > 1) { # we have a drawdown breakthrough if this is true
      stratRets[[count]] <- returns[1:firstBreak,] # subset returns through our threshold breakthrough
      nextPoint <- firstBreak + cooldown + 1 # next point of re-entry is the point after the cooldown period
      if(nextPoint <= (nrow(returns) - 1)) { # if we can re-enter, subset the returns and return to top of loop
        returns <- returns[nextPoint:nrow(returns),]
        ddThresh <- ddThresh[nextPoint:nrow(ddThresh),]
        count <- count + 1
      } else { # re-entry point is after data exhausted, end strategy
        stratComplete <- TRUE
      }
    } else { # there are no more critical drawdown breakthroughs, end strategy
      stratRets[[count]] <- returns
      stratComplete <- TRUE
    }
  }
  stratRets <- do.call(rbind, stratRets) # combine returns
  expandRets <- cbind(stratRets, originalRets) # account for all the days we missed
  expandRets[is.na(expandRets[,1]), 1] <- 0 # cash positions will be zero
  rets <- expandRets[,1]
  colnames(rets) <- paste(cooldown, sdThresh, sep = "_")
  return(rets)
}
Essentially, the way it works is this: the function computes all the drawdowns for a return series, along with its running standard deviation (non-annualized; if you want to annualize it, change the sdScaling parameter to something like sqrt(12) for monthly data or sqrt(252) for daily data). Next, it looks for when the drawdown crosses the critical threshold, cuts off that portion of the return and standard-deviation history, moves ahead by the specified cooldown period, and repeats. Most of the code simply deals with corner cases (is there even a time to use the stop rule? What about iterating when there isn’t enough data left?) and with putting the results back together again.
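As a quick usage sketch before the full parameter sweep below, a single call with the function’s default threshold and cooldown, applied to the ensemble’s monthly returns, would look like this (the stoppedStrat name is just illustrative):

# Single illustrative call using the function defaults (sdThresh = 1.5, cooldown = 3);
# 'stoppedStrat' is an arbitrary name for this example.
stoppedStrat <- loStopLoss(returns = strat, sdPeriod = 12, sdThresh = 1.5, cooldown = 3)
table.AnnualizedReturns(cbind(strat, stoppedStrat))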
In any case, for the sake of simplicity, this function doesn’t use two different time scales (i.e., compute volatility using daily data but make decisions monthly), so I’m sticking with a 12-month rolling volatility, as opposed to a 252-day rolling volatility multiplied by the square root of 21.
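For illustration, here is a minimal sketch of the two ways such a rolling volatility could be computed. The object names monthlyRets and dailyRets are hypothetical placeholders, not objects from the backtest above:

library(TTR) # runSD comes from TTR (already loaded if you use quantmod/PerformanceAnalytics)

# Option used here: 12-month rolling standard deviation of monthly returns
monthlyVol <- runSD(monthlyRets, n = 12)                     # monthlyRets: hypothetical xts of monthly returns

# Alternative (not used): 252-day rolling standard deviation of daily returns,
# scaled by sqrt(21) to put it on a monthly scale
monthlyVolFromDaily <- runSD(dailyRets, n = 252) * sqrt(21)  # dailyRets: hypothetical xts of daily returns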
Finally, here are another 54 runs to see if Andrew Lo’s stop-loss rule works here. Essentially, the intuition behind this is that if the strategy breaks down, it’ll continue to break down, so it would be prudent to just turn it off for a little while.
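One housekeeping note: the code below references threshVec and cooldownVec, which aren’t shown in this post. A hypothetical grid such as the following would produce the 54 combinations mentioned above:

# Hypothetical parameter grid -- the actual vectors aren't shown in this post.
# Any 9 x 6 (or 6 x 9) grid yields the 54 combinations referenced above.
threshVec   <- seq(1, 3, by = 0.25)  # drawdown thresholds, in units of running standard deviation
cooldownVec <- 1:6                   # months to sit out after a stop is triggered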
Here are the trial runs:
sharpes <- list()
params <- expand.grid(threshVec, cooldownVec)
for(i in 1:nrow(params)) {
  configuration <- loStopLoss(returns = strat, sdThresh = params[i,1], cooldown = params[i,2])
  sharpes[[i]] <- table.AnnualizedReturns(configuration)[3,]
}
sharpes <- do.call(c, sharpes)

loStoplossFrame <- cbind(params, sharpes)
loStoplossFrame$improvement <- loStoplossFrame[,3] - table.AnnualizedReturns(strat)[3,]
colnames(loStoplossFrame) <- c("Threshold", "Cooldown", "Sharpe", "Improvement")
And a plot of the results.
ggplot(loStoplossFrame, aes(x = Threshold, y = Cooldown, fill = Improvement)) +
  geom_tile() + scale_fill_gradient2(high = "green", mid = "yellow", low = "red", midpoint = 0)
Result:
At this level and at this frequency (retaining the monthly decision-making process), the stop-loss rule does essentially nothing to improve the risk-reward trade-off in the best-case scenarios, and in most scenarios it simply hurts. That’s 54 trials down the drain, bringing us up to 223 trials. So, what does the final result look like?
charts.PerformanceSummary(strat)
Here’s the final in-sample equity curve, and the first one featured in this entire series. Holding off on looking at the equity curve until now is, of course, a *feature* of hypothesis-driven development: playing whack-a-mole with equity curve bumps is a textbook case of overfitting. So, without further ado:
And now we can see why stop-loss rules generally didn’t add any value to this strategy: quite simply, it had very few periods of sustained losses at the monthly frequency, and thus very little opportunity for a stop-loss rule to add value. Sure, the occasional negative month crept in, but there was no extended stretch of them. Furthermore, Yahoo Finance may not have perfect fidelity on mutual fund dividends from the late ’90s to early 2000s, so the initial flat performance may also be a rather conservative estimate of the strategy’s performance (then again, as I stated before, using the mutual funds themselves is optimistic given the unrealistic execution assumptions, so maybe it cancels out). Now, if this equity curve were presented without any context, one might easily question whether it was curve-fit. To an extent, one can argue that the volatility computation period may be optimized, though I’d hardly call a 252-day (one-year) rolling volatility estimate a curve fit.
Next, I’d like to introduce another concept on this blog, one I’ve seen colloquially addressed elsewhere in the quantitative blogging space, particularly by Mike Harris of Price Action Lab: multiple hypothesis testing, and the need to correct for it.
Luckily, Drs. David H. Bailey and Marcos Lopez De Prado wrote a paper addressing just that. I’d also like to note one very cool thing about this paper: it actually has a worked-out numerical example! In my opinion, few things are as helpful as a simple worked example that shows, in the span of one page, what a collection of mathematical symbols actually means. Oh, and it also includes *code* in the appendix (albeit in Python, even though, you know, R is far more developed; if someone can get Marcos Lopez De Prado to switch to R, a.k.a. the better research language, that’d be a godsend!).
In any case, here’s the formula for the deflated Sharpe ratio, implemented straight from the paper.
deflatedSharpe <- function(sharpe, nTrials, varTrials, skew, kurt, numPeriods, periodsInYear) {
  emc <- .5772 # Euler-Mascheroni constant

  # expected maximum Sharpe ratio under the null, given the number and variance of the trials
  sr0_term1 <- (1 - emc) * qnorm(1 - 1/nTrials)
  sr0_term2 <- emc * qnorm(1 - 1/nTrials * exp(-1))
  sr0 <- sqrt(varTrials * 1/periodsInYear) * (sr0_term1 + sr0_term2)

  numerator <- (sharpe/sqrt(periodsInYear) - sr0) * sqrt(numPeriods - 1)

  skewnessTerm <- 1 - skew * sharpe/sqrt(periodsInYear)
  kurtosisTerm <- (kurt - 1)/4 * (sharpe/sqrt(periodsInYear))^2

  denominator <- sqrt(skewnessTerm + kurtosisTerm)

  result <- pnorm(numerator/denominator)
  pval <- 1 - result
  return(pval)
}
The inputs are the strategy’s Sharpe ratio, the number of backtest runs, the variance of the Sharpe ratios of those backtest runs, the skewness of the candidate strategy, its non-excess kurtosis, the number of periods in the backtest, and the number of periods in a year. Unlike the De Prado paper, I choose to return the p-value (that is, 1 minus the probability the formula produces).
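For reference, here is the quantity the function computes, written out as I read it from the code above (a transcription of the implementation, not a verbatim reproduction of the paper’s notation):

$$SR_0 = \sqrt{\frac{V[\{SR_n\}]}{\text{periodsInYear}}}\left[(1-\gamma)\,\Phi^{-1}\!\left(1-\frac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1-\frac{1}{N}e^{-1}\right)\right]$$

$$\text{DSR} = \Phi\!\left[\frac{\left(\widehat{SR} - SR_0\right)\sqrt{T-1}}{\sqrt{1-\gamma_3\,\widehat{SR} + \frac{\gamma_4-1}{4}\,\widehat{SR}^2}}\right]$$

where γ ≈ 0.5772 is the Euler-Mascheroni constant, N is the number of trials, V[{SR_n}] is the variance of the trials’ annualized Sharpe ratios, ŜR is the candidate strategy’s per-period Sharpe ratio (the annualized Sharpe divided by the square root of periodsInYear), T is the number of return observations, γ₃ is the skewness, and γ₄ is the (non-excess) kurtosis. The function then returns 1 − DSR as a p-value.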
Let’s collect all our Sharpe ratios now.
allSharpes <- c(as.numeric(table.AnnualizedReturns(sigBoxplots)[3,]),
                meltedSharpes$Sharpe,
                as.numeric(table.AnnualizedReturns(strat)[3,]),
                loStoplossFrame$Sharpe)
And now, let’s plug and chug!
stratSignificant <- deflatedSharpe(sharpe = as.numeric(table.AnnualizedReturns(strat)[3,]),
                                   nTrials = length(allSharpes),
                                   varTrials = var(allSharpes),
                                   skew = as.numeric(skewness(strat)),
                                   kurt = as.numeric(kurtosis(strat)) + 3, # kurtosis() gives excess kurtosis, so add 3
                                   numPeriods = nrow(strat),
                                   periodsInYear = 12)
And the result!
> stratSignificant
[1] 0.01311
Success! At least at the 5% level. At the 1% level, and at anything more stringent, however, the strategy would be rejected.
So, one last thing! Out-of-sample testing on ETFs (and mutual funds during the ETF burn-in period)!
symbols2 <- c("CWB", "JNK", "TLT", "SHY", "PCY")
getSymbols(symbols2, from = '1900-01-01')
prices2 <- list()
for(tmp in symbols2) {
  prices2[[tmp]] <- Ad(get(tmp))
}
prices2 <- do.call(cbind, prices2)
colnames(prices2) <- substr(colnames(prices2), 1, 3)
returns2 <- na.omit(Return.calculate(prices2))
monthRets2 <- apply.monthly(returns2, Return.cumulative)

oosStrat <- list()
for(i in 1:12) {
  oosStrat[[i]] <- ruleBacktest(returns = monthRets2, nMonths = i, dailyReturns = returns2, nSD = 252)[[1]]
}
oosStrat <- do.call(cbind, oosStrat)
oosStrat <- Return.portfolio(R = na.omit(oosStrat), rebalance_on = "months")

symbols <- c("CNSAX", "FAHDX", "VUSTX", "VFISX", "PREMX")
getSymbols(symbols, from = '1900-01-01')
prices <- list()
for(symbol in symbols) {
  prices[[symbol]] <- Ad(get(symbol))
}
prices <- do.call(cbind, prices)
colnames(prices) <- substr(colnames(prices), 1, 5)
oosMFreturns <- na.omit(Return.calculate(prices))
oosMFmonths <- apply.monthly(oosMFreturns, Return.cumulative)

oosMF <- list()
for(i in 1:12) {
  oosMF[[i]] <- ruleBacktest(returns = oosMFmonths, nMonths = i, dailyReturns = oosMFreturns, nSD = 252)[[1]]
}
oosMF <- do.call(cbind, oosMF)
oosMF <- Return.portfolio(R = na.omit(oosMF), rebalance_on = "months")
oosMF <- oosMF["2009-04/2011-03"]

fullOOS <- rbind(oosMF, oosStrat)

rbind(table.AnnualizedReturns(fullOOS), maxDrawdown(fullOOS), CalmarRatio(fullOOS))
charts.PerformanceSummary(fullOOS)
And the results:
                          portfolio.returns
Annualized Return                    0.1273
Annualized Std Dev                   0.0901
Annualized Sharpe (Rf=0%)            1.4119
Worst Drawdown                       0.1061
Calmar Ratio                         1.1996
And one more equity curve (only the second!).
In other words, the out-of-sample statistics hold up well against the in-sample statistics. The Sharpe ratio is higher and the Calmar slightly lower, but on the whole, the performance has kept up. Unfortunately, the strategy is currently in a drawdown, but that’s the breaks.
So, whew. That concludes my first go at hypothesis-driven development, which has hopefully demonstrated the process to a satisfactory degree. What started off as a toy strategy went from a rejection to a non-rejection, demonstrated ideas from three separate papers along the way, and produced out-of-sample statistics that largely matched, if not outperformed, the in-sample statistics. For those thinking about investing in this strategy (to recap: take 12 different portfolios, each selecting the asset with the highest momentum over formation periods of 1 through 12 months, target an annualized volatility of 10% with volatility defined as the rolling annualized 252-day standard deviation, and equal-weight the portfolios every month), what I didn’t cover was turnover and taxes (this is a bond ETF strategy, so dividends will play a large role).
Now, one other request–many of the ideas for this blog come from my readers. I am especially interested in things to think about from readers with line-management responsibilities, as I think many of the questions from those individuals are likely the most universally interesting ones. If you’re one such individual, I’d appreciate an introduction, and knowing who more of the individuals in my reader base are.
Thanks for reading.
NOTE: while I am currently consulting, I am always open to networking, meeting up, consulting arrangements, and job discussions. Contact me through my email at [email protected], or through my LinkedIn, found here.