**Quintuitive » R**, and kindly contributed to R-bloggers)

In this tutorial I am going to share my R&D and trading experience using the well-known from statistics Autoregressive Moving Average Model (ARMA). There is a lot written about these models, however, I strongly recommend Introductory Time Series with R, which I find is a perfect combination between light theoretical background and practical implementations in R. Another good reading is the online e-book Forecasting: principles and practice written by Rob Hyndman, an expert in statistical forecasting and the author of the excellent forecast R package.

## Getting Started

In R, I am mostly using the fArma package, which is a nice wrapper with extended functionality around the *arima* function from the *stats* package (used in the above-mentioned book). Here is a simple session of fitting an ARMA model to the S&P 500 daily returns:

library( quantmod ) library( fArma ) # Get S&P 500 getSymbols( "^GSPC", from="2000-01-01" ) # Compute the daily returns gspcRets = diff( log( Cl( GSPC ) ) ) # Use only the last two years of returns gspcTail = as.ts( tail( gspcRets, 500 ) ) # Fit the model gspcArma = armaFit( formula=~arma(2,2), data=gspcTail )

For more details, please refer to the literature and the packages, I just want to emphasize on a couple of points:

- We model the daily returns instead of the prices. There are multiples reasons: this way financial series usually become stationary, we need some way to “normalize” a series, etc.
- We use the
*diff*and*log*function to compute the daily returns instead of percentages. Not only this is a standard practice in statistics, but it also provides a damn good approximation to the discrete returns.

The approach I will present here is a form of walk-forward backtesting. While walking the series day by day, we will use history of certain length to find the best model. Then we will use this model to predict the next day’s return. If the prediction is negative, we assume short position, otherwise we assume a long position.

An example will make things clearer: After the close of June 11th, 2012, we compute the last 500 daily returns. Using these returns we search through the space of ARMA models and select the best-fitting (with respect to some metric and some requirements) model. Finally, we use this model to compute the prediction for the tomorrow’s return and use the sign of the return to decide the appropriate position.

## Choosing a Good Model

The first obstacle for this method before it could be useful to us, is to select the model parameters. In the case of ARMA, there are two parameters. In other words, there is an infinite number of choices: (0,1), (1,0), (1,1), (2,1), etc. How do we know what parameters to use?

A common approach in statistics to quantify the goodness of fit test is the AIC (for Akaike Information Criteria) statistic. Once the fitting is done, the value of the aic statistics is accessible via:

xxArma = armaFit( xx ~ arma( 5, 1 ), data=xx ) xxArma@fit$aic

There are other statistics of course, however, typically the results are quite similar.

To summarize, all we need is a loop to go through all parameter combinations we deem reasonable, for instance from (0,0) to (5,5), inclusive, for each parameter pair fit the model, and finally pick the model with the lowest AIC or some other statistic.

# https://gist.github.com/2913657 armaSearch = function( xx, minOrder=c(0,0), maxOrder=c(5,5), trace=FALSE ) { bestAic = 1e9 len = NROW( xx ) for( p in minOrder[1]:maxOrder[1] ) for( q in minOrder[2]:maxOrder[2] ) { if( p == 0 && q == 0 ) { next } formula = as.formula( paste( sep="", "xx ~ arma(", p, ",", q, ")" ) ) fit = tryCatch( armaFit( formula, data=xx ), error=function( err ) FALSE, warning=function( warn ) FALSE ) if( !is.logical( fit ) ) { fitAic = fit@fit$aic if( fitAic < bestAic ) { bestAic = fitAic bestFit = fit bestModel = c( p, q ) } if( trace ) { ss = paste( sep="", "(", p, ",", q, "): AIC = ", fitAic ) print( ss ) } } else { if( trace ) { ss = paste( sep="", "(", p, ",", q, "): None" ) print( ss ) } } } if( bestAic < 1e9 ) { return( list( aic=bestAic, fit=bestFit, model=bestModel ) ) } return( FALSE ) }

Note that sometimes armaFit fails to find a fit and returns an error, thus quitting the loop immediately. *armaSearch* handles this problem by using the *tryCatch* function to catch any error or warning and return a logical value (FALSE) instead of interrupting everything and exiting with an error. Thus we can distinguish an erroneous and normal function return just by checking the type of the result. A bit messy probably, but it works.

Some R packages, forecast and rugarch for instance, provide a similar, *auto.arima* function out of the box. So one can build his infrastructure around one of these instead.

## Forecasting

Once the parameters are selected, it’s time to determine the position at the close. One way to do that is by a one day ahead prediction, if the prediction comes negative (remember the series we are operating on is the daily returns) then the desired position is short, otherwise it’s long.

library( quantmod ) library( fArma ) getSymbols( "SPY", from="1900-01-01" ) spyRets = diff( log( Cl( SPY )["/2012-05-29"] ) ) spyArma = armaFit( ~arma(0, 2), data=as.ts( tail( spyRets, 500 ) ) ) as.numeric( predict( spyArma, n.ahead=1, doplot=F )$pred ) # -0.0004558926

Now, to build an indicator for the back testing, one can walk the daily return series and at each point perform the steps we covered so far. The main loop looks like (shortened on purpose):

# currentIndex is the index of the day we are making a forcast for # xx is the return series # history is look-back period to consider at each point repeat { nextIndex = currentIndex + 1 # lags is how many days behind is the data, the default is 1, # meaning use data up to yesterdays close forecastLength = nextIndex - currentIndex + lags - 1 # Get the series yy = xx[index(xx)[(currentIndex-history-lags+1):(currentIndex-lags)]] # Find the best fit bestFit = armaSearch( yy, minOrder, maxOrder, withForecast=TRUE, # we want the model to have a valid forecast forecastLength=forecastLength, # 1 for a dialy forecast trace=trace, cores=cores ) # the number of cores to use if( !is.null( bestFit ) ) { # Forecast fore = tryCatch( predict( bestFit, n.ahead=forecastLength, doplot=FALSE ), error=function( err ) FALSE, warning=function( warn ) FALSE ) if( !is.logical( fore ) ) { # Save the forecast forecasts[currentIndex] = tail( fore$pred, 1 ) # Save the model order ars[currentIndex] = order[1] mas[currentIndex] = order[2] forecasts[currentIndex] = 0 } if( nextIndex > len ) break currentIndex = nextIndex } }

Where *history* is the look-back period to consider at each point, I usually use 500, which is about two years of data. In other words, to determine the position at each individual day (previous day close to the current day close determines the return) we use history of 500 days, lagged by *lags* day. You will see later how *lags* comes into play in practice.

Notice, that *predict* has also to be surrounded by a *tryCatch* block. *armaSearch* also has the nice feature to determine whether a model has a forecast or not (*predict* succeeds or not, this test is controlled via the *withForecast* parameter).

## Improving Performance

The number of computations we have to do adds up quickly. For example, for 10 years of historic data we need to compute about 2,520 trading days. For each day we are going to fit and predict at least 35 (35=6*6-1, 0 to 5 both for the AR and MA component, but excluding the (0,0) combination) models. Multiplying the number of models by the number of days, and we are already looking at more than 88 thousand model fits – that’s a lot of computations.

One way to improve the performance of these necessary computations can be achieved by exploiting multi-core CPUs. My approach is to parallelize the model selection, the *armaSearch* function in the above code. Although this may not be the most efficient approach, it is certainly the more practical since it will also boost the performance of *armaSearch* when used independently.

I won’t post the final version of the code here due to it’s length. I will give you the GIST link instead!

## Modeling Volatility with GARCH

Financial time series are random in general. One of the few properties they exhibit is Volatility Clustering. This is typically achieved by extending the ARMA forecasting with a GARCH model. Sounds complex, and the theoretical details are complex indeed, but it turns out to be pretty straightforward in R:

library(quantmod) library(fGarch) getSymbols("SPY", from="1900-01-01") spyRets = diff(log(Ad(SPY))) spyGarch = garchFit(~arma(0, 2) + garch(1, 1), data=as.ts(tail(spyRets, 500))) predict(spyGarch, n.ahead=1, doplot=F) # the actual forecasts are predict(spyGarch, n.ahead=1, doplot=F)[,1]

Of course, we also need to modify all relevant functions, like *armaSearch*. Calls to *garchFit* and *predict* also need to be handled via *tryCatch*. Notice also that *predict* returns a matrix for GARCH models.

If you are interested in the full source code, you can contact us.

## S&P 500 Performance

Let’s start with the equity curve of applying the ARMA+GARCH strategy over the full 60 years (since 1950) of S&P 500 historic data.

It looks fantastic! In fact, it impressed me so much that I looked for bugs in the code for quite some time. Even on a logarithmic chart the performance of this method is stunning – CAGR of 18.87%, and the ARMA+GARCH strategy achieves this performance with a comparable maximum drawdown of 56%.

To compute the ARMA strategy growth, we first need the daily indicator (this indicator takes about two days to compute with all optimizations I went through).

The first column is the date, the second the position for this day: 1 for long, -1 for short, 0 for none. Note, the position is already aligned with the day of the return (it is computed at the close of the previous day), in other words, the indicator is aligned properly with the returns – no need to shift right via *lag*. The indicator, the first column, needs to be multiplied with the S&P 500 daily returns. The rest of the columns are irrelevant and hopefully self-explanatory.

Let’s wrap up the post with the code that loads the indicator and plots the graphic:

library(quantmod) library(lattice) library(timeSeries) getSymbols("^GSPC", from="1900-01-01") gspcRets = Ad(GSPC) / lag(Ad(GSPC)) - 1 gspcRets[as.character(head(index(Ad(GSPC)),1))] = 0 # The maximum draw down head(drawdownsStats(as.timeSeries(gspcRets)),10) # The largest dropdawn is: # From Trough To Depth Length ToTrough Recovery # 1 2007-10-10 2009-03-09 2012-09-28 -0.5677539 1255 355 NA # Load the ARMA indicator gspcArmaInd = as.xts( read.zoo(file="gspcInd.csv", format="%Y-%m-%d", header=T, sep=",") ) # Filter out only the common indexes mm = merge( gspcArmaInd[,1], gspcRets, all=F ) gspcArmaRets = mm[,1] * mm[,2] # The maximum draw down head(drawdownsStats(as.timeSeries(gspcArmaRets)),10) # The largest dropdawn is: # From Trough To Depth Length ToTrough Recovery # 1 1987-10-26 1992-10-09 1997-10-27 -0.5592633 2531 1255 1276 gspcArmaGrowth = log( cumprod( 1 + gspcArmaRets ) ) gspcBHGrowth = log( cumprod( 1 + mm[,2] ) ) gspcAllGrowth = merge( gspcArmaGrowth, gspcBHGrowth, all=F ) xyplot( gspcAllGrowth, superpose=T, col=c("darkgreen", "darkblue"), lwd=2, key=list( x=.01, y=0.95, text=list(c("ARMA", "Buy-and-Hold")), lines=list(lwd=2, col=c("darkgreen", "darkblue"))))

**leave a comment**for the author, please follow the link and comment on their blog:

**Quintuitive » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...