Parallelized Back Testing

As mentioned earlier, I am currently experimenting with trading strategies based on Support Vector Machines. At a high level, the approach is quite similar to what I implemented for my ARMA+GARCH strategy. Briefly, the simulation goes as follows: we step through the series one period (day, week, etc.) at a time. For each period, we use a history of pre-defined length to determine the best model parameters. Then, using these parameters, we forecast the next period.
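In code, one pass of this simulation is essentially the following loop. This is a minimal sketch: fitBestModel and forecastNext are hypothetical placeholders for the parameter search and the prediction, not functions from any package.

walkForward = function( series, history=500 )
{
   len = NROW( series )

   forecasts = rep( NA, len )

   for( ii in (history + 1):len )
   {
      # Use only the trailing window of pre-defined length
      window = series[(ii - history):(ii - 1)]

      # Select the best model parameters on the window, then
      # forecast the next period with them
      forecasts[ii] = forecastNext( fitBestModel( window ), window )
   }

   return( forecasts )
}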

There are two ways to parallelize this process. The approach taken in the ARMA+GARCH system parallelizes the model selection (done for each period), while stepping through the series sequentially. There were good reasons for doing it this way: for instance, the garchSearch function is usable standalone. Another reason was probably that I was dumb enough to hope that garchSearch may get picked up and added to a GARCH package (fGarch preferably), so that I don't have to maintain it myself. :)

In any case, this is definitely the less efficient approach. In parallel execution, it pays off to increase the size of the tasks executed in parallel. The more efficient alternative in our case is to step through the series in parallel, with each task doing the fitting for all possible models for a given period. A different way to think about it is how many tasks will be executed in total. The ARMA+GARCH system executes inputSeriesLength*numberOfModels tasks, while the second approach executes only inputSeriesLength tasks. Since the total amount of work is the same, the latter approach certainly has bigger tasks, and thus less overhead for starting and stopping tasks.
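To make the contrast concrete, here is a sketch of where the mclapply call goes in each scheme. Again, fitOneModel, pickBest and forecastNext are hypothetical placeholders, and modelSpecs stands for the list of candidate parameter combinations.

require( parallel )

# Scheme 1 (the ARMA+GARCH system): parallelize the model search
# within each period - inputSeriesLength*numberOfModels small tasks.
fitBestModelParallel = function( window, modelSpecs, cores=2 )
{
   fits = mclapply( modelSpecs, fitOneModel, window=window, mc.cores=cores )
   return( pickBest( fits ) )
}

# Scheme 2 (this post): parallelize over the periods themselves, each
# task fitting all models sequentially - only inputSeriesLength tasks.
walkForwardParallel = function( series, modelSpecs, history=500, cores=2 )
{
   forecastOnePeriod = function( ii )
   {
      window = series[(ii - history):(ii - 1)]
      fits = lapply( modelSpecs, fitOneModel, window=window )
      return( forecastNext( pickBest( fits ), window ) )
   }

   return( mclapply( seq( history + 1, NROW( series ) ),
                     forecastOnePeriod,
                     mc.cores=cores ) )
}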

This approach has another advantage as well: it works better with the way SVMs, Neural Networks and other similar machinery are implemented in R. For instance, the e1071 package comes with a tune function which already does the tuning over a predefined set of models. Thus, if we want to make use of the provided tune, we need to parallelize the tune invocations.
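As a quick, self-contained illustration of tune (on the built-in iris data, using the same gamma/cost grid and cross validation settings as the driver below):

require( e1071 )

# Grid search over gamma and cost, scored by 10-fold cross validation
tuned = tune( svm,
              Species ~ .,
              data=iris,
              ranges=list( gamma=10^(-5:-1), cost=10^(0:2) ),
              tunecontrol=tune.control( sampling="cross", cross=10 ) )

tuned$best.parameters    # the winning gamma/cost combination
tuned$best.performance   # the cross validation error of the winner
tuned$best.model         # an svm fit using the winning parameters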

Let's take a look at the driver code for an SVM (the comments should be helpful in understanding the code):

svmComputeForecasts = function(
      data,
      response,
      history=500,
      modelPeriod="days",
      modelPeriodMultiple=1,
      trace=TRUE,
      startDate,
      endDate,
      # All following parameters are system (SVM, Neural Network, etc) dependent
      kernel="radial",
      gamma=10^(-5:-1),
      cost=10^(0:2),
      sampling="cross",
      cross=10,
      selectFeatures=TRUE,
      cores)
{
   require( e1071 )
   require( parallel )   # provides mclapply
   require( xts )        # provides endpoints and reclass; loads zoo for index and na.trim

   stopifnot( NROW( data ) == NROW( response ) ) 

   len = NROW( response )

   # Determine the starting index
   if( !missing( startDate ) ) 
   {   
      startIndex = max( len - NROW( index( data[paste( sep="", startDate, "/" )] ) ) + 1,
                        history + 2 ) 
   }   
   else
   {   
      startIndex = history + 2 
   }   

   # Determine the ending index
   if( missing( endDate ) ) 
   {   
      lastIndex = len 
   }   
   else
   {   
      lastIndex = NROW( index( data[paste( sep="", "/", endDate )] ) ) 
   }   

   if( startIndex > lastIndex )
   {   
      return( NULL )
   }   

   modelPeriod = tolower( modelPeriod[1] )

   # Get the interesting indexes
   periods = index(data)[startIndex:lastIndex]

   # Compute the end points for each period (day, week, month, etc)
   endPoints = endpoints( periods, modelPeriod, modelPeriodMultiple )

   # Compute the starting points of each period, relative to the *data* index
   startPoints = endPoints + startIndex

   # Remove the last start point - it's outside
   length(startPoints) = length(startPoints) - 1

   # Make the end points relative to the *data* index
   endPoints = endPoints + startIndex - 1

   # Remove the first end point - it's always zero
   endPoints = tail( endPoints, -1 )

   stopifnot( length( endPoints ) == length( startPoints ) )

   # Initialize the number of cores
   if( missing( cores ) ) {
      cores = 1
   }

   res = mclapply( seq(1,length(startPoints)),
                   svmComputeOneForecast,
                   data=data,   # System dependent 
                   response=response,
                   startPoints=startPoints,
                   endPoints=endPoints,
                   len=len,
                   history=history,
                   trace=trace,
                   kernel=kernel,
                   gamma=gamma,
                   cost=cost,
                   selectFeatures=selectFeatures,
                   mc.cores=cores )

   # Vectors to summarize the results
   forecasts = rep( NA, len )
   gammas = rep( NA, len )
   costs = rep( NA, len )
   performances = rep( NA, len )
   features = rep( NA, len )

   # Format the results
   for( ll in res )
   {
      # Prepare the indexes 
      ii = ll[["index"]]
      jj = ii + NROW( ll[["forecasts"]] ) - 1

      # Copy the output
      forecasts[ii:jj] = ll[["forecasts"]]
      gammas[ii:jj] = ll[["gamma"]]
      costs[ii:jj] = ll[["cost"]]
      performances[ii:jj] = ll[["performance"]]

      # Encode the participating features as a bit mask stored in a single
      # integer. This representation limits us to max 32 features.
      features[ii:jj] = sum( 2^( ll[["features"]] - 1 ) )
   }

   sigUp = ifelse( forecasts >= 0, 1, 0 )
   sigUp[is.na( sigUp )] = 0

   sigDown = ifelse( forecasts < 0, -1, 0 )
   sigDown[is.na( sigDown)] = 0

   sig = sigUp + sigDown

   res = merge( reclass( sig, response ),
                reclass( sigUp, response ),
                reclass( sigDown, response ),
                na.trim( reclass( forecasts, response ) ),
                reclass( performances, response ),
                reclass( gammas, response ),
                reclass( costs, response ),
                reclass( features, response ),
                all=F )
   colnames( res ) = c( "Indicator",
                        "Up",
                        "Down",
                        "Forecasts",
                        "Performance",
                        "Gamma",
                        "Cost",
                        "Features" )

   return( res )
}
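As an aside, the single-integer feature mask can be decoded back into the feature indexes with base R's bitwAnd, for example:

# Invert the bit-mask encoding used above: recover the feature
# indexes from a mask computed as sum( 2^( features - 1 ) )
decodeFeatures = function( mask )
{
   return( which( bitwAnd( mask, 2^(0:30) ) != 0 ) )
}

decodeFeatures( 11 )   # 2^0 + 2^1 + 2^3, i.e. features 1, 2 and 4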

The pattern that I want to emphasize culminates in the call to mclapply. Each instance of the function executed in parallel (svmComputeOneForecast in this example) needs an index identifying the period it has to operate on. This is precisely what seq(1,length(startPoints)) provides: each task is passed a unique index, and the period boundaries are then retrieved via startPoints[uniqueIndex] and endPoints[uniqueIndex], respectively.
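The post doesn't include svmComputeOneForecast itself, but its skeleton can be reconstructed from the call site. The following is only a sketch: the exact window offsets are guesses, the feature selection step is omitted, and the returned list simply mirrors what the result-formatting loop above expects.

svmComputeOneForecast = function(
      id,            # the unique period index passed in by mclapply
      data,
      response,
      startPoints,
      endPoints,
      len,
      history=500,
      trace=TRUE,
      kernel="radial",
      gamma=10^(-5:-1),
      cost=10^(0:2),
      selectFeatures=TRUE )
{
   require( e1071 )

   # Recover the period boundaries from the unique index
   startIndex = startPoints[id]
   endIndex = endPoints[id]

   # Tune all candidate models on the trailing history window
   tuned = tune( svm,
                 train.x=data[(startIndex - history):(startIndex - 1),],
                 train.y=response[(startIndex - history):(startIndex - 1)],
                 kernel=kernel,
                 ranges=list( gamma=gamma, cost=cost ),
                 tunecontrol=tune.control( sampling="cross", cross=10 ) )

   # Forecast every point of this period with the winning model
   forecasts = predict( tuned$best.model, data[startIndex:endIndex,] )

   if( trace )
   {
      cat( "done with the period starting at index", startIndex, "\n" )
   }

   # Feature selection omitted - report all columns as participating
   return( list( index=startIndex,
                 forecasts=forecasts,
                 gamma=tuned$best.parameters$gamma,
                 cost=tuned$best.parameters$cost,
                 performance=tuned$best.performance,
                 features=seq_len( NCOL( data ) ) ) )
}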

I used this approach in an earlier implementation of the ARMA+GARCH system, and now I am using it for the SVM system. Most likely I will use the same approach for Neural Networks, Random Forests and everything else on my to-test list. I thought it may be helpful to others too.
