Lately I have been working on a trading system based on Support Vector Machine (SVM) regression (and yes, in case you wonder, a few posts are planned to share the results). In this post, however, I want to share an interesting problem I had to deal with.
A few days ago, I started running simulations using my latest code over years of historical data. Everything seemed to work fine, and the results looked promising. Experience has taught me that there is no such thing as too much testing when it comes to trading strategy simulations, in other words, when your money is on the line. So I decided to run one more test, which I usually do: confirm that simulations with the same parameters produce the same indicator.
By now it is probably clear that the result of this test was negative: the repeated simulation yielded slightly different results.
It quickly became clear that randomness is used in the SVM tuning. My first guess was that some default function parameter allows for randomness. Wrong. A quick check of the relevant functions from the e1071 package (svm, tune and tune.control) showed that all seed/probability parameters were turned off or initialized in such a way that stable results are to be expected.
A quick look at the package code showed that the likely source of the different results is a call to sample. That was helpful, because it led me to ask: what random seed is sample using? Aha, it is the default random seed of the process. Do you see the problem now?
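To make the point about sample concrete, here is a minimal sketch. It is not taken from e1071; the variable names are purely illustrative. It only shows that sample draws from whatever state the process-wide generator is in, so the result is repeatable only when that state is.

```r
# sample() has no seed argument of its own: it uses the current state of the
# process-wide random number generator.
set.seed(42)
first <- sample(10)

# Resetting the generator to the same seed reproduces the same permutation.
set.seed(42)
second <- sample(10)

same_after_reseed <- identical(first, second)
print(same_after_reseed)
```

Anything that leaves the generator in a different state before the call, including the start of a new process, changes what sample returns.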
Remember, I am executing the tuning in parallel, using the mclapply function from the parallel package. The function documentation mentions that the random seed in each new process is initialized by using the current seed. That was it: the random seed at the beginning of different calls to my tuning wrapper is different, so I was getting different results. The fix was to set the same starting seed at the beginning of my tuning wrapper.
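A minimal sketch of that fix follows. The function tune_wrapper and its seed argument are hypothetical stand-ins for my real wrapper, and the call to sample only mimics the random partitioning that happens inside the tuning. The core count falls back to 1 on Windows, where forking is not available.

```r
library(parallel)

# Hypothetical stand-in for the tuning wrapper. The real one would call
# tune()/svm(); here sample() plays the role of the random split.
tune_wrapper <- function(i, seed = 1234) {
  set.seed(seed)        # same starting seed at the beginning of every call
  folds <- sample(10)   # random partition, now repeatable
  sum(folds[1:3]) + i   # dummy "result" that depends on the random split
}

# mclapply forks only on Unix-alikes; use a single core on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

run1 <- mclapply(1:4, tune_wrapper, mc.cores = cores)
run2 <- mclapply(1:4, tune_wrapper, mc.cores = cores)

print(identical(run1, run2))
```

Because the seed is set inside the wrapper, the result for a given input no longer depends on which process handles it or on what ran before it.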
Actually, mclapply provides an mc.set.seed argument, but it takes only TRUE/FALSE, not a numeric value used to initialize the random seed. I thought there was room for improvement in the API here, so I suggested it to the R developers. They didn’t like the idea enough, so beware next time you run randomized simulations in parallel. They also suggested an alternative solution which will work in most situations:
set.seed( 1234 )
mclapply( ..., mc.set.seed=TRUE, ... )
This solution works as long as the input data is the same and a process that finishes always picks the next index. It won’t work if the inputs for each process are queued in advance. It won’t work for my purposes either, because the results are likely to differ for a given date, say Nov 11, 2012, between two simulations starting at different points in time, say one starting in the 1950s and one in the 1960s.
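One caveat on the alternative: as far as I can tell from the parallel documentation (see the notes under mcparallel), mc.set.seed=TRUE gives repeatable streams only when the L'Ecuyer-CMRG generator has been selected. With the default generator, each child is reseeded unpredictably. The sketch below works under that assumption; the function draw is a hypothetical placeholder for the real work.

```r
library(parallel)

# Assumption: reproducibility with mc.set.seed = TRUE relies on this generator.
RNGkind("L'Ecuyer-CMRG")

# mclapply forks only on Unix-alikes; use a single core on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical placeholder for a randomized task.
draw <- function(i) sample(100, 3)

set.seed(1234)
run1 <- mclapply(1:4, draw, mc.cores = cores, mc.set.seed = TRUE)

set.seed(1234)
run2 <- mclapply(1:4, draw, mc.cores = cores, mc.set.seed = TRUE)

print(identical(run1, run2))
```

Even so, the random numbers a task sees depend on its position in the input and on how tasks are distributed across processes, which is why this approach does not help when the same date has to give the same result regardless of where the simulation starts.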