**Data Shenanigans » R**, and kindly contributed to R-bloggers)

You could say that the following post is an answer/comment/addition to Quintuitive, though I would consider it as a small introduction to parallel computing with *snowfall* using the thoughts of Quintuitive as an example.

A quick recap: Say you create a model that is able to forecast 60% of market directions (that is, in 6 out of 10 cases you can predict the direction of the market or an asset for a respective time-period), how well would you do using this strategy? Hint: if you have a model that is able to predict 60% out-of-sample (think test dataset instead of training dataset) please make sure to share the strategy with me! 🙂

There are multiple ways to do it, I will show you how to simulate multiple cases using real-life financial data from the German Dax index, Monte-Carlo techniques, and parallel computing using the *snowfall*-package of the R language.

The piece is structured as follows:

- Load financial data using quantmod
- Show one simulation case with a probability of 51%
- Simulate n cases for one probability and average the result
- Simulate k different cases with different probabilities
- Use the snowfall-package to use parallel techniques

# 1. Load Financial Data using Quantmod

To have some data we look at the German Dax index (*^GDAXI*), which consists of the 30 largest companies in Germany. We can load the data and transform it to a dataframe like this:

# load packages library(quantmod) # download DAX data from Yahoo dax <- getSymbols("^GDAXI", from = "2000-01-01", auto.assign = F) # create a data.frame Sdata, that contains the stock data Sdata <- data.frame(date = index(dax), price = as.numeric(Ad(dax))) # calculate returns Sdata$rets <- Sdata$price / c(NA, Sdata$price[1:(nrow(Sdata) - 1)]) - 1 head(Sdata) # date price rets # 1 2010-01-04 6048.30 NA # 2 2010-01-05 6031.86 -0.0027181096 # 3 2010-01-06 6034.33 0.0004095279 # 4 2010-01-07 6019.36 -0.0024808413 # 5 2010-01-08 6037.61 0.0030318839 # 6 2010-01-11 6040.50 0.0004786889 # create a first plot to have a look at the price development plot(x = Sdata$date, y = Sdata$price, type = "l", main = "Dax Development")

# 2. Show one simulation case with a probability of 51%

Now that we have some data, we create a function *get.fprice* that takes in three arguments: the returns of an asset, the percentage of right predictions, and an initial price of the investment (or just the first price of the benchmark). The function returns a vector of values of the investment.

get.fprice <- function(rets, perc.right, init.price){ # 1. sample the goodness of the returns good.forecast <- sample(x = c(T, F), size = length(rets), prob = c(perc.right, 1 - perc.right), replace = T) # 2. get the forecasted directions, the same as the true rets if good.forecast = T dir <- ifelse(rets > 0, 1, -1) forecast.dir <- ifelse(good.forecast, dir, -dir) # if the percentage sampled should be displayed # mean(dir == forecast.dir, na.rm = T) # 3. calculate the return of the forecast forecast.ret <- forecast.dir * rets # 4. calculate the prices forecast.price <- cumprod(1 + forecast.ret[2:length(forecast.ret)]) * init.price forecast.price <- c(init.price, forecast.price) return(forecast.price) }

With this function we are able to create a single simulation for one probability.

# set a seed for reproducability set.seed(42) # simulate one series of prices Sdata$fprice <- get.fprice(rets = Sdata$rets, perc.right = 0.51, init.price = Sdata$price[1]) # plot the two developments plot(x = Sdata$date, y = Sdata$price, type = "l", main = "Dax vs. Forecast", xlab = "Date", ylab = "Price", ylim = c(0, max(Sdata$price, Sdata$fprice))) lines(x = Sdata$date, y = Sdata$fprice, col = "red") legend("topleft", c("Dax", paste(perc.right, "forecast")), col = 1:2, lty = 1)

If you play around with the seed, you will see that not all cases show a similar picture. This is something inherent to simulations. To avoid being overly dependent on the seed value for the random number generation, we use the law of large numbers and simulate not one but multiple cases and use the average as a result.

# 3. Simulate n cases for one probability and average the result

Monte-Carlo based simulations are multiple simulation of random developments. With the next code-snippet we can simulate n cases per probability (i.e., we take n simulated paths and average them).

First we create a function *get.fprice.n* that works like the *get.fprice*-function, but creates not one but n = 10,000 cases using the apply-function family.

get.fprice.n <- function(rets, perc.right, init.price, n){ # create a function that produces the goodness of the forecast get.good.forecast <- function(x){ good.forecast <- sample(x = c(T, F), size = length(rets), prob = c(perc.right, 1 - perc.right), replace = T) return(good.forecast) } # 1. sample the goodness of the returns good.forecasts <- sapply(1:n, get.good.forecast) # 2. get the forecasted directions, the same as the true rets if good.forecast = T dir <- ifelse(rets > 0, 1, -1) forecast.dirs <- apply(good.forecasts, 2, function(x) { ifelse(x, dir, -dir) }) # 3. calculate the return of the forecast forecast.rets <- forecast.dirs * rets # 4. calculate the prices forecast.prices <- apply(forecast.rets, 2, function(x) { cumprod(1 + x[2:length(x)]) * init.price }) forecast.prices <- rbind(rep(init.price, ncol(forecast.prices)), forecast.prices) # collapse the n simulations to just one by taking the average forecast.price <- apply(forecast.prices, 1, mean) return(forecast.price) } # simulate 10.000 cases # set a seed for reproducability, # should not matter due to the Law Of Large Numbers set.seed(42) # simulate 10.000 series of prices t <- Sys.time() Sdata$fprice <- get.fprice.n(rets = Sdata$rets, perc.right = 0.51, init.price = Sdata$price[1], n = 10000) Sys.time() - t # takes 5.69257 seconds on my machine # plot the two developments plot(x = Sdata$date, y = Sdata$price, type = "l", main = "Dax vs. Forecasts", xlab = "Date", ylab = "Price", ylim = c(0, max(Sdata$price, Sdata$fprice))) lines(x = Sdata$date, y = Sdata$fprice, col = "red") legend("topleft", c("Dax", "aggregated forecasts"), col = 1:2, lty = 1)

# 4. Simulate k different cases with different probabilities

You may want to compare not just one case with 51% but maybe 2, 5 or 100 cases (k) if the time allows it. The following section creates k = 4 cases that range from 45% to 55% of directions predicted correctly. Each case is simulated n = 10.000 times and then averaged.

k <- 4 # the percentages that will be used later on perc.right <- seq(from = 0.45, to = 0.55, length.out = k) # [1] 0.4500000 0.4833333 0.5166667 0.5500000 # simulate k cases n times, equals 40.000 times t <- Sys.time() forecasted.prices <- sapply(perc.right, function(x) { get.fprice.n(rets = Sdata$rets, perc.right = x, init.price = Sdata$price[1], n = 10000) }) Sys.time() - t # takes 21.592 seconds on my machine # plot the results plot(x = Sdata$date, y = Sdata$price, type = "l", main = "Dax vs. Forecasts", xlab = "Date", ylab = "Price", ylim = c(0, max(forecasted.prices, Sdata$price))) for (i in 1:k){ lines(x = Sdata$date, y = forecasted.prices[, i], col = (i + 1)) } legend("topleft", c("Dax", paste0("P = ", round(perc.right, 2), sep = "")), col = 1:(k + 1), lty = 1, cex = 0.75)

# 5. Use the snowfall-package to use parallel techniques

Now to the interesting part – using parallel techniques to reduce the computation time (time to beat: ~21 seconds). There are many packages out there in the “R-osphere”. I found the *snowfall* package most useful and easiest to use. If you want to get a broader picture of the possibilities you should head over to Ryan’s slides, which give a comprehensive analysis of high-performance computing in R.

My simplified understanding of parallel computing so far is that there is one master node (think of it as a departement head that coordinates the workers) and multiple slave nodes (the workers). The master node then distributes tasks to each slave that computes the result and sends it back to the master.

In snowfall we first have to specify the number of slaves by initiating the clusters with the *sfInit* command. As each slave starts a new environment, we have to export the libraries, functions, and datasets that the slaves will use by calling the *sfExport*-function (alternative we can call *sfExportAll*, which exports all elements in the current environment). *Snowfall* has it’s own set of functions that work very similar to the apply-family, which we call to hand over the computation to the master node. To be more precise we call *sfClusterApplyLB* (LB stands for load balancing). As a last step we have to stop the cluster by calling *sfStop*. Easy, right?

The full snowfall code looks like this:

# load the library library(snowfall) # initiate the data k <- 4 perc.right <- seq(from = 0.45, to = 0.55, length.out = k) # initiate the clusters, in this case 4 slaves sfInit(parallel = T, cpus = 4) # if you are unsure how many cpus you can use, try # ?parallel::detectCores # parallel::detectCores(logical = T) # export the necessary data and functions to each cluster sfExport("get.fprice.n", "Sdata") t <- Sys.time() # use the snowfall cluster-apply, which works similar to sapply # note that I used sfClusterApplyLB and not sfClusterApply # the LB stands for Load-Balancing result <- sfClusterApplyLB(perc.right, function(x){ get.fprice.n(rets = Sdata$rets, perc.right = x, init.price = Sdata$price[1], n = 10000) }) Sys.time() - t # 9.861 seconds # IMPORTANT: Stop the cluster so it doesn't messes around in the background sfStop() # have a look at the data str(result) # List of 4 # $ : num [1:1463] 6048 6047 6046 6045 6043 ... # $ : num [1:1463] 6048 6048 6048 6047 6047 ... # $ : num [1:1463] 6048 6049 6049 6049 6050 ... # $ : num [1:1463] 6048 6050 6050 6052 6053 ... # convert to a data.frame forecasted.prices <- data.frame(matrix(unlist(result), ncol = length(result))) head(forecasted.prices) # X1 X2 X3 X4 # 1 6048.300 6048.300 6048.300 6048.300 # 2 6046.659 6047.672 6048.888 6050.085 # 3 6046.424 6047.589 6048.989 6050.323 # 4 6045.218 6047.276 6049.403 6051.897 # 5 6043.136 6046.546 6049.886 6053.465 # 6 6042.890 6046.399 6049.992 6053.763 # and finally the last plot plot(x = Sdata$date, y = Sdata$price, type = "l", main = "Snowfall Forecasts", xlab = "Date", ylab = "Price", ylim = c(0, max(forecasted.prices, Sdata$price))) for (i in 1:k){ lines(x = Sdata$date, y = forecasted.prices[, i], col = (i + 1)) } legend("topleft", c("Dax", paste0("P = ", round(perc.right, 2), sep = "")), col = 1:(k + 1), lty = 1, cex = 0.75)

The plot shows basically the same picture compared to the last picture. Everything else would be very strange.

Using parallel techniques, we can reduce the time for the computations from 21 seconds to 9 seconds. If we increase the number of cases (n = 100,000 instead of 10,000) then we should see that the parallel computation takes only a quarter of the time with 4 clusters.

# Summary

A few lines of code shows us that even a small advantage (being able to forecast 5%p more correct Dax movements than a coin) can lead to monstrous returns of roughly 30% p.a. (CAGR of an initial investment of 6,050 and a value of 23,480 after 5.5 years).

However, the take-away should be that if you find such a model, make sure that you backtest the s**** out of it before telling anyone about it. Ex post, everyone is a billionaire.

Another take-away and the purpose of this post is how to use Monte-Carlo and parallel computing techniques to create a simple and fast simulation to check something.

# Outro

I am not an expert on parallel computing, so should you find an error in the explanation or in the code, please leave a comment and I will happily correct the error.

Furthermore, if you have any questions, comments or ideas, please leave a comment or send me an email.

**leave a comment**for the author, please follow the link and comment on their blog:

**Data Shenanigans » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...