# Machine Learning (XGBoost) Time-Series Classification Trading Strategy

**Posts on Matthew Smith R Shenanigans**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Using Machine Learning (ML) and past price data to predict the next periods price or direction in the stock market is not new, neither does it produce any meaningful predictions. In this post I *collapse* down a series of asset time series data into a simple classification problem and see if a Machine Learning model can do a better job at predicting the next periods direction. I apply a similar method here Time Series Classification Synthetic vs Real Financial Time Series. The **objective** and **strategy** is to invest in a single asset each day. The asset we invest in will be the asset which the Machine Learning model is most *confident* will go up in share value in the next period \(t+1\). Alternatively speaking, we invest in the asset in which the Machine Learning model gives the highest predicted probability that, a given asset will go up in value tomorrow. That is, if the model predicts on day \(t\) that asset GOOG was going to be higher than it’s previous close with a predicted probability of 0.78 and it also predicts that AMZN would go up with 0.53 probability then we would invest in GOOG today. –*we only invest in one asset each day*– . The model can be expanded to short selling and multi-asset purchasing and multi-periods.

## Load in the packages.

require(PerformanceAnalytics) library(data.table) library(dplyr) library(tibble) library(TTR) library(tidyr) library(tidyquant) library(tsfeatures) library(rsample) library(purrr) library(stringr) library(tibbletime) # tsibble clashes with the base R index() function library(xgboost) library(rvest)

Pre-define a few intialisation objects and set the ticker symbols of the companies we want to download. For this task I am not really interested in which companies I apply the strategy to. For this reason, I scrape the Wikipedia page for the S&P 500 and take a random sample of 30 tickers.

set.seed(1234) ###################### Pre-define functions for later ########################## Scale_Me <- function(x){ (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) } # Note: I don't actually use this function but I leave it in here. ################################################################################# start_date <- "2017-01-01" end_date <- "2020-01-01" url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies" symbols <- url %>% read_html() %>% html_nodes(xpath = '//*[@id="constituents"]') %>% html_table() %>% .[[1]] %>% filter(!str_detect(Security, "Class A|Class B|Class C")) %>% # Removes firms with Class A, B & C shares sample_n(30) %>% pull(Symbol) #symbols <- c( #'GOOG', 'MSFT', 'HOG', 'AAPL', 'FB' #'AMZN', 'EBAY', 'IBM', 'NFLX', 'NVDA', #'TWTR', 'WMT', 'XRX', 'INTC', 'HPE' # )

## The data

Download the data and store it into a new environment.

dataEnv <- new.env() getSymbols(symbols, from = start_date, to = end_date, #src = "yahoo", #adjust = TRUE, env = dataEnv) ## [1] "LEG" "NLSN" "SLB" "CHTR" "C" "REGN" "CCI" "SYK" "ROP" "RL" ## [11] "CERN" "CMG" "GS" "CAT" "MSI" "BR" "VRSK" "PNC" "KEYS" "PHM" ## [21] "FB" "BKR" "ABMD" "WYNN" "DG" "ADI" "GL" "TSCO" "FLS" "CDW"

Once the data has been downloaded and stored into a new environment I clean the data up a little, put all the lists into a single data frame, compute the daily returns for each asset and create the *up* or *down* direction which will be what the classification model will try to predict.

df <- eapply(dataEnv, function(x){ as.data.frame(x) %>% rename_all(function(n){ gsub("^(\\w+)\\.", "", n, perl = TRUE) } ) %>% rownames_to_column("date") }) %>% rbindlist(idcol = TRUE) %>% mutate(date = as.Date(date)) %>% group_by(.id) %>% tq_mutate( select = Adjusted, mutate_fun = periodReturn, period = "daily", type = "arithmetic" ) %>% mutate( Adj_lag = lag(Adjusted), chng_Adj = ifelse(Adjusted > Adj_lag, 1, 0) # more simply we could have just done if ret were pos/neg ) %>% select("date", ".id", "Adjusted", "daily.returns", "chng_Adj", "Open", "High", "Low", "Close") %>% as_tibble() %>% as_tbl_time(index = date) %>% setNames(c("date", "ID", "prc", "ret", "chng", "open", "high", "low", "close")) %>% drop_na(chng)

The first few observations of the data looks like:

date | ID | prc | ret | chng | open | high | low | close |
---|---|---|---|---|---|---|---|---|

2017-01-04 | CDW | 50.63446 | 0.0162981 | 1 | 51.79 | 52.60 | 51.79 | 52.38 |

2017-01-05 | CDW | 50.14147 | -0.0097364 | 0 | 52.10 | 52.66 | 51.84 | 51.87 |

2017-01-06 | CDW | 49.96746 | -0.0034704 | 0 | 51.89 | 51.95 | 51.46 | 51.69 |

We can use the `nest()`

function to put the data into convenient nested tibbles that we can simply `map()`

over and apply the `rolling_origin()`

function from the `rsample`

package such that, each of our assets will have their own `rolling_origin()`

function applied to it without any overlap or mixing of the asset classes, I do this in order to create the time series features for each period.

nested_df <- df %>% mutate(duplicate_ID = ID) %>% nest(-ID)

I split the time series data into a number of lists such that the `analysis()`

list contains 100 observations in each list and has a corresponding `assessment()`

list which contains 1 observation. Usually the `analysis()`

will become our training data set and the `assessment()`

will become our testing data set, however, here I am using the `rolling_origin()`

function to help create the time series features.

# First we set the number of days we want to construct the ts features rolled_df <- map(nested_df$data, ~ rolling_origin(., initial = 100, assess = 1, cumulative = FALSE, skip = 0))

## Time-Series Functions

In order to create the time series variables I use the `tsfeatures`

package but there is also the `feasts`

packages here. For this model I simply select a few functions of interest from the `tsfeatures`

package.

functions <- c( "entropy", # Measures the "forecastability" of a series - low values = high sig-to-noise, large vals = difficult to forecast "stability", # means/variances are computed for all tiled windows - stability is the variance of the means "lumpiness", # Lumpiness is the variance of the variances "max_level_shift", # Finds the largest mean shift between two consecutive windows (returns two values, size of shift and time index of shift) "max_var_shift", # Finds the max variance shift between two consecutive windows (returns two values, size of shift and time index of shift) "max_kl_shift", # Finds the largest shift in the Kulback-Leibler divergence between two consecutive windows (returns two values, size of shift and time index of shift) "crossing_points" # Number of times a series crosses the mean line )

I wrote this code a little while ago and at the time I wrapped the model into a function. I think it would be more fun to exclusively stick to using just `map()`

functions instead of `function(SYMB)`

. The function does the following for each asset in our data:

Using the out-of-sample `t+1`

(`assessment`

) data, bind these lists together into a single data frame. Next apply the `functions`

character string to call the functions from the `tsfeatures`

package, apply these functions to the in-sample (`analysis`

) data (which consists of 100 observations each), such that, we obtain a single *collapsed* down observation that we can just bind together. Finally we bind the columns of these two data sets together using `bind_cols()`

. After this I rename the `chng`

variable and rename the time series feature variables to something more dynamic using `~str_c("X", seq_along(.))`

so we can just add functions to the `functions`

character string and not have to worry about renaming the variables individually in order for the model to work.

Once this is done, I create the Machine Learning data set again using the `rolling_origin()`

function. The first `rolling_origin()`

function was used to help *collapse* the time series data down on a rolling basis by taking the previous 100 days of data and calculating the `tsfeatures`

function on it - *a similar method to calculating a rolling mean/sd using the rollapply() function from the zoo package*.

I next split the data into `X`

variables with `X_train`

and `X_test`

and the corresponding `Y`

variable with `Y_train`

and `Y_test`

. The package `xgboost`

expects a certain type of `xgb.DMatrix()`

which is what `dtrain`

and `dtest`

are doing.

Then, I set the XGBoost parameters and apply the XGBoost model. - *Suitable cross validation should be performed at this point, however I will leave this for another post since time series cross validation is quite tricky and there is no function in R which helps with this type of cross validation (that I have found as of 2020-02-02)* -

Once the model has been trained, I make the predictions.

## Apply the model

The function to compute all this is the following:

Prediction_Model <- function(SYMB){ data <- bind_cols( map(rolled_df[[SYMB]]$splits, ~ assessment(.x)) %>% bind_rows(), map(rolled_df[[SYMB]]$splits, ~ analysis(.x)) %>% map(., ~tsfeatures(.x[["ret"]], functions)) %>% # Compute the TSFeatures bind_rows() ) %>% rename(Y = chng) %>% rename_at(vars(-c(1:9)), ~str_c("X", seq_along(.))) ml_data <- data %>% as_tibble() %>% rolling_origin( initial = 200, assess = 1, cumulative = FALSE, skip = 0) X_train <- map( ml_data$splits, ~ analysis(.x) %>% as_tibble(., .name_repair = "universal") %>% select(starts_with("X")) %>% as.matrix() ) Y_train <- map( ml_data$splits, ~ analysis(.x) %>% as_tibble(., .name_repair = "universal") %>% select(starts_with("Y")) %>% as.matrix() ) X_test <- map( ml_data$splits, ~ assessment(.x) %>% as_tibble(., .name_repair = "universal") %>% select(starts_with("X")) %>% as.matrix() ) Y_test <- map( ml_data$splits, ~ assessment(.x) %>% as_tibble(., .name_repair = "universal") %>% select(starts_with("Y")) %>% as.matrix() ) ############################################################# dtrain <- map2( X_train, Y_train, ~ xgb.DMatrix(data = .x, label = .y, missing = "NaN") ) dtest <- map( X_test, ~ xgb.DMatrix(data = .x, missing = "NaN") ) # Parameters: watchlist <- list("train" = dtrain) params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1, "objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0, "colsample_bylevel" = 1, "eval_metric"= "auc", "set.seed" = 176) # Train the XGBoost model xgb.model <- map( dtrain, ~ xgboost(params = params, data = .x, nrounds = 10, watchlist) ) xgb.pred <- map2( .x = xgb.model, .y = dtest, .f = ~ predict(.x, newdata = .y, type = 'prob') ) preds <- cbind(plyr::ldply(xgb.pred, data.frame), plyr::ldply(Y_test, data.frame)) %>% setNames(c("pred_probs", "actual")) %>% bind_cols(plyr::ldply(map(ml_data$splits, ~assessment(.x)))) %>% rename(ID = duplicate_ID) %>% #select(pred_probs, actual, date, ID, prc, ret) %>% as_tibble() return(preds) }

We can apply the above model to create the time series features, train and test on each of our assets by running the following.

Sys_t_start <- Sys.time() Resultados <- lapply(seq(1:length(rolled_df)), Prediction_Model) Sys_t_end <- Sys.time() round(Sys_t_end - Sys_t_start, 2)

The `Resultados`

output will give us a list the length of the number of assets we have in our data. The first few observations of the first asset in the list looks like:

pred_probs | actual | date | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0.7304490 | 0 | 2018-03-15 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.277064 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |

0.5149571 | 1 | 2018-03-16 | 74.35286 | 0.0160795 | 1 | 74.78 | 75.99 | 73.04 | 75.83 | CDW | 0.9886519 | 0.0745409 | 0.8408280 | 1.273320 | 70 | 3.143027 | 62 | 3.227452 | 75 | 55 |

0.6207952 | 0 | 2018-03-19 | 72.53889 | -0.0243967 | 0 | 75.35 | 75.35 | 73.45 | 73.98 | CDW | 0.9902178 | 0.0901013 | 0.7192391 | 1.275344 | 69 | 3.153024 | 61 | 3.227452 | 74 | 55 |

Which consist of the XGBoost predicted probabilities, the actual observed result, the date of the result (of the out-of-sample testing data), the observed share price, the calculated daily returns, (a duplicate of the observed result), the OHLC data we collected from Yahoo and finally the time series features we constructed and then reneamed to \(X_{n}\)

The **objective** of this strategy was to invest every day in the asset which obtained the highest predicted probability that the market was going to go up. That is, if the model predicts on day \(t\) that asset `GOOG`

was going to be higher than it’s previous close with a predicted probability of 0.78 and it also predicts that `AMZN`

would go up with 0.53 probability then we would invest in `GOOG`

today. That is, we only invest in the asset with the highest predicted probability that the market is going to go up.

Therefore, I create a new data frame called `top_assets`

which basically gives me the highest predicted probability across all assets each day.

top_assets <- plyr::ldply(Resultados) %>% #select(pred_probs, actual, date, open, high, low, close, prc, ret) %>% group_by(date) %>% arrange(desc(pred_probs)) %>% dplyr::slice(1) %>% ungroup() %>% select(date, everything()) %>% rename(score = pred_probs) %>% select(-actual)

## Strategy Assessment

The first 10 days of the strategy investments look like:

date | score | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

2018-03-15 | 0.7304490 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.2770644 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |

2018-03-16 | 0.6899720 | 293.04999 | -0.0051601 | 0 | 295.13 | 297.61 | 289.27 | 293.05 | ABMD | 0.9918187 | 0.0769474 | 2.4417643 | 1.0584676 | 69 | 4.917599 | 62 | 7.861935 | 80 | 39 |

2018-03-19 | 0.7299674 | 101.46883 | -0.0081591 | 0 | 108.98 | 109.06 | 107.40 | 108.19 | CCI | 0.9894902 | 0.0705445 | 0.3788407 | 0.9924987 | 68 | 1.786445 | 5 | 1.402126 | 89 | 46 |

2018-03-20 | 0.7370850 | 60.44966 | 0.0132920 | 1 | 65.13 | 66.10 | 65.13 | 65.56 | SLB | 0.9830999 | 0.1717591 | 0.1725298 | 1.1379607 | 53 | 1.853699 | 9 | 1.739650 | 10 | 47 |

2018-03-21 | 0.7003193 | 334.51999 | 0.0448199 | 1 | 320.94 | 339.20 | 320.32 | 334.52 | CMG | 0.9532525 | 0.0669860 | 2.4899030 | 1.6249110 | 67 | 4.775679 | 73 | 12.454513 | 10 | 57 |

2018-03-22 | 0.7438304 | 87.40306 | -0.0243534 | 0 | 91.53 | 92.43 | 90.48 | 90.54 | ADI | 0.9796797 | 0.0855250 | 0.8606480 | 1.4206984 | 65 | 2.718067 | 64 | 8.589078 | 72 | 49 |

2018-03-23 | 0.6494384 | 237.70613 | -0.0290578 | 0 | 253.63 | 253.99 | 244.93 | 245.26 | GS | 0.9685330 | 0.0615987 | 0.7610257 | 1.1137175 | 63 | 3.588928 | 58 | 4.667287 | 71 | 50 |

2018-03-26 | 0.6868502 | 70.18565 | 0.0238877 | 1 | 70.74 | 71.71 | 69.71 | 71.58 | CDW | 0.9885363 | 0.1109430 | 0.5278402 | 1.1776141 | 64 | 2.688307 | 56 | 3.038961 | 69 | 56 |

2018-03-27 | 0.7274348 | 57.21453 | -0.0020772 | 0 | 58.13 | 58.41 | 57.16 | 57.65 | CERN | 0.9787359 | 0.1971365 | 0.2600627 | 1.0608221 | 62 | 2.325379 | 56 | 2.115355 | 69 | 52 |

2018-03-28 | 0.7031060 | 68.56778 | -0.0039882 | 0 | 70.36 | 70.39 | 69.26 | 69.93 | CDW | 0.9869206 | 0.1328377 | 0.5033828 | 1.1567020 | 62 | 2.593677 | 54 | 3.010604 | 67 | 56 |

We can see that the column `score`

is the probability for the asset with the highest predicted probability that it’s price was going to be greater than it’s previous close. The `ID`

column gives us the asset ticker we invest in.

Next I want to analyse the strategy of picking the *best* predicted winners against the S&P500 bench mark and therefore download the S&P 500 index.

top_assets <- xts(top_assets[,c(2:ncol(top_assets))], order.by = top_assets$date) # put top_assets into xts format # Analyse strategy getSymbols("SPY", from = start_date, to = end_date, src = "yahoo") ## [1] "SPY" #detach("package:tsibble", unload = TRUE) # tsibble clashes with the base R index() function SPY$ret_Rb <- Delt(SPY$SPY.Adjusted) SPY <- SPY[index(SPY) >= min(index(top_assets))] RaRb <- cbind(top_assets, SPY)

From here we can see how the strategy compares with the S&P 500. I show a number of statistics for analysing asset returns from the `PerformanceAnalytics`

package. I have not expanded the model to include short selling or construct multi-asset portfolios of the top \(N\) assets.

We can plot the performance of our strategy:

charts.PerformanceSummary(RaRb[, c("ret", "ret_Rb")], geometric = FALSE, wealth.index = FALSE, main = "Strategy vs. Market")

Take a look at the drawdown and risk metrics.

## ret ret_Rb ## Sterling ratio 0.1870 0.3879 ## Calmar ratio 0.2551 0.5884 ## Burke ratio 0.2251 0.5344 ## Pain index 0.0955 0.0283 ## Ulcer index 0.1189 0.0455 ## Pain ratio 0.7337 4.0290 ## Martin ratio 0.5891 2.5027 ## daily downside risk 0.0111 0.0066 ## Annualised downside risk 0.1768 0.1044 ## Downside potential 0.0054 0.0029 ## Omega 1.0722 1.1601 ## Sortino ratio 0.0351 0.0714 ## Upside potential 0.0058 0.0034 ## Upside potential ratio 0.7027 0.6124 ## Omega-sharpe ratio 0.0722 0.1601

Take a closer look at the drawdown information.

## $ret ## From Trough To Depth Length To Trough Recovery ## 1 2018-08-31 2019-01-03 2019-09-16 -0.2746 261 85 176 ## 2 2019-11-06 2019-12-03-0.1300 39 19 NA ## 3 2019-10-01 2019-10-18 2019-10-29 -0.0810 21 14 7 ## 4 2018-03-22 2018-04-20 2018-05-09 -0.0773 34 21 13 ## 5 2018-08-10 2018-08-15 2018-08-20 -0.0474 7 4 3 ## ## $ret_Rb ## From Trough To Depth Length To Trough Recovery ## 1 2018-09-21 2018-12-24 2019-04-12 -0.1935 140 65 75 ## 2 2019-05-06 2019-06-03 2019-06-20 -0.0662 33 20 13 ## 3 2018-03-15 2018-04-02 2018-06-04 -0.0610 56 12 44 ## 4 2019-07-29 2019-08-05 2019-10-25 -0.0602 64 6 58 ## 5 2018-06-13 2018-06-27 2018-07-09 -0.0300 18 11 7

Compare the returns.

chart.Boxplot(RaRb[,c("ret", "ret_Rb")], main = "Returns")

Compare return statistics.

table.Stats(RaRb[, c("ret", "ret_Rb")]) %>% t() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Observations | NAs | Minimum | Quartile 1 | Median | Arithmetic Mean | Geometric Mean | Quartile 3 | Maximum | SE Mean | LCL Mean (0.95) | UCL Mean (0.95) | Variance | Stdev | Skewness | Kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

ret | 453 | 0 | -0.0669 | -0.0068 | 0.0006 | 0.0004 | 0.0003 | 0.0087 | 0.0642 | 0.0007 | -0.0011 | 0.0018 | 0.0002 | 0.0156 | -0.2542 | 2.8842 |

ret_Rb | 453 | 0 | -0.0324 | -0.0030 | 0.0006 | 0.0005 | 0.0004 | 0.0054 | 0.0505 | 0.0004 | -0.0004 | 0.0013 | 0.0001 | 0.0091 | -0.2949 | 3.6264 |

Compare Sharpe Information.

lapply(RaRb[, c("ret", "ret_Rb")], function(x){SharpeRatio(x)}) ## $ret ## ret ## StdDev Sharpe (Rf=0%, p=95%): 0.025027498 ## VaR Sharpe (Rf=0%, p=95%): 0.015346462 ## ES Sharpe (Rf=0%, p=95%): 0.009618405 ## ## $ret_Rb ## ret_Rb ## StdDev Sharpe (Rf=0%, p=95%): 0.05152014 ## VaR Sharpe (Rf=0%, p=95%): 0.03218952 ## ES Sharpe (Rf=0%, p=95%): 0.01913213

Plot the Risk - Return scatter plot.

chart.RiskReturnScatter(RaRb[, c("ret", "ret_Rb")], # check this plot a little more Rf=.03/252, scale = 252, # for daily data main = "Risk - Return over the period")

Plot the rolling return, risk and Sharpe performance.

charts.RollingPerformance(RaRb[, c("ret", "ret_Rb")], Rf=.03/12, colorset = c("red", rep("darkgray",5), "orange", "green"), lwd = 2)

Compute the yearly returns.

lapply(RaRb[, c("ret")],function(x){periodReturn( x, period = 'yearly', type = 'arithmetic')}) # change type to log for continuous ## $ret ## yearly.returns ## 2018-12-31 -1.855083 ## 2019-12-31 -1.475181 lapply(RaRb[, c("ret_Rb")],function(x){periodReturn( x, period = 'yearly', type = 'arithmetic')}) ## $ret_Rb ## yearly.returns ## 2018-12-31 -9.0376638 ## 2019-12-31 -0.7226497

**leave a comment**for the author, please follow the link and comment on their blog:

**Posts on Matthew Smith R Shenanigans**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.