Machine Learning (XGBoost) Time-Series Classification Trading Strategy

[This article was first published on Posts on Matthew Smith R Shenanigans, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Using Machine Learning (ML) and past price data to predict the next periods price or direction in the stock market is not new, neither does it produce any meaningful predictions. In this post I collapse down a series of asset time series data into a simple classification problem and see if a Machine Learning model can do a better job at predicting the next periods direction. I apply a similar method here Time Series Classification Synthetic vs Real Financial Time Series. The objective and strategy is to invest in a single asset each day. The asset we invest in will be the asset which the Machine Learning model is most confident will go up in share value in the next period \(t+1\). Alternatively speaking, we invest in the asset in which the Machine Learning model gives the highest predicted probability that, a given asset will go up in value tomorrow. That is, if the model predicts on day \(t\) that asset GOOG was going to be higher than it’s previous close with a predicted probability of 0.78 and it also predicts that AMZN would go up with 0.53 probability then we would invest in GOOG today. –we only invest in one asset each day– . The model can be expanded to short selling and multi-asset purchasing and multi-periods.

Load in the packages.

require(PerformanceAnalytics)
library(data.table)
library(dplyr)
library(tibble)
library(TTR)
library(tidyr)
library(tidyquant)
library(tsfeatures)
library(rsample)
library(purrr)
library(stringr)
library(tibbletime) # tsibble clashes with the base R index() function
library(xgboost)
library(rvest)

Pre-define a few intialisation objects and set the ticker symbols of the companies we want to download. For this task I am not really interested in which companies I apply the strategy to. For this reason, I scrape the Wikipedia page for the S&P 500 and take a random sample of 30 tickers.

set.seed(1234)
###################### Pre-define functions for later ##########################

Scale_Me <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}                                  # Note: I don't actually use this function but I leave it in here.

#################################################################################

start_date <- "2017-01-01"
end_date <- "2020-01-01"

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
symbols <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="constituents"]') %>% 
  html_table() %>% 
  .[[1]] %>% 
  filter(!str_detect(Security, "Class A|Class B|Class C")) %>%     # Removes firms with Class A, B & C shares
  sample_n(30) %>% 
  pull(Symbol)


#symbols <- c(
  #'GOOG', 'MSFT', 'HOG', 'AAPL', 'FB' 
  #'AMZN', 'EBAY', 'IBM', 'NFLX', 'NVDA',
  #'TWTR', 'WMT', 'XRX', 'INTC', 'HPE'
# )

The data

Download the data and store it into a new environment.

dataEnv <- new.env()
getSymbols(symbols, 
           from = start_date, 
           to = end_date, 
           #src = "yahoo", 
           #adjust = TRUE, 
           env = dataEnv)

##  [1] "LEG"  "NLSN" "SLB"  "CHTR" "C"    "REGN" "CCI"  "SYK"  "ROP"  "RL"  
## [11] "CERN" "CMG"  "GS"   "CAT"  "MSI"  "BR"   "VRSK" "PNC"  "KEYS" "PHM" 
## [21] "FB"   "BKR"  "ABMD" "WYNN" "DG"   "ADI"  "GL"   "TSCO" "FLS"  "CDW"

Once the data has been downloaded and stored into a new environment I clean the data up a little, put all the lists into a single data frame, compute the daily returns for each asset and create the up or down direction which will be what the classification model will try to predict.

df <- eapply(dataEnv, function(x){
  as.data.frame(x) %>% 
    rename_all(function(n){
      gsub("^(\\w+)\\.", "", n, perl = TRUE)
    }
    ) %>%
    rownames_to_column("date")  
}) %>% 
  rbindlist(idcol = TRUE) %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(.id) %>% 
  tq_mutate(
    select = Adjusted,
    mutate_fun = periodReturn,
    period = "daily",
    type = "arithmetic"
  ) %>% 
  mutate(
    Adj_lag = lag(Adjusted),
    chng_Adj = ifelse(Adjusted > Adj_lag, 1, 0) # more simply we could have just done if ret were pos/neg
  ) %>% 
  select("date", ".id", "Adjusted", "daily.returns", "chng_Adj", "Open", "High", "Low", "Close") %>% 
  as_tibble() %>% 
  as_tbl_time(index = date) %>% 
  setNames(c("date", "ID", "prc", "ret", "chng", "open", "high", "low", "close")) %>% 
  drop_na(chng)

The first few observations of the data looks like:

date ID prc ret chng open high low close
2017-01-04 CDW 50.63446 0.0162981 1 51.79 52.60 51.79 52.38
2017-01-05 CDW 50.14147 -0.0097364 0 52.10 52.66 51.84 51.87
2017-01-06 CDW 49.96746 -0.0034704 0 51.89 51.95 51.46 51.69

We can use the nest() function to put the data into convenient nested tibbles that we can simply map() over and apply the rolling_origin() function from the rsample package such that, each of our assets will have their own rolling_origin() function applied to it without any overlap or mixing of the asset classes, I do this in order to create the time series features for each period.

nested_df <- df %>%
  mutate(duplicate_ID = ID) %>% 
  nest(-ID)

I split the time series data into a number of lists such that the analysis() list contains 100 observations in each list and has a corresponding assessment() list which contains 1 observation. Usually the analysis() will become our training data set and the assessment() will become our testing data set, however, here I am using the rolling_origin() function to help create the time series features.

# First we set the number of days we want to construct the ts features
rolled_df <- map(nested_df$data, ~ rolling_origin(.,
                                                  initial = 100,
                                                  assess = 1,
                                                  cumulative = FALSE,
                                                  skip = 0))

Time-Series Functions

In order to create the time series variables I use the tsfeatures package but there is also the feasts packages here. For this model I simply select a few functions of interest from the tsfeatures package.

functions <- c(
  "entropy",                   # Measures the "forecastability" of a series - low values = high sig-to-noise, large vals = difficult to forecast
  "stability",                 # means/variances are computed for all tiled windows - stability is the variance of the means
  "lumpiness",                 # Lumpiness is the variance of the variances
  "max_level_shift",           # Finds the largest mean shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_var_shift",             # Finds the max variance shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_kl_shift",              # Finds the largest shift in the Kulback-Leibler divergence between two consecutive windows (returns two values, size of shift and time index of shift)
  "crossing_points"            # Number of times a series crosses the mean line
)

I wrote this code a little while ago and at the time I wrapped the model into a function. I think it would be more fun to exclusively stick to using just map() functions instead of function(SYMB). The function does the following for each asset in our data:

Using the out-of-sample t+1 (assessment) data, bind these lists together into a single data frame. Next apply the functions character string to call the functions from the tsfeatures package, apply these functions to the in-sample (analysis) data (which consists of 100 observations each), such that, we obtain a single collapsed down observation that we can just bind together. Finally we bind the columns of these two data sets together using bind_cols(). After this I rename the chng variable and rename the time series feature variables to something more dynamic using ~str_c("X", seq_along(.)) so we can just add functions to the functions character string and not have to worry about renaming the variables individually in order for the model to work.

Once this is done, I create the Machine Learning data set again using the rolling_origin() function. The first rolling_origin() function was used to help collapse the time series data down on a rolling basis by taking the previous 100 days of data and calculating the tsfeatures function on it – a similar method to calculating a rolling mean/sd using the rollapply() function from the zoo package.

I next split the data into X variables with X_train and X_test and the corresponding Y variable with Y_train and Y_test. The package xgboost expects a certain type of xgb.DMatrix() which is what dtrain and dtest are doing.

Then, I set the XGBoost parameters and apply the XGBoost model. – Suitable cross validation should be performed at this point, however I will leave this for another post since time series cross validation is quite tricky and there is no function in R which helps with this type of cross validation (that I have found as of 2020-02-02)

Once the model has been trained, I make the predictions.

Apply the model

The function to compute all this is the following:

Prediction_Model <- function(SYMB){
  data <- bind_cols(
    map(rolled_df[[SYMB]]$splits, ~ assessment(.x)) %>%
      bind_rows(),
    map(rolled_df[[SYMB]]$splits, ~ analysis(.x)) %>%
      map(., ~tsfeatures(.x[["ret"]], functions)) %>%          # Compute the TSFeatures
      bind_rows()
  ) %>%
    rename(Y = chng) %>%
    rename_at(vars(-c(1:9)), ~str_c("X", seq_along(.)))        
  
  ml_data <- data %>% 
    as_tibble() %>% 
    rolling_origin(
      initial    = 200,
      assess     = 1,
      cumulative = FALSE,
      skip       = 0)
  
  X_train <- map(
    ml_data$splits, ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>% 
      as.matrix()
  )
  
  Y_train <- map(
    ml_data$splits, ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>% 
      as.matrix()
  )
  
  X_test <- map(
    ml_data$splits, ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>% 
      as.matrix()
  )
  
  Y_test <- map(
    ml_data$splits, ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>% 
      as.matrix()
  )
  
  #############################################################
  
  dtrain <- map2(
    X_train, Y_train, ~ xgb.DMatrix(data = .x, label = .y, missing = "NaN")
  )
  
  dtest <- map(
    X_test, ~ xgb.DMatrix(data = .x, missing = "NaN")
  )
  
  # Parameters:
  watchlist <- list("train" = dtrain)
  params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1,
                 "objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
                 "colsample_bylevel" = 1, "eval_metric"= "auc",
                 "set.seed" = 176)
  
  # Train the XGBoost model
  xgb.model <- map(
    dtrain, ~ xgboost(params = params, data = .x, nrounds = 10, watchlist)
  )
  
  xgb.pred <- map2(
    .x = xgb.model, 
    .y = dtest, 
    .f = ~ predict(.x, newdata = .y, type = 'prob')
  )
  
  preds <- cbind(plyr::ldply(xgb.pred, data.frame),
                 plyr::ldply(Y_test, data.frame)) %>% 
    setNames(c("pred_probs", "actual")) %>% 
    bind_cols(plyr::ldply(map(ml_data$splits, ~assessment(.x)))) %>% 
    rename(ID = duplicate_ID) %>% 
    #select(pred_probs, actual, date, ID, prc, ret) %>% 
    as_tibble()
  return(preds)
}

We can apply the above model to create the time series features, train and test on each of our assets by running the following.

Sys_t_start <- Sys.time()
Resultados <- lapply(seq(1:length(rolled_df)), Prediction_Model)
Sys_t_end <- Sys.time()
round(Sys_t_end - Sys_t_start, 2)

The Resultados output will give us a list the length of the number of assets we have in our data. The first few observations of the first asset in the list looks like:

pred_probs actual date prc ret Y open high low close ID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
0.7304490 0 2018-03-15 73.17622 -0.0106061 0 75.36 75.51 74.18 74.63 CDW 0.9870653 0.1149955 0.7047308 1.277064 71 3.161538 63 3.245055 76 56
0.5149571 1 2018-03-16 74.35286 0.0160795 1 74.78 75.99 73.04 75.83 CDW 0.9886519 0.0745409 0.8408280 1.273320 70 3.143027 62 3.227452 75 55
0.6207952 0 2018-03-19 72.53889 -0.0243967 0 75.35 75.35 73.45 73.98 CDW 0.9902178 0.0901013 0.7192391 1.275344 69 3.153024 61 3.227452 74 55

Which consist of the XGBoost predicted probabilities, the actual observed result, the date of the result (of the out-of-sample testing data), the observed share price, the calculated daily returns, (a duplicate of the observed result), the OHLC data we collected from Yahoo and finally the time series features we constructed and then reneamed to \(X_{n}\)

The objective of this strategy was to invest every day in the asset which obtained the highest predicted probability that the market was going to go up. That is, if the model predicts on day \(t\) that asset GOOG was going to be higher than it’s previous close with a predicted probability of 0.78 and it also predicts that AMZN would go up with 0.53 probability then we would invest in GOOG today. That is, we only invest in the asset with the highest predicted probability that the market is going to go up.

Therefore, I create a new data frame called top_assets which basically gives me the highest predicted probability across all assets each day.

top_assets <- plyr::ldply(Resultados) %>% 
  #select(pred_probs, actual, date, open, high, low, close, prc, ret) %>% 
  group_by(date) %>%
  arrange(desc(pred_probs)) %>%
  dplyr::slice(1) %>% 
  ungroup() %>% 
  select(date, everything()) %>% 
  rename(score = pred_probs) %>% 
  select(-actual)

Strategy Assessment

The first 10 days of the strategy investments look like:

date score prc ret Y open high low close ID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
2018-03-15 0.7304490 73.17622 -0.0106061 0 75.36 75.51 74.18 74.63 CDW 0.9870653 0.1149955 0.7047308 1.2770644 71 3.161538 63 3.245055 76 56
2018-03-16 0.6899720 293.04999 -0.0051601 0 295.13 297.61 289.27 293.05 ABMD 0.9918187 0.0769474 2.4417643 1.0584676 69 4.917599 62 7.861935 80 39
2018-03-19 0.7299674 101.46883 -0.0081591 0 108.98 109.06 107.40 108.19 CCI 0.9894902 0.0705445 0.3788407 0.9924987 68 1.786445 5 1.402126 89 46
2018-03-20 0.7370850 60.44966 0.0132920 1 65.13 66.10 65.13 65.56 SLB 0.9830999 0.1717591 0.1725298 1.1379607 53 1.853699 9 1.739650 10 47
2018-03-21 0.7003193 334.51999 0.0448199 1 320.94 339.20 320.32 334.52 CMG 0.9532525 0.0669860 2.4899030 1.6249110 67 4.775679 73 12.454513 10 57
2018-03-22 0.7438304 87.40306 -0.0243534 0 91.53 92.43 90.48 90.54 ADI 0.9796797 0.0855250 0.8606480 1.4206984 65 2.718067 64 8.589078 72 49
2018-03-23 0.6494384 237.70613 -0.0290578 0 253.63 253.99 244.93 245.26 GS 0.9685330 0.0615987 0.7610257 1.1137175 63 3.588928 58 4.667287 71 50
2018-03-26 0.6868502 70.18565 0.0238877 1 70.74 71.71 69.71 71.58 CDW 0.9885363 0.1109430 0.5278402 1.1776141 64 2.688307 56 3.038961 69 56
2018-03-27 0.7274348 57.21453 -0.0020772 0 58.13 58.41 57.16 57.65 CERN 0.9787359 0.1971365 0.2600627 1.0608221 62 2.325379 56 2.115355 69 52
2018-03-28 0.7031060 68.56778 -0.0039882 0 70.36 70.39 69.26 69.93 CDW 0.9869206 0.1328377 0.5033828 1.1567020 62 2.593677 54 3.010604 67 56

We can see that the column score is the probability for the asset with the highest predicted probability that it’s price was going to be greater than it’s previous close. The ID column gives us the asset ticker we invest in.

Next I want to analyse the strategy of picking the best predicted winners against the S&P500 bench mark and therefore download the S&P 500 index.

top_assets <- xts(top_assets[,c(2:ncol(top_assets))], order.by = top_assets$date) # put top_assets into xts format

# Analyse strategy
getSymbols("SPY", 
           from = start_date, 
           to = end_date, 
           src = "yahoo")

## [1] "SPY"

#detach("package:tsibble", unload = TRUE) # tsibble clashes with the base R index() function
SPY$ret_Rb <- Delt(SPY$SPY.Adjusted)
SPY <- SPY[index(SPY) >= min(index(top_assets))]

RaRb <- cbind(top_assets, SPY)

From here we can see how the strategy compares with the S&P 500. I show a number of statistics for analysing asset returns from the PerformanceAnalytics package. I have not expanded the model to include short selling or construct multi-asset portfolios of the top \(N\) assets.

We can plot the performance of our strategy:

charts.PerformanceSummary(RaRb[, c("ret", "ret_Rb")], geometric = FALSE, wealth.index = FALSE, 
                          main = "Strategy vs. Market")

Take a look at the drawdown and risk metrics.

##                             ret ret_Rb
## Sterling ratio           0.1870 0.3879
## Calmar ratio             0.2551 0.5884
## Burke ratio              0.2251 0.5344
## Pain index               0.0955 0.0283
## Ulcer index              0.1189 0.0455
## Pain ratio               0.7337 4.0290
## Martin ratio             0.5891 2.5027
## daily downside risk      0.0111 0.0066
## Annualised downside risk 0.1768 0.1044
## Downside potential       0.0054 0.0029
## Omega                    1.0722 1.1601
## Sortino ratio            0.0351 0.0714
## Upside potential         0.0058 0.0034
## Upside potential ratio   0.7027 0.6124
## Omega-sharpe ratio       0.0722 0.1601

Take a closer look at the drawdown information.

## $ret
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-08-31 2019-01-03 2019-09-16 -0.2746    261        85      176
## 2 2019-11-06 2019-12-03       <NA> -0.1300     39        19       NA
## 3 2019-10-01 2019-10-18 2019-10-29 -0.0810     21        14        7
## 4 2018-03-22 2018-04-20 2018-05-09 -0.0773     34        21       13
## 5 2018-08-10 2018-08-15 2018-08-20 -0.0474      7         4        3
## 
## $ret_Rb
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-09-21 2018-12-24 2019-04-12 -0.1935    140        65       75
## 2 2019-05-06 2019-06-03 2019-06-20 -0.0662     33        20       13
## 3 2018-03-15 2018-04-02 2018-06-04 -0.0610     56        12       44
## 4 2019-07-29 2019-08-05 2019-10-25 -0.0602     64         6       58
## 5 2018-06-13 2018-06-27 2018-07-09 -0.0300     18        11        7

Compare the returns.

chart.Boxplot(RaRb[,c("ret", "ret_Rb")],  main = "Returns")

Compare return statistics.

table.Stats(RaRb[, c("ret", "ret_Rb")]) %>% 
  t() %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Observations NAs Minimum Quartile 1 Median Arithmetic Mean Geometric Mean Quartile 3 Maximum SE Mean LCL Mean (0.95) UCL Mean (0.95) Variance Stdev Skewness Kurtosis
ret 453 0 -0.0669 -0.0068 0.0006 0.0004 0.0003 0.0087 0.0642 0.0007 -0.0011 0.0018 0.0002 0.0156 -0.2542 2.8842
ret_Rb 453 0 -0.0324 -0.0030 0.0006 0.0005 0.0004 0.0054 0.0505 0.0004 -0.0004 0.0013 0.0001 0.0091 -0.2949 3.6264

Compare Sharpe Information.

lapply(RaRb[, c("ret", "ret_Rb")], function(x){SharpeRatio(x)})

## $ret
##                                       ret
## StdDev Sharpe (Rf=0%, p=95%): 0.025027498
## VaR Sharpe (Rf=0%, p=95%):    0.015346462
## ES Sharpe (Rf=0%, p=95%):     0.009618405
## 
## $ret_Rb
##                                   ret_Rb
## StdDev Sharpe (Rf=0%, p=95%): 0.05152014
## VaR Sharpe (Rf=0%, p=95%):    0.03218952
## ES Sharpe (Rf=0%, p=95%):     0.01913213

Plot the Risk – Return scatter plot.

chart.RiskReturnScatter(RaRb[, c("ret", "ret_Rb")],  # check this plot a little more
                        Rf=.03/252, scale = 252, # for daily data
                        main = "Risk - Return over the period")

Plot the rolling return, risk and Sharpe performance.

charts.RollingPerformance(RaRb[, c("ret", "ret_Rb")],      
                          Rf=.03/12, 
                          colorset = c("red", rep("darkgray",5), "orange", "green"), lwd = 2)

Compute the yearly returns.

lapply(RaRb[, c("ret")],function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')})        # change type to log for continuous

## $ret
##            yearly.returns
## 2018-12-31      -1.855083
## 2019-12-31      -1.475181

lapply(RaRb[, c("ret_Rb")],function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')})

## $ret_Rb
##            yearly.returns
## 2018-12-31     -9.0376638
## 2019-12-31     -0.7226497

To leave a comment for the author, please follow the link and comment on their blog: Posts on Matthew Smith R Shenanigans.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)