Taming Volatility: High-Performance Forecasting of the STOXX 600 with H2O AutoML

[This article was first published on DataGeeek, and kindly contributed to R-bloggers.]

Forecasting financial markets, such as the STOXX Europe 600 Index, presents a classic Machine Learning challenge: the data is inherently noisy, non-stationary, and highly susceptible to sudden market events. To tackle this, we turn to Automated Machine Learning (AutoML)—specifically the powerful, scalable framework provided by H2O.ai and integrated into the R modeltime ecosystem.

This article dissects a full MLOps workflow, from data acquisition and feature engineering to model training and evaluation, revealing how a high-performance, low-variance model triumphed over the market’s volatility.

1. The Forecasting Pipeline: Building a Feature-Rich Model

The core strategy involved converting the univariate time series problem into a supervised regression problem by generating powerful explanatory variables.

A. Data & Splitting

#Install Development Version of modeltime.h2o
devtools::install_github("business-science/modeltime.h2o", force = TRUE)

library(tidymodels)
library(modeltime.h2o)
library(tidyverse)
library(timetk)
library(tidyquant) #provides tq_get() for downloading the index data

#STOXX Europe 600
df_stoxx <- 
  tq_get("^STOXX", to = "2025-10-31") %>% 
  select(date, stoxx = close) %>% 
  mutate(id = "id") %>% 
  filter(date >= last(date) - months(12)) %>% 
  drop_na()


#Train/Test Splitting
splits <-  
  df_stoxx %>% 
  time_series_split(
    assess     = "15 days", 
    cumulative = TRUE
  )
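
As a quick sanity check (a sketch not in the original post), timetk can plot the sampling plan to confirm where the 15-day assessment window falls:

#Visualize the train/test boundary (optional check)
splits %>% 
  tk_time_series_cv_plan() %>% 
  plot_time_series_cv_plan(date, stoxx, .interactive = FALSE)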

B. Feature Engineering (The Recipe)

A robust feature recipe (rec_spec) was designed to capture both time dependence and seasonality:

#Preprocessed data/Feature engineering
rec_spec <- 
  recipe(stoxx ~ date, data = training(splits)) %>% 
  step_timeseries_signature(date) %>% 
  step_lag(stoxx, lag = 1:2) %>% 
  step_fourier(date, period = 365.25, K = 1) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) %>% 
  step_naomit(all_predictors())

#Train 
train_tbl <- 
  rec_spec %>% 
  prep() %>% 
  bake(training(splits))

#Test
test_tbl  <- 
  rec_spec %>% 
  prep() %>% 
  bake(testing(splits))
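
At this stage the problem is fully tabular: the baked data holds the calendar signature, lag, and Fourier columns as plain numeric predictors. An optional glimpse() (not in the original post) confirms the structure:

#Inspect the engineered features (optional check)
train_tbl %>% glimpse()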

2. AutoML Execution: The Race Against the Clock

We initiated the H2O AutoML process using automl_reg() under strict resource constraints to quickly identify the most promising model type:

Parameter        | Value          | Rationale
max_runtime_secs | 5              | Time limit (in seconds) for the entire process.
max_models       | 3              | Limit on the number of base models to train.
exclude_algos    | "DeepLearning" | Excludes computationally expensive models for rapid prototyping.

#Initialize H2O
h2o.init(
  nthreads = -1,
  ip       = 'localhost',
  port     = 54321
)



#Model specification and fitting
model_spec <- automl_reg(mode = 'regression') %>%
  set_engine(
    engine                     = 'h2o',
    max_runtime_secs           = 5, 
    max_runtime_secs_per_model = 3,
    max_models                 = 3,
    nfolds                     = 5,
    exclude_algos              = c("DeepLearning"),
    verbosity                  = NULL,
    seed                       = 98765
  ) 


model_fitted <- 
  model_spec %>%
  fit(stoxx ~ ., data = train_tbl)

These tight constraints resulted in a leaderboard featuring only the fastest and highest-performing base algorithms:

Rank | Model ID      | Algorithm                 | Cross-Validation RMSE
1    | DRF_1_AutoML… | Distributed Random Forest | 3.99
2    | GBM_2_AutoML… | Gradient Boosting Machine | 4.20
3    | GLM_1_AutoML… | Generalized Linear Model  | 5.50

#Evaluation
model_fitted %>% 
  automl_leaderboard()
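
If a different leaderboard entry is preferred, modeltime.h2o's automl_update_model() can swap the fitted model by its id; a minimal sketch (the model id below is illustrative, not from this run):

#Swap in another leaderboard model, e.g. the GBM (illustrative id)
model_fitted_gbm <- 
  model_fitted %>% 
  automl_update_model("GBM_2_AutoML_1_20251031_00000")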

3. The Winner: Distributed Random Forest (DRF)

The Distributed Random Forest (DRF) emerged as the leader in the cross-validation phase, demonstrating superior generalization ability with the lowest Root Mean Squared Error (RMSE) of 3.99.

Why DRF Won: The Low Variance Advantage

The DRF model’s victory over the generally higher-accuracy Gradient Boosting Machine (GBM) is a powerful illustration of the Bias-Variance Trade-off in noisy data: by averaging many decorrelated trees (bagging), DRF reduces variance and resists overfitting the noise, whereas GBM’s sequential error-correction can chase spurious fluctuations in a non-stationary series.
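
A toy sketch of that effect (not from the post): averaging bootstrap resamples, as bagging does, shrinks prediction variance roughly by the number of independent learners averaged (less in practice, since real trees are correlated).

#Toy illustration of variance reduction by averaging (bagging)
set.seed(42)
noise <- rnorm(10000)
single_preds <- replicate(500, sample(noise, 1))                        #one high-variance "tree"
bagged_preds <- replicate(500, mean(sample(noise, 50, replace = TRUE))) #50-learner average
var(single_preds) #close to 1
var(bagged_preds) #close to 1/50: the bagging effect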

Test Set Performance

Calibrating the leading DRF model on the final 15-day test set confirmed its strong performance:

Metric    | DRF Test Set Value | Interpretation
RMSE      | 10.9               | A jump from the cross-validation RMSE (3.99), typical of non-stationary financial data, but still a strong result for market prediction.
R-Squared | 0.537              | The model explains over 53% of the variance in the unseen test data.

#Modeltime Table
model_tbl <- 
  modeltime_table(
    model_fitted
  )


#Calibration to test data
calib_tbl <- 
  model_tbl %>%
  modeltime_calibrate(
    new_data = test_tbl
  )

#Measure Test Accuracy
calib_tbl %>% 
  modeltime_accuracy()

Finally, we can construct prediction intervals, which in this context are read loosely like a Relative Strength Index (RSI): prices pushing against the interval bounds hint at overbought or oversold conditions.

#Prediction Intervals
calib_tbl %>%
  modeltime_forecast(
    new_data    = test_tbl,
    actual_data = test_tbl
  ) %>%
  plot_modeltime_forecast(
    .interactive = FALSE,
    .line_size = 1.5
  )  +
  labs(title = "Modeling with Automated ML for the STOXX Europe 600", 
       subtitle = "<span style = 'color:dimgrey;'>Predictive Intervals</span> of <span style = 'color:red;'>Distributed Random Forest</span> Model", 
       y = "", 
       x = "") + 
  scale_y_continuous(labels = scales::label_currency(prefix = "€")) +
  scale_x_date(labels = scales::label_date("%b %d"),
               date_breaks = "2 days") +
  theme_minimal(base_family = "Roboto Slab", base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = ggtext::element_markdown(face = "bold"),
        plot.background = element_rect(fill = "azure", color = "azure"),
        panel.background = element_rect(fill = "snow", color = "snow"),
        axis.text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1, 
                                   vjust = 1),
        legend.position = "none")
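
When the session is finished, the fitted model can be persisted with modeltime.h2o's save_h2o_model() and the cluster released; a minimal sketch (the path is illustrative):

#Save the model for later reuse and shut down H2O
model_fitted %>% save_h2o_model(path = "h2o_models/stoxx_drf")
h2o.shutdown(prompt = FALSE)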

NOTE: This article was generated with the support of an AI assistant. The final content and structure were reviewed and approved by the author.
