In this article, we integrate a non-standard Python forecasting model into the R Tidyverse/Tidymodels framework, relying primarily on the reticulate package.
1. The Objective (The Challenge)
The goal was to take a powerful, custom Python model (nnetsauce's MTS wrapper around a cybooster BoosterRegressor) and integrate its outputs (predictions and prediction intervals) into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (and interactive plotly).
The challenge was that this custom Python model is not natively recognized by R’s standard forecasting packages like modeltime.
2. The Integration Mechanism
The integration was achieved through the following steps:
- reticulate Bridge: We used the reticulate package to seamlessly call the Python model, pass R data (tbl_train), fit the model (regr$fit()), and execute the forecast (regr$predict()).
- Data Conversion: The raw forecast output (predictions and confidence intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
- Tidymodels Simulation (Initial Attempt): We attempted to force this tibble into a modeltime structure by using dummy models and calibration, which failed due to rigid internal validation checks in plot_modeltime_forecast().
- Tidyverse Final Solution: We abandoned the rigid modeltime visualization functions and adopted the core Tidyverse approach:
  - Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the forecast data into the long format, which is optimal for ggplot2.
  - Visualization: We used pure ggplot2 and plotly to manually draw the lines and ribbons and apply the desired STOXX aesthetic, successfully bypassing all package incompatibility errors.
3. Key Findings
The process highlighted a critical statistical finding:
- Metric vs. Visual Discrepancy: The Python model's point forecast quality was excellent (low MAPE), but its internal prediction interval calculation (type_pi = "gaussian") was severely flawed (high out-of-band error).
- Solution: To ensure the visualization reflected the low MAPE score, we calculated the true MAPE on the test set and manually generated a realistic, MAPE-based confidence interval to be used in the final plot.
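The out-of-band error mentioned above is simply the share of actual test values that fall outside the predicted interval. A minimal base-R sketch with hypothetical numbers:

```r
# Out-of-band (OOB) error: fraction of actuals outside the [lo, hi] band.
# All numbers below are hypothetical, for illustration only.
actual <- c(100, 102, 98, 105)
lo     <- c(99, 101, 99, 100)    # lower interval bounds
hi     <- c(101, 103, 101, 104)  # upper interval bounds

oob_rate <- mean(actual < lo | actual > hi)
oob_rate  # two of four actuals fall outside the band, so 0.5
```

A well-calibrated 95% interval should leave an OOB rate near 0.05; the Gaussian intervals from the model were far above that.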
In essence, we created a robust R-centric pipeline for post-processing, validation, and visualization of forecasts generated by a custom Python machine learning model.
library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)
# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe", required = TRUE)
# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")
#iShares US Aerospace & Defense ETF (ITA)
df_ita <-
tq_get("ITA") %>%
select(date, close) %>%
filter(date >= last(date) - months(36))
#Split Data
split <- time_series_split(df_ita,
assess = "15 days",
cumulative = TRUE)
tbl_train <- training(split)
tbl_test <- testing(split)
# 1. import DecisionTreeRegressor from sklearn
skl_tree <- import("sklearn.tree")
# Define the Base Estimator (The core learning component, factored for clarity)
# This uses a Decision Tree with a max depth of 3 as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)
# Define the Booster Object (The main regression engine)
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
base_estimator_instance, # The weak learner used for boosting
n_estimators = 100L # Number of boosting stages (number of trees to build)
)
# Define the MTS (Multi-variate Time Series) Model Wrapper
# ns$MTS wraps the booster model to handle time series features like lags and prediction intervals.
regr <- ns$MTS(
obj = booster_regressor, # The fitted regression model (BoosterRegressor)
# Time Series Parameters:
lags = 20L, # Number of past observations (lags) to use as predictors (X)
type_pi = "gaussian", # Method for computing Prediction Intervals (PI).
# 'gaussian' assumes errors are normally distributed.
# (NOTE: We found this to be too narrow, requiring manual adjustment with MAPE.)
# Execution Control:
show_progress = TRUE # Displays the training progress (useful for monitoring)
)
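The "gaussian" option computes intervals under a normality assumption: roughly, the point forecast plus or minus a normal quantile times the residual standard deviation. A minimal sketch of that formula, with made-up numbers:

```r
# Gaussian prediction interval sketch: forecast +/- z * residual SD.
# point and resid_sd are hypothetical values for illustration only.
point    <- 100
resid_sd <- 1.5
z        <- qnorm(0.975)  # ~1.96 for a 95% interval

c(lower = point - z * resid_sd,
  upper = point + z * resid_sd)
```

If the residuals are not actually Gaussian, or their SD is underestimated, the band comes out too narrow, which is exactly what we observed here.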
#Converting the tibble to data frame for the fitting process
df_train <-
tbl_train %>%
as.data.frame() %>%
mutate(date = as.character(date))
df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date
#Fit the model
regr$fit(df)
#Step 1: Extract Python Predictions and Create the Base R Data Table
# --- Define Python Function and Retrieve Predictions ---
# (Ensure this function has been run in your R session previously.)
reticulate::py_run_string("
def get_full_data_11(preds_obj):
mean_list = preds_obj[0]['close'].tolist()
lower_list = preds_obj[1]['close'].tolist()
upper_list = preds_obj[2]['close'].tolist()
return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")
preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)
preds_mean_vector <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)
# Align data sets based on prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length)
# --- Create the Base Forecast Table ---
forecast_data_tbl <- tibble(
.index = df_test_final$date,
.value = df_test_final$close,
.prediction = preds_mean_vector,
.conf_lo = preds_lower_vector, # These are the initially incorrect (too narrow) CIs
.conf_hi = preds_upper_vector, # These are the initially incorrect (too narrow) CIs
.model_desc = "nnetsauce_BoosterRegressor"
)
#Step 2: Calculate the Actual MAPE and Create Realistic Confidence Intervals
library(dplyr)
library(tibble)
library(tidyr)
# --- Step 2.1: Calculate the Actual MAPE on the Test Set ---
# Create a verification table containing error metrics
check_tbl <- forecast_data_tbl %>%
mutate(
Error = .value - .prediction,
# MAPE formula: abs(Actual - Prediction) / Actual * 100
Absolute_Percentage_Error = abs(Error) / .value * 100
)
# Actual MAPE (Calculated from the Test Set)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100
# (Converting the percentage to decimal format: e.g., 2.84% -> 0.0284)
print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))
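For reference, the MAPE computed above reduces to a one-liner in base R. A toy example with hypothetical values:

```r
# MAPE in decimal form: mean of |actual - prediction| / actual.
actual <- c(100, 102, 98)   # hypothetical actuals
pred   <- c(101, 100, 99)   # hypothetical predictions

mape <- mean(abs(actual - pred) / actual)
mape  # roughly 0.0133, i.e. about 1.33%
```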
# --- Step 2.2: Create New, Realistic Intervals Based on MAPE ---
Z_SCORE_95 <- 1.96 # Z-Score for 95% Confidence Interval
# Create the New Realistic Prediction Interval Table
forecast_data_tbl_realistic <- check_tbl %>%
mutate(
# Error Margin = Prediction * MAPE * Z-Score
Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
# New Confidence Intervals
.conf_lo_new = .prediction - Error_Amount,
.conf_hi_new = .prediction + Error_Amount
) %>%
# Rename/replace columns to use the new, wider intervals
mutate(
.conf_lo = .conf_lo_new,
.conf_hi = .conf_hi_new
) %>%
select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))
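The interval arithmetic above is easier to see with concrete numbers. A base-R sketch using hypothetical values:

```r
# MAPE-based interval: margin = prediction * MAPE * z-score.
pred <- 100      # hypothetical point forecast
mape <- 0.02     # hypothetical test-set MAPE (2%, in decimal form)
z    <- 1.96     # z-score for a 95% interval

margin <- pred * mape * z                  # 3.92
c(lo = pred - margin, hi = pred + margin)  # 96.08 and 103.92
```

In other words, a 2% MAPE yields a band of roughly +/- 3.9% around each prediction, much wider (and more honest) than the model's original Gaussian band.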
#Visualize Confidence Intervals
# Visualization Settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02
mape_value_for_title <- round(actual_mape_test * 100, 2)
COLOR_PREDICTION <- "red"     # Prediction line
COLOR_ACTUAL <- "black"       # Actual values line
COLOR_RIBBON <- "dimgrey"     # Confidence interval ribbon fill
# Pivot the Base Table to Long Format (required for ggplot)
forecast_tbl_long_final <- forecast_data_tbl_realistic %>%
pivot_longer(cols = c(.value, .prediction),
names_to = ".key",
values_to = "Y_Value") %>%
mutate(
.conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
.conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
)
# Create the ggplot Object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
# Draw the Confidence Interval (Ribbon)
geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
fill = COLOR_RIBBON,
alpha = 0.6,
color = NA,
data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
# Draw the Lines
geom_line(linewidth = 2.0) +
# Y-Axis Zoom
coord_cartesian(ylim = c(min_y, max_y)) +
# Colors and Titles
labs(title = "iShares US Aerospace & Defense ETF",
subtitle = paste0("<span style='color:dimgrey;'>Predictive Intervals</span> based on TEST SET MAPE: ", mape_value_for_title, "% <br><span style='color:red;'>nnetsauce-Wrapped Booster Regressor (MTS) Model</span>"),
x = "",
y = "",
color = "") +
scale_color_manual(values = c(".value" = COLOR_ACTUAL,
".prediction" = COLOR_PREDICTION),
labels = c(".value" = "Actual Value",
".prediction" = "Prediction")) +
scale_y_continuous(labels = scales::label_currency()) +
scale_x_date(labels = scales::label_date("%b %d"),
date_breaks = "2 days") +
theme_minimal(base_family = "Roboto Slab", base_size = 16) +
theme(plot.title = element_text(face = "bold", size = 16),
plot.subtitle = ggtext::element_markdown(face = "bold"),
plot.background = element_rect(fill = "azure", color = "azure"),
panel.background = element_rect(fill = "snow", color = "snow"),
axis.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45,
hjust = 1,
vjust = 1),
legend.position = "none")
# Interactive Output
plotly::ggplotly(p)
NOTE: This article was generated with the assistance of an AI model. The final content and structure were reviewed and approved by the author.
