Integrating Python Forecasting with R’s Tidyverse


In this article, we walk through a successful integration of a non-standard Python forecasting model into the R Tidyverse/Tidymodels framework, relying primarily on the reticulate package.

1. The Objective (The Challenge)

The goal was to take a powerful, custom Python model (nnetsauce's MTS wrapper around a cyb$BoosterRegressor) and integrate its outputs, both predictions and prediction intervals, into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (and interactive plotly).

The challenge was that this custom Python model is not natively recognized by R’s standard forecasting packages like modeltime.

2. The Integration Mechanism

The integration was achieved through the following steps:

  • reticulate Bridge: We used the reticulate package to seamlessly call the Python model, pass R data (tbl_train), fit the model (regr$fit()), and execute the forecast (regr$predict()); a toy sketch of this bridge follows the list.
  • Data Conversion: The raw forecast output (predictions and confidence intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
  • Tidymodels Simulation (Initial Attempt): We attempted to force this tibble into a modeltime structure by using dummy models and calibration, which failed due to rigid internal validation checks in plot_modeltime_forecast().
  • Tidyverse Final Solution: We abandoned the rigid modeltime visualization functions and adopted the core Tidyverse approach:
    1. Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the forecast data into the long format, which is optimal for ggplot2.
    2. Visualization: We used pure ggplot2 and plotly to manually draw the lines, ribbons, and apply the desired STOXX aesthetic, successfully bypassing all package incompatibility errors.
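The same bridge pattern, reduced to a toy sketch that assumes only reticulate and a Python environment with NumPy installed (the full pipeline used in this article appears in the code below):

np <- reticulate::import("numpy")
x  <- np$array(c(1, 2, 3))   # an R vector crosses into Python as a NumPy array
np$mean(x)                   # computed by Python, returned to R as the number 2

The article's code uses exactly this mechanism, swapping numpy for nnetsauce, cybooster, and sklearn.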

3. Key Findings

The process highlighted a critical statistical finding:

  • Metric vs. Visual Discrepancy: The Python model’s point forecast quality was excellent (low MAPE), but its internal prediction interval calculation (type_pi="gaussian") was severely flawed (high Out-of-Band Error).
  • Solution: To ensure the visualization reflected the low MAPE score, we calculated the true MAPE on the test set and manually generated a realistic, MAPE-based confidence interval for the final plot (a short arithmetic sketch follows this list).
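A quick illustration of that adjustment, using hypothetical numbers rather than the values computed below: the margin of error is the point forecast times the test-set MAPE times the 95% z-score.

# Hypothetical example of the MAPE-based interval (assumes MAPE = 2.84%, forecast = 150)
150 * 0.0284 * 1.96          # margin of error, about 8.35
c(150 - 8.35, 150 + 8.35)    # interval of roughly 141.7 to 158.3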

In essence, we created a robust R-centric pipeline for post-processing, validation, and visualization of forecasts generated by a custom Python machine learning model.

library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)

# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe", required = TRUE) 


# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")


#iShares US Aerospace & Defense ETF (ITA)
df_ita <- 
  tq_get("ITA") %>% 
  select(date, close) %>% 
  filter(date >= last(date) - months(36))

#Split Data 
split <- time_series_split(df_ita,
                           assess = "15 days",
                           cumulative = TRUE)


tbl_train <- training(split)
tbl_test <- testing(split)

# 1. import DecisionTreeRegressor from sklearn
skl_tree <- import("sklearn.tree")

# Define the Base Estimator (The core learning component, factored for clarity)
# This uses a Decision Tree with a max depth of 3 as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)

# Define the Booster Object (The main regression engine)
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
  base_estimator_instance, # The weak learner used for boosting
  n_estimators = 100L        # Number of boosting stages (number of trees to build)
)

# Define the MTS (Multi-variate Time Series) Model Wrapper
# ns$MTS wraps the booster model to handle time series features like lags and prediction intervals.
regr <- ns$MTS(
  obj = booster_regressor, # The fitted regression model (BoosterRegressor)
  
  # Time Series Parameters:
  lags = 20L,              # Number of past observations (lags) to use as predictors (X)
  type_pi = "gaussian",    # Method for computing Prediction Intervals (PI). 
  # 'gaussian' assumes errors are normally distributed.
  # (NOTE: We found this to be too narrow, requiring manual adjustment with MAPE.)
  
  # Execution Control:
  show_progress = TRUE     # Displays the training progress (useful for monitoring)
)


#Converting the tibble to data frame for the fitting process
df_train <- 
  tbl_train %>% 
  as.data.frame() %>% 
  mutate(date = as.character(date))

df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date


#Fit the model
regr$fit(df)

#Step 1: Extract Python Predictions and Create the Base R Data Table

# --- Define Python Helper and Retrieve Predictions ---
# (py_run_string() defines the helper in Python's main module, making it callable from R as py$get_full_data_11().)
reticulate::py_run_string("
def get_full_data_11(preds_obj):
    mean_list = preds_obj[0]['close'].tolist() 
    lower_list = preds_obj[1]['close'].tolist()
    upper_list = preds_obj[2]['close'].tolist()
    return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")

preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)

preds_mean_vector  <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)

# Align data sets based on prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length) 

# --- Create the Base Forecast Table ---
forecast_data_tbl <- tibble(
  .index = df_test_final$date,
  .value = df_test_final$close,
  .prediction = preds_mean_vector,
  .conf_lo    = preds_lower_vector, # These are the initially incorrect (too narrow) CIs
  .conf_hi    = preds_upper_vector, # These are the initially incorrect (too narrow) CIs
  .model_desc = "nnetsauce_BoosterRegressor"
)

#Step 2: Calculate the Actual MAPE and Create Realistic Confidence Intervals
library(dplyr)
library(tibble)
library(tidyr) 

# --- Step 2.1: Calculate the Actual MAPE on the Test Set ---

# Create a verification table containing error metrics
check_tbl <- forecast_data_tbl %>%
  mutate(
    Error = .value - .prediction,
    # MAPE formula: abs(Actual - Prediction) / Actual * 100
    Absolute_Percentage_Error = abs(Error) / .value * 100 
  )

# Actual MAPE (Calculated from the Test Set)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100 
# (Converting the percentage to decimal format: e.g., 2.84% -> 0.0284)

print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))
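# Optional sanity check: yardstick (loaded via tidymodels) provides mape_vec(),
# which returns MAPE as a percentage; dividing by 100 should closely match
# actual_mape_test computed above.
yardstick::mape_vec(truth = forecast_data_tbl$.value,
                    estimate = forecast_data_tbl$.prediction) / 100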


# --- Step 2.2: Create New, Realistic Intervals Based on MAPE ---

Z_SCORE_95 <- 1.96 # Z-Score for 95% Confidence Interval

# Create the New Realistic Prediction Interval Table
forecast_data_tbl_realistic <- check_tbl %>%
  mutate(
    # Error Margin = Prediction * MAPE * Z-Score
    Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
    
    # New Confidence Intervals
    .conf_lo_new = .prediction - Error_Amount, 
    .conf_hi_new = .prediction + Error_Amount
  ) %>%
  # Rename/replace columns to use the new, wider intervals
  mutate(
    .conf_lo = .conf_lo_new,
    .conf_hi = .conf_hi_new
  ) %>%
  select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))

#Visualize Confidence Intervals

# Visualization Settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98 
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02 
mape_value_for_title <- round(actual_mape_test * 100, 2)
COLOR_PREDICTION <- "red"   # prediction line
COLOR_ACTUAL <- "black"     # actual values line
COLOR_RIBBON <- "dimgrey" 

# Pivot the Base Table to Long Format (required for ggplot)
forecast_tbl_long_final <- forecast_data_tbl_realistic %>%
  pivot_longer(cols = c(.value, .prediction), 
               names_to = ".key", 
               values_to = "Y_Value") %>%
  mutate(
    .conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
    .conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
  )

# Create the ggplot Object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
  
  # Draw the Confidence Interval (Ribbon)
  geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
              fill = COLOR_RIBBON, 
              alpha = 0.6, 
              color = NA,
              data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
  
  # Draw the Lines
  geom_line(linewidth = 2.0) +
  
  # Y-Axis Zoom
  coord_cartesian(ylim = c(min_y, max_y)) + 
  
  # Colors and Titles
  labs(title = "iShares US Aerospace & Defense ETF",
       subtitle = paste0("<span style='color:dimgrey;'>Predictive Intervals</span> based on TEST SET MAPE: ", mape_value_for_title, "% <br><span style='color:red;'>nnetsauce-Wrapped Booster Regressor (MTS) Model</span>"),
       x = "", 
       y = "",
       color = "") + 
  
  scale_color_manual(values = c(".value" = COLOR_ACTUAL, 
                                ".prediction" = COLOR_PREDICTION),
                     labels = c(".value" = "Actual Value", 
                                ".prediction" = "Prediction")) +
  scale_y_continuous(labels = scales::label_currency()) +
  scale_x_date(labels = scales::label_date("%b %d"),
               date_breaks = "2 days") +
  theme_minimal(base_family = "Roboto Slab", base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = ggtext::element_markdown(face = "bold"),
        plot.background = element_rect(fill = "azure", color = "azure"),
        panel.background = element_rect(fill = "snow", color = "snow"),
        axis.text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1, 
                                   vjust = 1),
        legend.position = "none")


# Interactive Output
plotly::ggplotly(p)

NOTE: This article was generated with the assistance of an AI model. The final content and structure were reviewed and approved by the author.
