In this article, we integrate a non-standard Python forecasting model into the R Tidyverse/Tidymodels workflow, primarily by leveraging the reticulate package.
1. The Objective (The Challenge)
The goal was to take a powerful, custom Python model (nnetsauce's MTS wrapper around a cybooster BoosterRegressor) and integrate its outputs, predictions and prediction intervals, into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (plus interactive plotly).
The challenge was that this custom Python model is not natively recognized by R's standard forecasting packages such as modeltime.
2. The Integration Mechanism
The integration was achieved through the following steps:
- reticulate Bridge: We used the reticulate package to seamlessly call the Python model, pass R data (tbl_train), fit the model (regr$fit()), and execute the forecast (regr$predict()).
- Data Conversion: The raw forecast output (predictions and confidence intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
- Tidymodels Simulation (Initial Attempt): We attempted to force this tibble into a modeltime structure using dummy models and calibration, which failed due to rigid internal validation checks in plot_modeltime_forecast().
- Tidyverse Final Solution: We abandoned the rigid modeltime visualization functions and adopted the core Tidyverse approach:
  - Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the forecast data into the long format that ggplot2 expects (a minimal sketch of this reshaping follows the list).
  - Visualization: We used pure ggplot2 and plotly to manually draw the lines and ribbons and apply the desired STOXX aesthetic, bypassing all package incompatibility errors.
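To make the reshaping step concrete, here is a minimal, self-contained sketch with toy data. The column names mirror the real forecast table built in the full script at the end of this article, but the values are made up for illustration:

# Minimal sketch of the wide-to-long reshaping (toy data, illustrative only)
library(dplyr)
library(tidyr)
library(tibble)

toy_tbl <- tibble(
  .index      = as.Date("2025-01-01") + 0:2,
  .value      = c(100, 101, 103),      # actual closing prices
  .prediction = c(99.5, 101.4, 102.6)  # model forecasts
)

toy_long <- toy_tbl %>%
  pivot_longer(cols = c(.value, .prediction),
               names_to = ".key", values_to = "Y_Value")
# Each row now holds one series label (.key) and one value (Y_Value),
# exactly the shape ggplot2 needs to map color = .key and draw both lines.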
3. Key Findings
The process highlighted a critical statistical finding:
- Metric vs. Visual Discrepancy: The Python model's point forecast quality was excellent (low MAPE), but its internal prediction interval calculation (type_pi = "gaussian") was severely flawed (high Out-of-Band Error).
- Solution: To make the visualization reflect the low MAPE score, we calculated the true MAPE on the test set and manually generated a realistic, MAPE-based confidence interval for the final plot (a worked example follows).
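To make the interval construction concrete, take the decimal MAPE of 0.0284 (2.84%) used as the example in the code comments below: a point forecast of 100 then gets a half-width of 100 * 0.0284 * 1.96, about 5.57, i.e. an interval of roughly [94.43, 105.57]. A minimal sketch (toy numbers, not the article's actual results):

# Toy illustration of the MAPE-based interval (hypothetical numbers)
mape_dec   <- 0.0284  # test-set MAPE in decimal form
z_95       <- 1.96    # z-score for a 95% interval
pred       <- 100     # a point forecast
half_width <- pred * mape_dec * z_95
c(lower = pred - half_width, upper = pred + half_width)
#>   lower   upper
#> 94.4336 105.5664

Note that this is a pragmatic heuristic rather than a calibrated interval: it assumes the relative error is roughly symmetric and normally distributed around the point forecast.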
In essence, we created a robust R-centric pipeline for post-processing, validation, and visualization of forecasts generated by a custom Python machine learning model.
library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)

# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe",
           required = TRUE)

# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")

# iShares US Aerospace & Defense ETF (ITA)
df_ita <-
  tq_get("ITA") %>%
  select(date, close) %>%
  filter(date >= last(date) - months(36))

# Split the data, holding out the last 15 days as the test set
split <- time_series_split(df_ita, assess = "15 days", cumulative = TRUE)
tbl_train <- training(split)
tbl_test  <- testing(split)

# Import DecisionTreeRegressor from sklearn
skl_tree <- import("sklearn.tree")

# Define the base estimator (the core learning component, factored out for clarity).
# A decision tree with max depth 3 serves as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)

# Define the booster object (the main regression engine).
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
  base_estimator_instance,  # the weak learner used for boosting
  n_estimators = 100L       # number of boosting stages (trees to build)
)

# Define the MTS (multivariate time series) model wrapper.
# ns$MTS wraps the booster to handle time series features such as lags
# and prediction intervals.
regr <- ns$MTS(
  obj = booster_regressor,  # the underlying regression model (BoosterRegressor)
  # Time series parameters:
  lags = 20L,               # number of past observations (lags) used as predictors
  type_pi = "gaussian",     # method for computing prediction intervals (PI);
                            # 'gaussian' assumes normally distributed errors
                            # (NOTE: we found these intervals far too narrow,
                            # requiring the manual MAPE-based adjustment below)
  # Execution control:
  show_progress = TRUE      # display training progress (useful for monitoring)
)

# Convert the tibble to a data frame with dates as row names for fitting
df_train <-
  tbl_train %>%
  as.data.frame() %>%
  mutate(date = as.character(date))

df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date

# Fit the model
regr$fit(df)

# Step 1: Extract the Python predictions and create the base R data table
# --- Define a Python helper function and retrieve the predictions ---
# (The py_run_string() definition below must be executed before
#  py$get_full_data_11() can be called.)
reticulate::py_run_string("
def get_full_data_11(preds_obj):
    mean_list = preds_obj[0]['close'].tolist()
    lower_list = preds_obj[1]['close'].tolist()
    upper_list = preds_obj[2]['close'].tolist()
    return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")

preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)

preds_mean_vector  <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)

# Align the data sets based on the prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length)

# --- Create the base forecast table ---
forecast_data_tbl <- tibble(
  .index      = df_test_final$date,
  .value      = df_test_final$close,
  .prediction = preds_mean_vector,
  .conf_lo    = preds_lower_vector,  # the initially incorrect (too narrow) intervals
  .conf_hi    = preds_upper_vector,  # the initially incorrect (too narrow) intervals
  .model_desc = "nnetsauce_BoosterRegressor"
)

# Step 2: Calculate the actual MAPE and create realistic confidence intervals
library(dplyr)
library(tibble)
library(tidyr)

# --- Step 2.1: Calculate the actual MAPE on the test set ---
# Create a verification table containing the error metrics
check_tbl <-
  forecast_data_tbl %>%
  mutate(
    Error = .value - .prediction,
    # MAPE formula: abs(Actual - Prediction) / Actual * 100
    Absolute_Percentage_Error = abs(Error) / .value * 100
  )

# Actual MAPE (calculated on the test set), converted from percent
# to decimal form (e.g., 2.84% -> 0.0284)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100
print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))

# --- Step 2.2: Create new, realistic intervals based on MAPE ---
Z_SCORE_95 <- 1.96  # z-score for a 95% confidence interval

# Build the new, realistic prediction interval table
forecast_data_tbl_realistic <-
  check_tbl %>%
  mutate(
    # Error margin = prediction * MAPE * z-score
    Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
    # New confidence intervals
    .conf_lo_new = .prediction - Error_Amount,
    .conf_hi_new = .prediction + Error_Amount
  ) %>%
  # Replace the original columns with the new, wider intervals
  mutate(
    .conf_lo = .conf_lo_new,
    .conf_hi = .conf_hi_new
  ) %>%
  select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))

# Visualize the confidence intervals
# Visualization settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02
mape_value_for_title <- round(actual_mape_test * 100, 2)

COLOR_PREDICTION <- "red"      # prediction line
COLOR_ACTUAL     <- "black"    # actual values
COLOR_RIBBON     <- "dimgrey"  # confidence ribbon

# Pivot the base table to long format (required for ggplot)
forecast_tbl_long_final <-
  forecast_data_tbl_realistic %>%
  pivot_longer(cols = c(.value, .prediction),
               names_to = ".key", values_to = "Y_Value") %>%
  mutate(
    .conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
    .conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
  )

# Create the ggplot object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
  # Draw the confidence interval (ribbon)
  geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
              fill = COLOR_RIBBON, alpha = 0.6, color = NA,
              data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
  # Draw the lines
  geom_line(linewidth = 2.0) +
  # Zoom the y-axis
  coord_cartesian(ylim = c(min_y, max_y)) +
  # Colors and titles
  labs(title = "iShares US Aerospace & Defense ETF",
       subtitle = paste0("<span style='color:dimgrey;'>Predictive Intervals</span> based on TEST SET MAPE: ",
                         mape_value_for_title,
                         "% <br><span style='color:red;'>nnetsauce-Wrapped Booster Regressor (MTS) Model</span>"),
       x = "", y = "", color = "") +
  scale_color_manual(values = c(".value" = COLOR_ACTUAL, ".prediction" = COLOR_PREDICTION),
                     labels = c(".value" = "Actual Value", ".prediction" = "Prediction")) +
  scale_y_continuous(labels = scales::label_currency()) +
  scale_x_date(labels = scales::label_date("%b %d"), date_breaks = "2 days") +
  theme_minimal(base_family = "Roboto Slab", base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = ggtext::element_markdown(face = "bold"),
        plot.background = element_rect(fill = "azure", color = "azure"),
        panel.background = element_rect(fill = "snow", color = "snow"),
        axis.text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "none")

# Interactive output
plotly::ggplotly(p)
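As an optional sanity check on the hand-rolled MAPE computation, the same metric is available from yardstick (loaded as part of tidymodels above). This is a suggestion, not part of the original pipeline; note that yardstick reports the estimate in percent, so divide by 100 to compare with actual_mape_test:

# Optional cross-check of the manual MAPE using yardstick (percent scale)
forecast_data_tbl %>%
  yardstick::mape(truth = .value, estimate = .prediction)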
NOTE: This article was generated with the assistance of an AI model. The final content and structure were reviewed and approved by the author.