
Enhancing Time Series Forecasting (ahead::ridge2f) with Attention-Based Context Vectors (ahead::contextridge2f)

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers.]

Introduction

In this post, I’ll introduce ahead::contextridge2f(), a novel forecasting function that combines doubly-constrained Random Vector Functional Link (RVFL) networks with attention-based context vectors, with the aim of improving prediction accuracy.

The Core Idea

The key insight is simple but powerful: not all past observations are equally relevant for predicting the future. An attention mechanism learns to assign different weights to historical values based on their relevance to the current time point.

Instead of treating the time series as a simple sequence, we compute context vectors—weighted summaries of the historical data where the weights are determined by an attention mechanism. These context vectors then serve as external regressors in a doubly-constrained Random Vector Functional Link (RVFL) network.

What is Doubly-Constrained RVFL?

RVFL networks, as implemented in ridge2f() (Moudiki et al., 2018), are a type of randomized neural network that:

  1. Use random or quasi-random hidden layer weights that are not trained (computational efficiency)
  2. Include direct input-to-output connections (preserves linear relationships)
  3. Apply dual constraints via ridge penalties on both:
    • Direct connections (λ₁)
    • Hidden layer outputs (λ₂)

This architecture combines the expressiveness of neural networks with the simplicity and speed of linear models, making it particularly well-suited for time series forecasting.
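To make the two penalties concrete, here is a minimal toy fit of a doubly-constrained RVFL in base R. This is an illustration only, not ahead::ridge2f()'s actual implementation (which, among other differences, can use quasi-random hidden weights and standardizes its inputs); the variable names and penalty values are invented for the sketch.

```r
set.seed(123)
n <- 100; p <- 3; n_hidden <- 10
X <- matrix(rnorm(n * p), n, p)                # inputs (e.g., lagged values)
y <- X %*% c(1, -0.5, 0.25) + sin(X[, 1]) + rnorm(n, sd = 0.1)

W <- matrix(rnorm(p * n_hidden), p, n_hidden)  # random hidden weights, never trained
H <- tanh(X %*% W)                             # hidden layer outputs
Z <- cbind(X, H)                               # direct connections + hidden features

lambda1 <- 0.1; lambda2 <- 1                   # separate ridge penalties
D <- diag(c(rep(lambda1, p), rep(lambda2, n_hidden)))
coefs <- solve(crossprod(Z) + D, crossprod(Z, y))  # closed-form ridge solution
fitted <- Z %*% coefs
```

Because the hidden weights are fixed, "training" is a single regularized least-squares solve, which is where the speed advantage over backpropagated networks comes from.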

What Are Context Vectors?

A context vector at time t is a weighted sum of all previous observations:

context[t] = Σ(attention_weight[t,j] × series[j]) for j ≤ t

Where attention_weight[t,j] represents how much time point j contributes to our understanding of time t.
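The formula above can be sketched in a few lines of base R for the exponential case. This is a simplified stand-in for ahead::computeattention(), which is implemented in C++ and may normalize its weights differently; the helper name context_at is invented here.

```r
# Exponential-attention context vector for a single time point t:
# a weighted sum of series[1..t], with more weight on recent points
context_at <- function(series, t, decay_factor = 5.0) {
  j <- seq_len(t)                       # causal: only j <= t contribute
  w <- exp(-(t - j) / decay_factor)     # weights decay with distance t - j
  w <- w / sum(w)                       # normalize to sum to 1
  sum(w * series[j])
}

y <- as.numeric(AirPassengers)
context_at(y, t = 24)  # weighted summary of the first 24 observations
```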

Different attention mechanisms produce different weighting schemes; the available types are summarized in the table further below.

The Function: ahead::contextridge2f()

Here’s the implementation:

contextridge2f <- function(y,
                           h = 5L,
                           split_fraction = 0.8,
                           attention_type = "exponential",
                           window_size = 3,
                           decay_factor = 5.0,
                           temperature = 1.0,
                           sigma = 1.0,
                           sensitivity = 1.0,
                           alpha = 0.5,
                           beta = 0.5,
                           ...)
{
  ctx_result <- ahead::computeattention(
    series = y,
    attention_type = attention_type,
    window_size = window_size,
    decay_factor = decay_factor,
    temperature = temperature,
    sigma = sigma,
    sensitivity = sensitivity,
    alpha = alpha,
    beta = beta
  )
  
  return(ahead::ridge2f(
    y = y,
    h = h,
    xreg = ctx_result$context_vectors,
    ...
  ))
}

The function:

  1. Computes attention weights for the entire time series
  2. Generates context vectors from these weights
  3. Passes them as external regressors (xreg) to ridge2f()
  4. Returns forecasts enhanced by attention-weighted historical information

Example: AirPassengers Data

Let’s see this in action with the classic AirPassengers dataset:

library(ahead)

# Generate forecasts with attention-based context vectors
result <- ahead::contextridge2f(
  AirPassengers, 
  lags = 15L,      # Use 15 lagged values
  h = 15L,         # Forecast 15 steps ahead
  attention_type = "exponential",
  decay_factor = 5.0
)

# Visualize
plot(result)


# Other example
plot(ahead::contextridge2f(fdeaths, h = 20, lags = 15,
                           attention_type = "exponential"))

What would make this approach effective?

1. Adaptive Weighting

Unlike fixed lag structures, attention mechanisms adapt the influence of past observations based on the data’s characteristics.

2. Captures Long-Range Dependencies

By computing weighted sums over the entire history, context vectors can capture patterns that extend beyond fixed window sizes.

3. Multiple Perspectives

Different attention mechanisms capture different aspects of temporal structure:

4. RVFL Architecture Benefits

The doubly-constrained RVFL network (Moudiki et al., 2018) provides:

5. Computational Efficiency

Context vectors are pre-computed once, and RVFL training is much faster than standard neural networks.

How It Compares to Standard RVFL

Standard doubly-constrained RVFL for time series uses lagged values directly:

# Standard approach
ahead::ridge2f(y, h = 15, lags = 15)

Our attention-enhanced version adds context vectors that encode weighted historical information:

# Attention-enhanced approach
ahead::contextridge2f(y, h = 15, lags = 15, attention_type = "exponential")

The context vectors provide additional features that capture temporal patterns the raw lags might miss. The RVFL network then learns both the linear contribution of the raw lags (through its direct connections) and the nonlinear patterns encoded in the attention-weighted context vectors (through its hidden layer).

Choosing Attention Types

Different attention mechanisms suit different data patterns:

| Attention Type | Best For                   | Key Parameter             |
|----------------|----------------------------|---------------------------|
| exponential    | General use, smooth trends | decay_factor              |
| gaussian       | Seasonal patterns          | sigma                     |
| value_based    | Regime changes             | sensitivity               |
| hybrid         | Complex patterns           | decay_factor, sensitivity |
| cosine         | Local similarity           | window_size               |
| linear         | Simple recency bias        | None                      |
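The distance-based schemes in the table can be sketched as simple functions of the lag distance d = t − j. These formulas are plausible illustrations keyed to the parameter names above; the exact expressions inside ahead::computeattention() may differ.

```r
# Unnormalized attention weight as a function of distance d = t - j
w_exponential <- function(d, decay_factor = 5) exp(-d / decay_factor)
w_gaussian    <- function(d, sigma = 1)        exp(-d^2 / (2 * sigma^2))
w_linear      <- function(d, t)                1 - d / t  # simple recency bias

d <- 0:10
round(w_exponential(d), 3)        # smooth geometric decay
round(w_gaussian(d, sigma = 3), 3)  # bell-shaped, drops off faster in the tails
```

In each case the weights would then be normalized over j ≤ t so that they sum to one, as in the context-vector formula earlier.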

For the AirPassengers data, exponential attention works well because recent observations are highly informative for future trends and seasonal patterns.

Why RVFL Instead of Standard Neural Networks?

The doubly-constrained RVFL approach (Moudiki et al., 2018) offers several advantages over traditional neural networks:

Speed

Hidden layer weights are random and never trained, so fitting reduces to solving a ridge regression rather than running iterative backpropagation.

Simplicity

There are only a few hyperparameters to tune (the number of hidden nodes and the two ridge penalties), rather than optimizers, learning rates, and training schedules.

Dual Regularization

Separate ridge penalties on the direct connections (λ₁) and the hidden layer outputs (λ₂) control overfitting in both the linear and nonlinear parts of the model.

Architecture

Input (lags + context) → [Random Hidden Layer] → Output
                    ↘                         ↗
                      [Direct Connections]

The direct connections preserve linear relationships while random hidden layers capture nonlinearities—best of both worlds.

Tuning Parameters

Decay Factor (for exponential/hybrid)

Window Size (for cosine)

Sensitivity (for value-based/hybrid)

Implementation Details

The underlying ahead::computeattention() function is implemented in C++ (via Rcpp) for efficiency, computing:

  1. Attention weights: An n×n matrix where entry (i,j) represents the attention weight of time j on time i
  2. Context vectors: Weighted sums using these attention weights

The attention computation enforces causal constraints—time point t can only attend to observations at times j ≤ t, ensuring no future information leakage.
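The causal structure described above means the attention matrix is lower-triangular after masking. A small base-R sketch (not the C++ implementation) makes this easy to verify:

```r
# Build an n x n exponential attention matrix with the causal constraint:
# row i holds the normalized weights over time points j <= i only
attention_matrix <- function(n, decay_factor = 5.0) {
  A <- matrix(0, n, n)
  for (i in seq_len(n)) {
    w <- exp(-(i - seq_len(i)) / decay_factor)
    A[i, seq_len(i)] <- w / sum(w)   # each row sums to 1 over j <= i
  }
  A
}

A <- attention_matrix(6)
all(A[upper.tri(A)] == 0)  # TRUE: zero weight on future observations
```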

Practical Considerations

When to Use This Approach

Good fit:

May not help:

Computational Cost

Context vector computation is O(n²) due to the attention matrix, but it is implemented in C++ (via Rcpp) and only needs to run once before model fitting.

Extensions and Future Work

Several interesting extensions are possible:

  1. Multi-head attention: Combine multiple attention types
  2. Learned parameters: Optimize attention parameters via cross-validation
  3. Multivariate attention: Extend to multiple time series with cross-series attention
  4. Hierarchical attention: Different attention at different time scales
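Extension 2 (learned parameters) can be sketched with a simple holdout search in base R. To keep the sketch self-contained, the "model" here is deliberately minimal: the context vector itself serves as a one-step-ahead prediction. In practice you would score ahead::contextridge2f() forecasts instead; the helper names below are invented for the illustration.

```r
# Select decay_factor by holdout error of a context-only predictor
context_at <- function(series, t, decay_factor) {
  j <- seq_len(t)
  w <- exp(-(t - j) / decay_factor)
  sum(w / sum(w) * series[j])
}

y <- as.numeric(AirPassengers)
train_end <- floor(0.8 * length(y))
grid <- c(1, 2, 5, 10, 20)

holdout_mse <- sapply(grid, function(df) {
  preds <- sapply(train_end:(length(y) - 1),
                  function(t) context_at(y, t, df))     # predict y[t + 1]
  mean((preds - y[(train_end + 1):length(y)])^2)
})
grid[which.min(holdout_mse)]  # selected decay_factor
```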

Conclusion

The ahead::contextridge2f() function demonstrates how attention mechanisms (widely applied in deep learning) can potentially enhance doubly-constrained RVFL networks for time series forecasting. By computing context vectors that encode weighted historical information, we give the model additional features that capture complex temporal dependencies.

The approach combines attention-based context vectors, which adaptively weight historical observations, with the speed and dual regularization of doubly-constrained RVFL networks, and it remains fast enough for routine use since the attention weights are computed once in C++.

For the AirPassengers example, the attention-enhanced RVFL forecasts successfully capture both the upward trend and seasonal fluctuations, extending the pattern 15 months into the future.

Try It Yourself

# Install packages (if needed)
# install.packages("ahead")
# devtools::install_github("Techtonique/ahead")  # for computeattention

library(ahead)

# Basic usage
result <- ahead::contextridge2f(AirPassengers, h = 10)

# With custom attention
result2 <- ahead::contextridge2f(
  AirPassengers,
  h = 15,
  attention_type = "hybrid",
  decay_factor = 7.0,
  sensitivity = 1.5,
  lags = 12
)

plot(result2)

Code Availability

The complete implementation is available in the Techtonique GitHub repository.


References:

Moudiki, T., Planchet, F., & Cousin, A. (2018). Multiple Time Series Forecasting Using Quasi-Randomized Functional Link Neural Networks. Risks, 6(1), 22.

Keywords: time series forecasting, attention mechanisms, RVFL networks, doubly-constrained regularization, context vectors, machine learning, R programming
