[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

“) training_loop(ds_train)

test_batch <- as_iterator(ds_test) %>% iter_next() encoded <- encoder(test_batch[[1]]) test_var <- tf$$math$$reduce_variance(encoded, axis = 0L) print(test_var %>% as.numeric() %>% round(5)) }

On to what we'll use as a baseline for comparison.

#### Vanilla LSTM

Here is the vanilla LSTM, stacking two layers, each, again, of size 32. Dropout and recurrent dropout were chosen individually
per dataset, as was the learning rate.

### Data preparation

For all experiments, data were prepared in the same way.

In every case, we used the first 10000 measurements available in the respective .pkl files [provided by Gilpin in his GitHub
repository](https://github.com/williamgilpin/fnn/tree/master/datasets). To save on file size and not depend on an external
data source, we extracted those first 10000 entries to .csv files downloadable directly from this blog's repo:

Should you want to access the complete time series (of considerably greater lengths), just download them from Gilpin's repo
and load them using reticulate:

Here is the data preparation code for the first dataset, geyser - all other datasets were treated the same way.

Now we're ready to look at how forecasting goes on our four datasets.

## Experiments

### Geyser dataset

People working with time series may have heard of [Old Faithful](https://en.wikipedia.org/wiki/Old_Faithful), a geyser in
Wyoming, US that has continually been erupting every 44 minutes to two hours since the year 2004. For the subset of data
Gilpin extracted[^3],

[^3]: see dataset descriptions in the [repository\'s README](https://github.com/williamgilpin/fnn)

> geyser_train_test.pkl corresponds to detrended temperature readings from the main runoff pool of the Old Faithful geyser
> in Yellowstone National Park, downloaded from the [GeyserTimes database](https://geysertimes.org/). Temperature measurements
> start on April 13, 2015 and occur in one-minute increments.

Like we said above, geyser.csv is a subset of these measurements, comprising the first 10000 data points. To choose an
adequate timestep for the LSTMs, we inspect the series at various resolutions:

<div class="figure">
<img src="images/geyser_ts.png" alt="Geyer dataset. Top: First 1000 observations. Bottom: Zooming in on the first 200." width="450" />
<p class="caption">(\#fig:unnamed-chunk-5)Geyer dataset. Top: First 1000 observations. Bottom: Zooming in on the first 200.</p>
</div>

It seems like the behavior is periodic with a period of about 40-50; a timestep of 60 thus seemed like a good try.

Having trained both FNN-LSTM and the vanilla LSTM for 200 epochs, we first inspect the variances of the latent variables on
the test set. The value of fnn_multiplier corresponding to this run was 0.7.

{}
V1     V2        V3          V4       V5       V6       V7       V8       V9      V10
0.258 0.0262 0.0000627 0.000000600 0.000533 0.000362 0.000238 0.000121 0.000518 0.000365

There is a drop in importance between the first two variables and the rest; however, unlike in the Lorenz system, V1 and V2 variances also differ by an order of magnitude.

Now, it’s interesting to compare prediction errors for both models. We are going to make an observation that will carry through to all three datasets to come.

Keeping up the suspense for a while, here is the code used to compute per-timestep prediction errors from both models. The same code will be used for all other datasets.

And here is the actual comparison. One thing especially jumps to the eye: FNN-LSTM forecast error is significantly lower for initial timesteps, first and foremost, for the very first prediction, which from this graph we expect to be pretty good!

Interestingly, we see “jumps” in prediction error, for FNN-LSTM, between the very first forecast and the second, and then between the second and the ensuing ones, reminding of the similar jumps in variable importance for the latent code! After the first ten timesteps, vanilla LSTM has caught up with FNN-LSTM, and we won’t interpret further development of the losses based on just a single run’s output.

Instead, let’s inspect actual predictions. We randomly pick sequences from the test set, and ask both FNN-LSTM and vanilla LSTM for a forecast. The same procedure will be followed for the other datasets.

Here are sixteen random picks of predictions on the test set. The ground truth is displayed in pink; blue forecasts are from FNN-LSTM, green ones from vanilla LSTM.

What we expect from the error inspection comes true: FNN-LSTM yields significantly better predictions for immediate continuations of a given sequence.

Let’s move on to the second dataset on our list.

### Electricity dataset

This is a dataset on power consumption, aggregated over 321 different households and fifteen-minute-intervals.

electricity_train_test.pkl corresponds to average power consumption by 321 Portuguese households between 2012 and 2014, in units of kilowatts consumed in fifteen minute increments. This dataset is from the UCI machine learning database. 1

Here, we see a very regular pattern:

With such regular behavior, we immediately tried to predict a higher number of timesteps (120) – and didn’t have to retract behind that aspiration.

For an fnn_multiplier of 0.5, latent variable variances look like this:

V1          V2            V3       V4       V5            V6       V7         V8      V9     V10
0.390 0.000637 0.00000000288 1.48e-10 2.10e-11 0.00000000119 6.61e-11 0.00000115 1.11e-4 1.40e-4

We definitely see a sharp drop already after the first variable.

How do prediction errors compare on the two architectures?

Here, FNN-LSTM performs better over a long range of timesteps, but again, the difference is most visible for immediate predictions. Will an inspection of actual predictions confirm this view?

It does! In fact, forecasts from FNN-LSTM are very impressive on all time scales.

Now that we’ve seen the easy and predictable, let’s approach the weird and difficult.

### ECG dataset

Says Gilpin,

ecg_train.pkl and ecg_test.pkl correspond to ECG measurements for two different patients, taken from the PhysioNet QT database. 2

How do these look?

To the layperson that I am, these do not look nearly as regular as expected. First experiments showed that both architectures are not capable of dealing with a high number of timesteps. In every try, FNN-LSTM performed better for the very first timestep.

This is also the case for n_timesteps = 12, the final try (after 120, 60 and 30). With an fnn_multiplier of 1, the latent variances obtained amounted to the following:

     V1        V2          V3        V4         V5       V6       V7         V8         V9       V10
0.110  1.16e-11     3.78e-9 0.0000992    9.63e-9  4.65e-5  1.21e-4    9.91e-9    3.81e-9   2.71e-8

There is a gap between the first variable and all other ones; but not much variance is explained by V1 either.

Apart from the very first prediction, vanilla LSTM shows lower forecast errors this time; however, we have to add that this was not consistently observed when experimenting with other timestep settings.

Looking at actual predictions, both architectures perform best when a persistence forecast is adequate – in fact, they produce one even when it is not.

On this dataset, we certainly would want to explore other architectures better able to capture the presence of high and low frequencies in the data, such as mixture models. But – were we forced to stay with one of these, and could do a one-step-ahead, rolling forecast, we’d go with FNN-LSTM.

Speaking of mixed frequencies – we haven’t seen the extremes yet …

### Mouse dataset

“Mouse”, that’s spike rates recorded from a mouse thalamus.

mouse.pkl A time series of spiking rates for a neuron in a mouse thalamus. Raw spike data was obtained from CRCNS and processed with the authors’ code in order to generate a spike rate time series. 3

Obviously, this dataset will be very hard to predict. How, after “long” silence, do you know that a neuron is going to fire?

As usual, we inspect latent code variances (fnn_multiplier was set to 0.4):

Again, we don’t see the first variable explaining much variance. Still, interestingly, when inspecting forecast errors we get a picture very similar to the one obtained on our first, geyser`, dataset:

So here, the latent code definitely seems to help! With every timestep “more” that we try to predict, prediction performance goes down continuously – or put the other way round, short-time predictions are expected to be pretty good!

Let’s see:

In fact on this dataset, the difference in behavior between both architectures is striking. When nothing is “supposed to happen”, vanilla LSTM produces “flat” curves at about the mean of the data, while FNN-LSTM takes the effort to “stay on track” as long as possible before also converging to the mean. Choosing FNN-LSTM – had we to choose one of these two – would be an obvious decision with this dataset.

## Discussion

When, in timeseries forecasting, would we consider FNN-LSTM? Judging by the above experiments, conducted on four very different datasets: Whenever we consider a deep learning approach. Of course, this has been a casual exploration – and it was meant to be, as – hopefully – was evident from the nonchalant and bloomy (sometimes) writing style.

Throughout the text, we’ve emphasized utility – how could this technique be used to improve predictions? But, looking at the above results, a number of interesting questions come to mind. We already speculated (though in an indirect way) whether the number of high-variance variables in the latent code was relatable to how far we could sensibly forecast into the future. However, even more intriguing is the question of how characteristics of the dataset itself affect FNN efficiency.

Such characteristics could be:

• How nonlinear is the dataset? (Put differently, how incompatible, as indicated by some form of test algorithm, is it with the hypothesis that the data generation mechanism was a linear one?)

• To what degree does the system appear to be sensitively dependent on initial conditions? In other words, what is the value of its (estimated, from the observations) highest Lyapunov exponent?

• What is its (estimated) dimensionality, for example, in terms of correlation dimension?

While it is easy to obtain those estimates, using, for instance, the nonlinearTseries package explicitly modeled after practices described in Kantz & Schreiber’s classic [@Kantz], we don’t want to extrapolate from our tiny sample of datasets, and leave such explorations and analyses to further posts, and/or the interested reader’s ventures :-). In any case, we hope you enjoyed the demonstration of practical usability of an approach that in the preceding post, was mainly introduced in terms of its conceptual attractivity.

1. again, citing from Gilpin’s repository’s README.↩︎

2. again, citing from Gilpin’s repository’s README.↩︎

3. again, citing from Gilpin’s repository’s README.↩︎

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)