**R Views**, and kindly contributed to R-bloggers)

Today, we continue our work on sampling so that we can run models on subsets of our data and then test the accuracy of the models on data not included in those subsets. In the machine learning prediction world, these two data sets are often called `training`

data and `testing`

data, but we’re not going to do any machine learning prediction today. We’ll stay with our good’ol Fama French regression models for the reasons explained last time: the goal is to explore a new method of sampling our data and I prefer to do that in the context of a familiar model and data set. In 2019, we’ll start expanding our horizons to different models and data, but for today it’s more of a tools exploration.

Loyal readers of this blog (if that’s you, thank you!) will remember that we undertook a similar project in the previous post, when we used k-fold cross-validation. Today, we will use rolling origin sampling of the data, which differs from k-fold cross-validation in the sense that with rolling origin we explicitly sample based on the dates of our observation. With rolling origin, our first sample starts on the first day, our second sample starts on the second day, our third sample starts on third day. Or, we could have the second sample start on the twentieth day, or we could have it start again on the first day and just include an extra day. Either way, we are aware of the order of our data when sampling. With k-fold cross-validation, we were randomly shuffling and then selecting observations. This distinction can be particularly important for economic time series where we believe that the order of our observations is important.

Let’s get to it.

First, we need our data and, as usual, we’ll import data for daily prices of 5 ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the 5 Fama French factor data and join it to our 5 ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post:

```
library(tidyverse)
library(tidyquant)
library(rsample)
library(broom)
library(timetk)
symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")
# The prices object will hold our daily price data.
prices <-
getSymbols(symbols, src = 'yahoo',
from = "2012-12-31",
to = "2017-12-31",
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<-`(symbols)
asset_returns_long <-
prices %>%
tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
gather(asset, returns, -date) %>%
group_by(asset) %>%
mutate(returns = (log(returns) - log(lag(returns)))) %>%
na.omit()
factors_data_address <-
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"
factors_csv_name <- "Global_5_Factors_Daily.csv"
temp <- tempfile()
download.file(
# location of file to be downloaded
factors_data_address,
# where we want R to store that file
temp,
quiet = TRUE)
Global_5_Factors <-
read_csv(unz(temp, factors_csv_name), skip = 6 ) %>%
rename(date = X1, MKT = `Mkt-RF`) %>%
mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>%
mutate_if(is.numeric, funs(. / 100)) %>%
select(-RF)
data_joined_tidy <-
asset_returns_long %>%
left_join(Global_5_Factors, by = "date") %>%
na.omit()
```

After running that code, we have an object called `data_joined_tidy`

. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF rows.

```
data_joined_tidy %>%
slice(1)
```

```
# A tibble: 5 x 8
# Groups: asset [5]
date asset returns MKT SMB HML RMW CMA
```
1 2013-01-02 AGG -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
3 2013-01-02 EFA 0.0154 0.0199 -0.0043 0.0028 -0.0028 -0.0023
4 2013-01-02 IJS 0.0271 0.0199 -0.0043 0.0028 -0.0028 -0.0023
5 2013-01-02 SPY 0.0253 0.0199 -0.0043 0.0028 -0.0028 -0.0023

For today, let’s work with just the daily returns of EEM.

```
eem_for_roll <- data_joined_tidy %>%
filter(asset == "EEM")
```

We are going to regress those EEM daily returns on the Fama French factors and need a way to measure the accuracy of our various factor models. Previously, we measured our models by looking at the adjusted R-squared, when the models were run on the entirety of the data. Last time, we use k-fold cross-validation to split the data into a bunch of subsets, then ran our model on the subsets and measured the accuracy against the holdouts from the subsets. Now, let’s use the `rolling_origin()`

function from `rsample`

to split our data into `analysis`

and `assessment`

sets in a time-aware way, then calculate RMSEs.

The `rolling_origin()`

function needs a few arguments set. We first set `data`

to be `eem_for_roll`

Then, we assign `initial`

to be `50`

– this tells the function that the size of our first sample is 50 days. Our first chunk of `analysis`

data will be the first 50 days of EEM returns. Next, we assign `assess`

to be `5`

– this tells the function that our `assessment`

data is the 5 days of EEM returns following those first 50 days. Finally, we set `cumulative`

to be `FALSE`

– this tells the functions that each of splits is a 50 days. The first split is the first 50 days, starting on day 1 and ending on day 50. The next split starts on day 2 and ends on day 51. If we were to set `cumulative = TRUE`

, the first split would be 50 days. The next split would be 51 days, the next split would be 52 days. And so on. The `analysis`

split days would be accumulating. For now, we will leave it at `cumulative = FALSE`

.

For that reason, we will append `_sliding`

to the name of our object because the start of our window will slide each time.

```
library(rsample)
roll_eem_sliding <-
rolling_origin(
data = eem_for_roll,
initial = 50,
assess = 10,
cumulative = FALSE
)
```

Look at an individual split.

```
one_eem_split <-
roll_eem_sliding$splits[[1]]
one_eem_split
```

`<50/10/1259>`

That `50`

is telling us there are 50 days or rows in the `analysis`

set; that `10`

is telling us that there are 10 rows in our `assessment`

data – we’ll see how well our model predicts the return 5 days after the last observation in our data.

Here is the `analysis`

subset of that split.

```
analysis(one_eem_split) %>%
head()
```

```
# A tibble: 6 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
```
1 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-03 EEM -0.00710 -0.0021 0.00120 0.000600 0.0008 0.0013
3 2013-01-04 EEM 0.00200 0.0064 0.0011 0.0056 -0.0043 0.0036
4 2013-01-07 EEM -0.00759 -0.0014 0.00580 0 -0.0015 0.0001
5 2013-01-08 EEM -0.00900 -0.0027 0.0018 -0.00120 -0.0002 0.00120
6 2013-01-09 EEM 0.00428 0.0036 0.000300 0.0025 -0.0028 0.001

And the `assessment`

subset – this is 10 rows.

`assessment(one_eem_split)`

```
# A tibble: 10 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
```
1 2013-03-15 EEM -0.00908 0.0025 0.006 -0.0001 0.0018 -0.0001
2 2013-03-18 EEM -0.0113 -0.0093 0.0021 -0.0039 0.0039 -0.001
3 2013-03-19 EEM -0.00712 -0.003 0.0015 -0.0023 0.0017 0.0028
4 2013-03-20 EEM 0.00570 0.0052 -0.003 0.0023 -0.002 0.002
5 2013-03-21 EEM -0.0102 -0.0037 0.0085 -0.0007 0.00120 0.0008
6 2013-03-22 EEM 0.00382 0.0033 -0.0042 -0.0023 0.0022 -0.0007
7 2013-03-25 EEM -0.000954 -0.0037 0.0036 -0.0032 0.0027 0.000600
8 2013-03-26 EEM 0.0142 0.0039 -0.0032 -0.0017 0.00120 -0.0017
9 2013-03-27 EEM 0.00376 -0.0016 0.0022 -0.0004 -0.0002 -0.0008
10 2013-03-28 EEM 0.00211 0.0033 -0.0022 -0.0031 0.000600 0.000600

By way of comparison, here’s what the k-fold cross-validated data would look like.

```
cved_eem <-
vfold_cv(eem_for_roll, v = 5)
assessment(cved_eem$splits[[1]]) %>%
head()
```

```
# A tibble: 6 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
```
1 2013-01-04 EEM 0.00200 0.0064 0.0011 0.0056 -0.0043 0.0036
2 2013-01-14 EEM 0.00426 -0.000600 0.0002 0.0013 -0.001 0.0011
3 2013-01-31 EEM 0.00204 -0.0021 0.0038 0.000300 -0.0007 -0.0001
4 2013-02-04 EEM -0.0133 -0.0107 0.0059 -0.0027 0.0023 0.0004
5 2013-02-05 EEM 0.00114 0.0034 -0.0054 -0.0002 0.0008 -0.002
6 2013-02-06 EEM -0.00114 0.0019 0.0037 0.0022 -0.0018 0.0028

Notice how the first date is not necessarily the first date in our data. In fact, if you run that code chunk a few times, the first date will be randomly selected. For me, it varied between January 2nd and January 15th.

Back to our `rolling_origin`

data, we know that split 1 begins on day 1 and ends on day 50:

```
analysis(roll_eem_sliding$splits[[1]]) %>%
slice(c(1,50))
```

```
# A tibble: 2 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
```
1 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-03-14 EEM 0.00418 0.0078 0.0021 0.0016 -0.002 0.0026

And we know split 2 begins on day 2 and ends on day 51:

```
analysis(roll_eem_sliding$splits[[2]]) %>%
slice(c(1,50))
```

```
# A tibble: 2 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
```
1 2013-01-03 EEM -0.00710 -0.0021 0.00120 0.000600 0.0008 0.0013
2 2013-03-15 EEM -0.00908 0.0025 0.006 -0.0001 0.0018 -0.0001

Now, we can start using our data to fit and then assess our model. The code flow is very similar to our previous post, but I’ll go ahead and run through it anyway.

First, let’s create a function that takes a split as an argument, fits a 3-factor Fama French model and calculates the root mean squared error. We will use the `rmse()`

function from `yardstick`

to measure our model. RMSE stands for root mean-squared error. It’s the sum of the squared differences between our fitted values and the actual values in the `assessment`

data. A lower RMSE is better!

```
library(yardstick)
ff_three_rmse <- function(split){
analysis_set <- analysis(split)
ff_model <- lm(returns ~ MKT + SMB + HML , data = analysis_set)
holdout <- assessment(split)
rmse <-
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
```

Let’s pass our one split object `one_eem_split`

to the function.

```
ff_three_rmse(one_eem_split) %>%
print()
```

`[1] 0.005311276`

Now, we can use `mutate()`

and `map_dbl()`

to pass all of our splits held in `roll_eem_sliding`

to the function. This will return an `rmse`

for this model when applied to each of our rolling origin splits. This takes a few seconds to run.

```
rmse_one_model <-
roll_eem_sliding %>%
mutate(model = map_dbl(
.x = splits,
.f = ~(ff_three_rmse(.x))))
rmse_one_model %>%
head()
```

```
# A tibble: 6 x 3
splits id model
*
```
1 Slice0001 0.00531
2 Slice0002 0.00621
3 Slice0003 0.00642
4 Slice0004 0.00647
5 Slice0005 0.00654
6 Slice0006 0.00649

This is the same as we did last time. Now, let’s get functional. We will define three models that we wish to test with via our rolling origin splits and RMSE, then pass those models to one function.

First, we use `as.formula()`

to define our three models.

```
mod_form_1 <- as.formula(returns ~ MKT)
mod_form_2 <- as.formula(returns ~ MKT + SMB + HML)
mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)
```

We write one function that takes `split`

as an argument, same as above, but also takes `formula`

as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to `map`

so I’ll append `_flex`

to the name of this function.

```
ff_rolling_flex <- function(split, formula) {
split_for_data <- analysis(split)
ff_model <- lm(formula, data = split_for_data)
holdout <- assessment(split)
rmse <-
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
rmse_model_1_flex <-
roll_eem_sliding %>%
mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits,
ff_rolling_flex,
formula = mod_form_1))
rmse_model_1_flex %>%
head()
```

```
# A tibble: 6 x 3
splits id model_1_rmse
*
```
1 Slice0001 0.00601
2 Slice0002 0.00548
3 Slice0003 0.00547
4 Slice0004 0.00556
5 Slice0005 0.00544
6 Slice0006 0.00539

Again same as last time, we can run all of our models using the `mutate()`

and `map_dbl`

combination. Warning, this one takes a bit of time and compute to run – proceed with caution if you’re on a desktop.

```
rmse_3_models <-
roll_eem_sliding %>%
mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits,
ff_rolling_flex,
formula = mod_form_1),
model_2_rmse = map_dbl(roll_eem_sliding$splits,
ff_rolling_flex,
formula = mod_form_2),
model_3_rmse = map_dbl(roll_eem_sliding$splits,
ff_rolling_flex,
formula = mod_form_3))
rmse_3_models %>%
head()
```

```
# A tibble: 6 x 5
splits id model_1_rmse model_2_rmse model_3_rmse
*
```
1 Slice0001 0.00601 0.00531 0.00496
2 Slice0002 0.00548 0.00621 0.00596
3 Slice0003 0.00547 0.00642 0.00617
4 Slice0004 0.00556 0.00647 0.00615
5 Slice0005 0.00544 0.00654 0.00613
6 Slice0006 0.00539 0.00649 0.00600

Alright, we have our RMSE, from each of our 3 models, as applied to each of our splits.

Thus far, our substantive flow is very similar to our k-fold cross-validation work. But now, we can take advantage of the time-aware nature of our splits.

Let’s visualize how our RMSE has changed over different time-aware splits, for our various models. Remember, we know the exact start and end date for our `analysis`

and `assessment`

sets, so we can extract the date of, say the first observation in the `assessment`

data and assign it to the split. We can consider this the date of that model run.

First, let’s create a function to extract the first date of each of our `assessment`

sets.

```
get_start_date <- function(x)
min(assessment(x)$date)
```

Here’s how that works on our one split object.

`get_start_date(one_eem_split)`

`[1] "2013-03-15"`

That’s the first date in the assessment data:

```
assessment(one_eem_split) %>%
select(date) %>%
slice(1)
```

```
# A tibble: 1 x 2
# Groups: asset [1]
asset date
```
1 EEM 2013-03-15

We want to add a column to our `results_3`

object called `start_date`

. We’ll use our usual `mutate()`

and then `map()`

flow to apply the `get_start_date()`

function to each of our splits, but we’ll need to pipe the result to `reduce(c)`

to coerce the result to a date. `map()`

returns a list by default and we want a vector of dates.

```
rmse_3_models_with_start_date <-
rmse_3_models %>%
mutate(start_date = map(roll_eem_sliding$splits, get_start_date) %>% reduce(c)) %>%
select(start_date, everything())
rmse_3_models_with_start_date %>%
head()
```

```
# A tibble: 6 x 6
start_date splits id model_1_rmse model_2_rmse model_3_rmse
*
```
1 2013-03-15 Slice… 0.00601 0.00531 0.00496
2 2013-03-18 Slice… 0.00548 0.00621 0.00596
3 2013-03-19 Slice… 0.00547 0.00642 0.00617
4 2013-03-20 Slice… 0.00556 0.00647 0.00615
5 2013-03-21 Slice… 0.00544 0.00654 0.00613
6 2013-03-22 Slice… 0.00539 0.00649 0.00600

We can head to `ggplot`

for some visualizing. I’d like to plot all of my RMSE’s in different colors and the best way to do that is to `gather()`

this data to tidy format, with a column called `model`

and a column called `value`

. It’s necessary to coerce to a data frame first, using `as.data.frame()`

.

```
rmse_3_models_with_start_date %>%
as.data.frame() %>%
select(-splits, -id) %>%
gather(model, value, -start_date) %>%
head()
```

```
start_date model value
1 2013-03-15 model_1_rmse 0.006011868
2 2013-03-18 model_1_rmse 0.005483091
3 2013-03-19 model_1_rmse 0.005470834
4 2013-03-20 model_1_rmse 0.005557170
5 2013-03-21 model_1_rmse 0.005439921
6 2013-03-22 model_1_rmse 0.005391862
```

Next, we can use some of our familiar `ggplot`

methods to plot our RMSEs over time, and see if we notice this model degrading or improving in different periods.

```
rmse_3_models_with_start_date %>%
as.data.frame() %>%
select(-splits, -id) %>%
gather(model, value, -start_date) %>%
ggplot(aes(x = start_date, y = value, color = model)) +
geom_point() +
facet_wrap(~model, nrow = 2) +
scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
```

All of the models have a jump in RMSE, meaning they performed worse, around the end of 2017. We aren’t focused on the theory of these models today but if we were, this would be a good place to start investigating. This is just the beginning of our exploration of `rolling_origin`

, but I love how it opens the door for ways to think about visualizing our models.

And finally, those same public service announcements from last time are still live, so I’ll mention them one more time.

Thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).

Applications are open for the Battlefin alternative data contest and RStudio is one of the tools you can use to analyze the data. Check it out here. They’ll announce 25 finalists in January who will get to compete for a cash prize and connect with some quant hedge funds. Go get’em!

A special thanks to Bruce Fox who suggested we might want to expand on the Monte Carlo simulation in the book to take account of different distributions implied by historical returns and different market regimes that might arise. Today’s rolling origin framework will also lay the foundation, I hope, for implementing some of Bruce’s ideas in January.

Thanks for reading and see you next time.

**leave a comment**for the author, please follow the link and comment on their blog:

**R Views**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...