Today we will continue our work on Fama French factor models, but more as a vehicle to explore some of the awesome stuff happening in the world of tidy models. For new readers who want get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, my most recent previous post here where we covered managing many models, and if you’re into Shiny, this flexdashboard.
Our goal today is to explore kfold crossvalidation via the rsample
package, and a bit of model evaluation via the yardstick
package. We started the model evaluation theme last time when we used tidy()
, glance()
and augment()
from the broom
package. In this post, we will use the rmse()
function from yardstick
, but our main focus will be on the vfold_cv()
function from rsample
. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We’ll stay in the world of explanatory models via linear regression world for a few reasons.
First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn about rsample
whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in rsample
and yardstick
. Second, factor models are important in finance, despite relying on good old linear regression. We won’t regret time spent on factor models, and we might even find creative new ways to deploy or visualize them.
The plan for today is take the same models that we ran in the last post, only this use kfold cross validation and bootstrapping to try to assess the quality of those models.
For that reason, we’ll be working with the same data as we did previously. I won’t go through the logic again, but in short, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the five Fama French factor data and join it to our five ETF returns data. Here’s the code to make that happen:
library(tidyverse)
library(broom)
library(tidyquant)
library(timetk)
symbols < c("SPY", "EFA", "IJS", "EEM", "AGG")
# The prices object will hold our daily price data.
prices <
getSymbols(symbols, src = 'yahoo',
from = "20121231",
to = "20171231",
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<`(symbols)
asset_returns_long <
prices %>%
tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
gather(asset, returns, date) %>%
group_by(asset) %>%
mutate(returns = (log(returns)  log(lag(returns)))) %>%
na.omit()
factors_data_address <
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"
factors_csv_name < "Global_5_Factors_Daily.csv"
temp < tempfile()
download.file(
# location of file to be downloaded
factors_data_address,
# where we want R to store that file
temp,
quiet = TRUE)
Global_5_Factors <
read_csv(unz(temp, factors_csv_name), skip = 6 ) %>%
rename(date = X1, MKT = `MktRF`) %>%
mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>%
mutate_if(is.numeric, funs(. / 100)) %>%
select(RF)
data_joined_tidy <
asset_returns_long %>%
left_join(Global_5_Factors, by = "date") %>%
na.omit()
After running that code, we have an object called data_joined_tidy
. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF rows.
data_joined_tidy %>%
slice(1)
# A tibble: 5 x 8
# Groups: asset [5]
date asset returns MKT SMB HML RMW CMA
1 20130102 AGG 0.00117 0.0199 0.0043 0.0028 0.0028 0.0023
2 20130102 EEM 0.0194 0.0199 0.0043 0.0028 0.0028 0.0023
3 20130102 EFA 0.0154 0.0199 0.0043 0.0028 0.0028 0.0023
4 20130102 IJS 0.0271 0.0199 0.0043 0.0028 0.0028 0.0023
5 20130102 SPY 0.0253 0.0199 0.0043 0.0028 0.0028 0.0023
Let’s work with just one ETF for today and use filter(asset == "AGG")
to shrink our data down to just that ETF.
agg_ff_data <
data_joined_tidy %>%
filter(asset == "AGG")
Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted rsquared values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using kfold cross validation. That’s a pretty jargonheavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, crossvalidation
. Instead of running our model on the entire data set – all the daily returns of AGG – we’ll run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and the testing sets, though rsample
calls them the analysis
and assessment
sets. We validate the model results by applying them to the assessment
data and seeing how the model performed.
The kfold
bit refers to the fact that we’re not just dividing our data into training and testing subsets, we’re actually going to divide it into a bunch of groups, a k
number of groups, or a k
number of folds
. One of those folds will be used as the validation set; the model will be fit on the other k  1
sets, and then tested on the validation set. We’re doing this with a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data (we’ll get there in 2019).^{1}
If you’re like me, it will take a bit of tinkering to really grasp kfold cross validation, but rsample
as a great function for dividing our data into kfolds. If we wish to use five folds (the state of the art seems to be either five or ten folds), we call the vfold_cv()
function, pass it our data object agg_ff_data
, and set v = 5
.
library(rsample)
library(yardstick)
set.seed(752)
cved_ff<
vfold_cv(agg_ff_data, v = 5)
cved_ff
# 5fold crossvalidation
# A tibble: 5 x 2
splits id
1 Fold1
2 Fold2
3 Fold3
4 Fold4
5 Fold5
We have an object called cved_ff
, with a column called splits
and a column called id
. Let’s peek at the first split.
cved_ff$splits[[1]]
<1007/252/1259>
Three numbers. The first, 1007, is telling us how many observations are in the analysis
. Since we have five folds, we should have 80% (or 4/5) of our data in the analysis
set. The second number, 252, is telling us how many observations are in the assessment
, which is 20% of our original data. The third number, 1259, is the total number of observations in our original data.
Next, we want to apply a model to the analysis
set of this kfolded data and test the results on the assessment
set. Let’s start with one factor and run a simple linear model, lm(returns ~ MKT)
.
We want to run it on analysis(cved_ff$splits[[1]])
– the analysis set of out first split.
ff_model_test < lm(returns ~ MKT, data = analysis(cved_ff$splits[[1]]))
ff_model_test
Call:
lm(formula = returns ~ MKT, data = analysis(cved_ff$splits[[1]]))
Coefficients:
(Intercept) MKT
0.0001025 0.0265516
Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add that data to the original set. We’ll use augment()
for that task, and pass it assessment(cved_ff$splits[[1]])
ff_model_test %>%
augment(newdata = assessment(cved_ff$splits[[1]])) %>%
head() %>%
select(returns, .fitted)
returns .fitted
1 0.0009021065 1.183819e04
2 0.0011726989 4.934779e05
3 0.0010815505 1.157267e04
4 0.0024385815 7.544460e05
5 0.0021715702 8.341007e05
6 0.0028159467 3.865527e04
We just added our fitted values to the assessment
data, the subset of the data on which the model was not fit. How well did our model do when compare the fitted values to the data in the held out set?
We can use the rmse()
function from yardstick
to measure our model. RMSE stands for root meansquared error. It’s the sum of the squared differences between our fitted values and the actual values in the assessment
data. A lower RMSE is better!
ff_model_test %>%
augment(newdata = assessment(cved_ff$splits[[1]])) %>%
rmse(returns, .fitted)
# A tibble: 1 x 3
.metric .estimator .estimate
1 rmse standard 0.00208
Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a split
, and we’re going to use pull()
so we can extract the raw number, instead of the entire tibble
result.
model_one < function(split) {
split_for_model < analysis(split)
ff_model < lm(returns ~ MKT, data = split_for_model)
holdout < assessment(split)
rmse < ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
Now we pass it our first split.
model_one(cved_ff$splits[[1]]) %>%
head()
[1] 0.002080324
Now we want to apply that function to each of our five folds that are stored in agg_cved_ff
. We do that with a combination of mutate()
and map_dbl()
. We use map_dbl()
instead of map
because we are returning a number here and there’s not a good reason to store that number in a list column.
cved_ff %>%
mutate(rmse = map_dbl(cved_ff$splits, model_one))
# 5fold crossvalidation
# A tibble: 5 x 3
splits id rmse
*
1 Fold1 0.00208
2 Fold2 0.00189
3 Fold3 0.00201
4 Fold4 0.00224
5 Fold5 0.00190
OK, we have five RMSE’s since we ran the model on five separate analysis
fold sets and tested on five separate assessment
fold sets. Let’s find the average RMSE by taking the mean()
of the rmse
column. That can help reduce noisiness that resulted from our random creation of those five folds.
cved_ff %>%
mutate(rmse = map_dbl(cved_ff$splits, model_one)) %>%
summarise(mean_rse = mean(rmse))
# 5fold crossvalidation
# A tibble: 1 x 1
mean_rse
1 0.00202
We now have the mean RMSE after running on our model, lm(returns ~ MKT)
, on all five of our folds.
That process for finding the mean RMSE can be applied other models, as well. Let’s suppose we wish to find the mean RMSE for two other models: lm(returns ~ MKT + SMB + HML)
, the Fama French threefactor model, and lm(returns ~ MKT + SMB + HML + RMW + CMA
, the Fama French fivefactor model. By comparing the mean RMSE’s, we can evaluate which model explained the returns of AGG better. Since we’re just adding more and more factors, the models can be expected to get more and more accurate but again, we are exploring the rsample
machinery and creating a template where we can pop in whatever models we wish to compare.
First, let’s create two new functions, that follow the exact same code pattern as above but house the threefactor and fivefactor models.
model_two < function(split) {
split_for_model < analysis(split)
ff_model < lm(returns ~ MKT + SMB + HML, data = split_for_model)
holdout < assessment(split)
rmse <
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
model_three < function(split) {
split_for_model < analysis(split)
ff_model < lm(returns ~ MKT + SMB + HML + RMW + CMA, data = split_for_model)
holdout < assessment(split)
rmse <
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
Now we pass those three models to the same mutate()
with map_dbl()
flow that we used with just one model. The result will be three new columns of RMSE’s, one for each of our three models applied to our five folds.
cved_ff %>%
mutate(
rmse_model_1 = map_dbl(
splits,
model_one),
rmse_model_2 = map_dbl(
splits,
model_two),
rmse_model_3 = map_dbl(
splits,
model_three))
# 5fold crossvalidation
# A tibble: 5 x 5
splits id rmse_model_1 rmse_model_2 rmse_model_3
*
1 Fold1 0.00208 0.00211 0.00201
2 Fold2 0.00189 0.00184 0.00178
3 Fold3 0.00201 0.00195 0.00194
4 Fold4 0.00224 0.00221 0.00213
5 Fold5 0.00190 0.00183 0.00177
We can also find the mean RMSE for each model.
cved_ff %>%
mutate(
rmse_model_1 = map_dbl(
splits,
model_one),
rmse_model_2 = map_dbl(
splits,
model_two),
rmse_model_3 = map_dbl(
splits,
model_three)) %>%
summarise(mean_rmse_model_1 = mean(rmse_model_1),
mean_rmse_model_2 = mean(rmse_model_2),
mean_rmse_model_3 = mean(rmse_model_3))
# 5fold crossvalidation
# A tibble: 1 x 3
mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
1 0.00202 0.00199 0.00192
That code flow worked just fine, but we had to repeat ourselves when creating the functions for each model. Let’s toggle to a flow where we define three models – the ones that we wish to test with via crossvalidation and RMSE – then pass those models to one function.
First, we use as.formula()
to define our three models.
mod_form_1 < as.formula(returns ~ MKT)
mod_form_2 < as.formula(returns ~ MKT + SMB + HML)
mod_form_3 < as.formula(returns ~ MKT + SMB + HML + RMW + CMA)
Now we write one function that takes split
as an argument, same as above, but also takes formula
as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map
, so I’ll append _flex
to the name of this function.
ff_rmse_models_flex < function(split, formula) {
split_for_data < analysis(split)
ff_model < lm(formula, data = split_for_data)
holdout < assessment(split)
rmse <
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
Now we use the same code flow as before, except we call map_dbl()
, pass it our cved_ff$splits
object, our new flex
function called ff_rmse_models_flex()
, and the model we wish to pass as the formula
argument. First we pass it mod_form_1
.
cved_ff %>%
mutate(rmse_model_1 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_1))
# 5fold crossvalidation
# A tibble: 5 x 3
splits id rmse_model_1
*
1 Fold1 0.00208
2 Fold2 0.00189
3 Fold3 0.00201
4 Fold4 0.00224
5 Fold5 0.00190
Now let’s pass it all three models and find the mean RMSE.
cved_ff %>%
mutate(rmse_model_1 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_1),
rmse_model_2 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_2),
rmse_model_3 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_3)) %>%
summarise(mean_rmse_model_1 = mean(rmse_model_1),
mean_rmse_model_2 = mean(rmse_model_2),
mean_rmse_model_3 = mean(rmse_model_3))
# 5fold crossvalidation
# A tibble: 1 x 3
mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
1 0.00202 0.00199 0.00192
Alright, that code flow seems a bit more flexible than our original method of writing a function to assess each model. We didn’t do much hard thinking about functional form here, but hopefully this provides a template where you could assess more nuanced models. We’ll get into bootstrapping and time series work next week, then head to Shiny to ring in the New Year!
And, finally, a couple of public service announcements.
First, thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).
Second, applications are open for the Battlefin alternative data contest, and RStudio is one of the tools you can use to analyze the data. Check it out here. In January, they’ll announce 25 finalists who will get to compete for a cash prize and connect with some quant hedge funds. Go get ‘em!
Thanks for reading and see you next time.

For more on crossvalidation, see “An Introduction to Statistical Learning”, chapter 5. Available online here: http://wwwbcf.usc.edu/~gareth/ISL/.↩
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...