Parsnipping Fama French

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today, we will continue our exploration of developments in the world of tidy models, and we will stick with our usual Fama French modeling flow to do so. For new readers who want get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, and here where we covered managing many models. If you’re into Shiny, this flexdashboard might be of interest, as well.

Let’s get to it.

First, we need our data and, as usual, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the five Fama French factor data and join it to our five ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post:

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <- 
             src = 'yahoo', 
             from = "2012-12-31",
             to = "2017-12-31",
             auto.assign = TRUE, 
             warnings = FALSE) %>% 
  map(~Ad(get(.))) %>% 
  reduce(merge) %>%

asset_returns_long <-  
  prices %>% 
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, prices, -date) %>% 
  group_by(asset) %>%  
  mutate(daily_returns = (log(prices) - log(lag(prices)))) %>% 

factors_data_address <- 

factors_csv_name <- "Global_5_Factors_Daily.csv"

temp <- tempfile()

  # location of file to be downloaded
  # where we want R to store that file
  quiet = TRUE)

Global_5_Factors <- 
  read_csv(unz(temp, factors_csv_name), skip = 6 ) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>%
  mutate_if(is.numeric, funs(. / 100)) %>% 

data_joined_tidy <- 
  asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>% 

For today, let’s work with just the SPY data by filtering our data set by asset.

spy_2013_2017 <- data_joined_tidy %>% 
  filter(asset == "SPY")

Next, we re-sample this five years’ worth of data into smaller subsets of training and testing sets. This is frequently done by k-fold cross validation (see here for an example), where random samples are taken from the data, but since we are working with time series, we will use a time-aware technique. The rsample package has a function for exactly this purpose, the rolling_origin() function. We covered this process extensively in this previous post. Here’s the code to make it happen.

rolling_origin_spy_2013_2017 <- 
  data       = spy_2013_2017,
  initial    = 100,
  assess     = 1,
  cumulative = FALSE

rolling_origin_spy_2013_2017 %>% 
[1] 1159    2

We now have a data object called rolling_origin_spy_2013_2017 that holds 1159 splits of data. Each split consists of an analysis data set with 100 days of return and factor data, and an assessment data set with one day of return and factor data.

Now, we can start using that collection of data splits to fit a model on the assessment data, and then test our model on the assessment data. That means it’s time to introduce a relatively new addition to the R tool chain, the parsnip package.

parsnip is a unified model interface that allows us to create a model specification, set an analytic engine, and then fit a model. It’s a ‘unified’ interface in the sense that we can use the same scaffolding but insert different models, or different engines, or different modes. Let’s see how that works with linear regression.

Recall that in the previous post, we piped our data into a linear model like so:

analysis(rolling_origin_spy_2013_2017$splits[[1]]) %>% 
do(model = lm(daily_returns ~ MKT + SMB + HML + RMW + CMA, 
      data = .)) %>% 
# A tibble: 6 x 6
# Groups:   asset [1]
  asset term         estimate std.error statistic  p.value
  <chr> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 SPY   (Intercept)  0.000579  0.000338      1.71 8.98e- 2
2 SPY   MKT          0.909     0.0739       12.3  2.79e-21
3 SPY   SMB         -0.495     0.112        -4.43 2.52e- 5
4 SPY   HML         -0.609     0.208        -2.92 4.38e- 3
5 SPY   RMW         -0.591     0.259        -2.28 2.47e- 2
6 SPY   CMA         -0.395     0.206        -1.92 5.81e- 2

Now, we will pipe into the parsnip scaffolding, which will allow us to quickly change to a different model and specification further down in the code.

Since we are running a linear regression, we first create a specification with linear_reg(), then set the engine with set_engine("lm"), and finally fit the model with fit(five_factor_model, data = one of our splits)

lm_model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, 
      data = analysis(rolling_origin_spy_2013_2017$splits[[1]]))

parsnip model object

stats::lm(formula = formula, data = data)

(Intercept)          MKT          SMB          HML          RMW  
  0.0005794    0.9086303   -0.4951297   -0.6085088   -0.5910375  

Now that we’ve fit the model on our test set, let’s see how well it predicted the test set. We can use the predict() function and pass it the results of our parnsip code flow, along with the assessment split.

assessment(rolling_origin_spy_2013_2017$splits[[1]]) %>% 
  select(returns) %>% 
        new_data = assessment(rolling_origin_spy_2013_2017$splits[[1]])))
# A tibble: 1 x 3
# Groups:   asset [1]
  asset returns   .pred
  <chr>   <dbl>   <dbl>
1 SPY      148. 0.00737

That worked well, but now let’s head to a more complex model and use the ranger package as an engine for a random forest analysis.

To set up the ranger random forest model in parsnip, we first use rand_forest(mode = "regression", mtry = 3, trees = 100) to create the specification, set_engine("ranger") to set the engine as the ranger package, and fit(daily_returns ~ MKT + SMB + HML + RMW + CMA ~ , data = analysis(rolling_origin_spy_2013_2017$splits[[1]]) to fit the five-factor Fama French model to the 100-day sample in our first split.

# Need to load the packages to be used as the random forest engine

rand_forest(mode = "regression", mtry = 3, trees = 100) %>%
  set_engine("ranger") %>%
  fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, 
      data = analysis(rolling_origin_spy_2013_2017$splits[[1]]))
parsnip model object

Ranger result

 ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~100,      num.threads = 1, verbose = FALSE, seed =^5,          1)) 

Type:                             Regression 
Number of trees:                  100 
Sample size:                      100 
Number of independent variables:  5 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1.514654e-05 
R squared (OOB):                  0.6880896 

Notice that ranger gives us an OOB prediction error (MSE) value as part of its return. parsnip returns to us what the underlying engine returns.

Now, let’s apply that random forest regression to all 1159 of our splits (recall that each split consists of 100 days of training data and one day of test data), so we can get an average RMSE. Warning: this will consume some resources on your machine and some time in your day.

To apply that model to our entire data set, we create a function that takes one split, passes it to our parsnip enabled model, and then uses the predict function to attempt to predict our assessment split. The function also allows us to specify the number of trees and the number of variables randomly sampled at each tree split, which is set with the mtry argument.

ranger_rf_regress <- function(mtry = 3, trees = 5, split){
    analysis_set_rf <- analysis(split)
    model <- 
      rand_forest(mtry = mtry, trees = trees) %>%
        set_engine("ranger") %>%
        fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis_set_rf)

    assessment_set_rf <- assessment(split)

    assessment_set_rf %>%
      select(date, daily_returns) %>%
      mutate(.pred = unlist(predict(model, new_data = assessment_set_rf))) %>% 
      select(date, daily_returns, .pred)

Now we want to pass it our object of 1159 splits, rolling_origin_spy_2013_2017$splits, and we want the function to iterate over each split. For that we turn to map_df() from the purrr package, which allows us to iterate over the data object and return a data frame. map_df() takes the data as an argument and our function as an argument.

ranger_results <- 
  map_df(.x = rolling_origin_spy_2013_2017$splits,
         ~ranger_rf_regress(mtry = 3, trees = 100, split = .x))

Here are the results. We now have 1159 predictions.

ranger_results %>% 
# A tibble: 6 x 4
# Groups:   asset [1]
  asset date       daily_returns    .pred
  <chr> <date>             <dbl>    <dbl>
1 SPY   2013-05-28       0.00597  0.00583
2 SPY   2013-05-29      -0.00652 -0.00403
3 SPY   2013-05-30       0.00369  0.00658
4 SPY   2013-05-31      -0.0145  -0.0114 
5 SPY   2013-06-03       0.00549  0.00119
6 SPY   2013-06-04      -0.00482  0.00202

Notice how the date of each prediction is included since we included it in the select() call in our function. That will come in handy for charting later.

Now, we can use the rmse() function from yardstick to calculate the root mean-squared error each of our predictions (our test sets had only one observation in them because we were testing on one month, so the RMSE is not a complex calculation here, but it would be the same code pattern if we had a larger test set). We can then find the average RMSE by calling summarise(avg_rmse = mean(.estimate)).


ranger_results %>%
  group_by(date) %>% 
  rmse(daily_returns, .pred) %>% 
  summarise(avg_rmse = mean(.estimate))
# A tibble: 1 x 1
1  0.00253

We have the average RMSE; let’s see if the RMSE were stable over time, first with ggplot.

ranger_results %>%
  group_by(date) %>% 
  rmse(daily_returns, .pred) %>% 
  ggplot(aes(x = date, y = .estimate)) +
  geom_point(color = "cornflowerblue") +
  labs(y = "rmse", x = "", title = "RMSE over time via Ranger RF")

And with highcharter.

ranger_results %>%
  group_by(date) %>% 
  rmse(daily_returns, .pred) %>% 
  hchart(., hcaes(x = date, y = .estimate),
         type = "point") %>% 
  hc_title(text = "RMSE over time via Ranger RF") %>% 
  hc_yAxis(title = list(text = "RMSE"))

To leave a comment for the author, please follow the link and comment on their blog: R Views. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)