# Parsnipping Fama French

**R Views**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today, we will continue our exploration of developments in the world of tidy models, and we will stick with our usual Fama French modeling flow to do so. For new readers who want get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, and here where we covered managing many models. If you’re into Shiny, this flexdashboard might be of interest, as well.

Let’s get to it.

First, we need our data and, as usual, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the five Fama French factor data and join it to our five ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post:

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG") # The prices object will hold our daily price data. prices <- getSymbols(symbols, src = 'yahoo', from = "2012-12-31", to = "2017-12-31", auto.assign = TRUE, warnings = FALSE) %>% map(~Ad(get(.))) %>% reduce(merge) %>% `colnames<-`(symbols) asset_returns_long <- prices %>% tk_tbl(preserve_index = TRUE, rename_index = "date") %>% gather(asset, prices, -date) %>% group_by(asset) %>% mutate(daily_returns = (log(prices) - log(lag(prices)))) %>% na.omit() factors_data_address <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip" factors_csv_name <- "Global_5_Factors_Daily.csv" temp <- tempfile() download.file( # location of file to be downloaded factors_data_address, # where we want R to store that file temp, quiet = TRUE) Global_5_Factors <- read_csv(unz(temp, factors_csv_name), skip = 6 ) %>% rename(date = X1, MKT = `Mkt-RF`) %>% mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>% mutate_if(is.numeric, funs(. / 100)) %>% select(-RF) data_joined_tidy <- asset_returns_long %>% left_join(Global_5_Factors, by = "date") %>% na.omit()

For today, let’s work with just the `SPY`

data by filtering our data set by asset.

spy_2013_2017 <- data_joined_tidy %>% filter(asset == "SPY")

Next, we re-sample this five years’ worth of data into smaller subsets of training and testing sets. This is frequently done by k-fold cross validation (see here for an example), where random samples are taken from the data, but since we are working with time series, we will use a time-aware technique. The `rsample`

package has a function for exactly this purpose, the `rolling_origin()`

function. We covered this process extensively in this previous post. Here’s the code to make it happen.

rolling_origin_spy_2013_2017 <- rolling_origin( data = spy_2013_2017, initial = 100, assess = 1, cumulative = FALSE ) rolling_origin_spy_2013_2017 %>% dim()

[1] 1159 2

We now have a data object called `rolling_origin_spy_2013_2017`

that holds 1159 `splits`

of data. Each split consists of an analysis data set with 100 days of return and factor data, and an assessment data set with one day of return and factor data.

Now, we can start using that collection of data splits to fit a model on the assessment data, and then test our model on the assessment data. That means it’s time to introduce a relatively new addition to the R tool chain, the `parsnip`

package.

`parsnip`

is a unified model interface that allows us to create a model specification, set an analytic engine, and then fit a model. It’s a ‘unified’ interface in the sense that we can use the same scaffolding but insert different models, or different engines, or different modes. Let’s see how that works with linear regression.

Recall that in the previous post, we piped our data into a linear model like so:

analysis(rolling_origin_spy_2013_2017$splits[[1]]) %>% do(model = lm(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = .)) %>% tidy(model)

# A tibble: 6 x 6 # Groups: asset [1] asset term estimate std.error statistic p.value <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 SPY (Intercept) 0.000579 0.000338 1.71 8.98e- 2 2 SPY MKT 0.909 0.0739 12.3 2.79e-21 3 SPY SMB -0.495 0.112 -4.43 2.52e- 5 4 SPY HML -0.609 0.208 -2.92 4.38e- 3 5 SPY RMW -0.591 0.259 -2.28 2.47e- 2 6 SPY CMA -0.395 0.206 -1.92 5.81e- 2

Now, we will pipe into the `parsnip`

scaffolding, which will allow us to quickly change to a different model and specification further down in the code.

Since we are running a linear regression, we first create a specification with `linear_reg()`

, then set the engine with `set_engine("lm")`

, and finally fit the model with `fit(five_factor_model, data = one of our splits)`

lm_model <- linear_reg() %>% set_engine("lm") %>% fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis(rolling_origin_spy_2013_2017$splits[[1]])) lm_model

parsnip model object Call: stats::lm(formula = formula, data = data) Coefficients: (Intercept) MKT SMB HML RMW 0.0005794 0.9086303 -0.4951297 -0.6085088 -0.5910375 CMA -0.3954515

Now that we’ve fit the model on our test set, let’s see how well it predicted the test set. We can use the `predict()`

function and pass it the results of our `parnsip`

code flow, along with the `assessment`

split.

assessment(rolling_origin_spy_2013_2017$splits[[1]]) %>% select(returns) %>% bind_cols(predict(lm_model, new_data = assessment(rolling_origin_spy_2013_2017$splits[[1]])))

# A tibble: 1 x 3 # Groups: asset [1] asset returns .pred <chr> <dbl> <dbl> 1 SPY 148. 0.00737

That worked well, but now let’s head to a more complex model and use the `ranger`

package as an engine for a random forest analysis.

To set up the ranger random forest model in `parsnip`

, we first use `rand_forest(mode = "regression", mtry = 3, trees = 100)`

to create the specification, `set_engine("ranger")`

to set the engine as the `ranger`

package, and `fit(daily_returns ~ MKT + SMB + HML + RMW + CMA ~ , data = analysis(rolling_origin_spy_2013_2017$splits[[1]])`

to fit the five-factor Fama French model to the 100-day sample in our first split.

# Need to load the packages to be used as the random forest engine library(ranger) rand_forest(mode = "regression", mtry = 3, trees = 100) %>% set_engine("ranger") %>% fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis(rolling_origin_spy_2013_2017$splits[[1]]))

parsnip model object Ranger result Call: ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) Type: Regression Number of trees: 100 Sample size: 100 Number of independent variables: 5 Mtry: 3 Target node size: 5 Variable importance mode: none Splitrule: variance OOB prediction error (MSE): 1.514654e-05 R squared (OOB): 0.6880896

Notice that `ranger`

gives us an `OOB prediction error (MSE)`

value as part of its return. `parsnip`

returns to us what the underlying engine returns.

Now, let’s apply that random forest regression to all 1159 of our splits (recall that each split consists of 100 days of training data and one day of test data), so we can get an average RMSE. Warning: this will consume some resources on your machine and some time in your day.

To apply that model to our entire data set, we create a function that takes one split, passes it to our `parsnip`

enabled model, and then uses the `predict`

function to attempt to predict our `assessment`

split. The function also allows us to specify the number of trees and the number of variables randomly sampled at each tree split, which is set with the `mtry`

argument.

ranger_rf_regress <- function(mtry = 3, trees = 5, split){ analysis_set_rf <- analysis(split) model <- rand_forest(mtry = mtry, trees = trees) %>% set_engine("ranger") %>% fit(daily_returns ~ MKT + SMB + HML + RMW + CMA, data = analysis_set_rf) assessment_set_rf <- assessment(split) assessment_set_rf %>% select(date, daily_returns) %>% mutate(.pred = unlist(predict(model, new_data = assessment_set_rf))) %>% select(date, daily_returns, .pred) }

Now we want to pass it our object of 1159 splits, `rolling_origin_spy_2013_2017$splits`

, and we want the function to iterate over each split. For that we turn to `map_df()`

from the `purrr`

package, which allows us to iterate over the data object and return a data frame. `map_df()`

takes the data as an argument and our function as an argument.

ranger_results <- map_df(.x = rolling_origin_spy_2013_2017$splits, ~ranger_rf_regress(mtry = 3, trees = 100, split = .x))

Here are the results. We now have 1159 predictions.

ranger_results %>% head()

# A tibble: 6 x 4 # Groups: asset [1] asset date daily_returns .pred <chr> <date> <dbl> <dbl> 1 SPY 2013-05-28 0.00597 0.00583 2 SPY 2013-05-29 -0.00652 -0.00403 3 SPY 2013-05-30 0.00369 0.00658 4 SPY 2013-05-31 -0.0145 -0.0114 5 SPY 2013-06-03 0.00549 0.00119 6 SPY 2013-06-04 -0.00482 0.00202

Notice how the date of each prediction is included since we included it in the `select()`

call in our function. That will come in handy for charting later.

Now, we can use the `rmse()`

function from `yardstick`

to calculate the root mean-squared error each of our predictions (our test sets had only one observation in them because we were testing on one month, so the RMSE is not a complex calculation here, but it would be the same code pattern if we had a larger test set). We can then find the average RMSE by calling `summarise(avg_rmse = mean(.estimate))`

.

library(yardstick) ranger_results %>% group_by(date) %>% rmse(daily_returns, .pred) %>% summarise(avg_rmse = mean(.estimate))

# A tibble: 1 x 1 avg_rmse <dbl> 1 0.00253

We have the average RMSE; let’s see if the RMSE were stable over time, first with `ggplot`

.

ranger_results %>% group_by(date) %>% rmse(daily_returns, .pred) %>% ggplot(aes(x = date, y = .estimate)) + geom_point(color = "cornflowerblue") + labs(y = "rmse", x = "", title = "RMSE over time via Ranger RF")

And with `highcharter`

.

ranger_results %>% group_by(date) %>% rmse(daily_returns, .pred) %>% hchart(., hcaes(x = date, y = .estimate), type = "point") %>% hc_title(text = "RMSE over time via Ranger RF") %>% hc_yAxis(title = list(text = "RMSE"))

**leave a comment**for the author, please follow the link and comment on their blog:

**R Views**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.