Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In two previous posts, we calculated and then visualized the CAPM beta of a portfolio by fitting a simple linear model.

Today, we move beyond CAPM’s simple linear regression and explore the Fama French (FF) multi-factor model of equity risk/return. For more background, have a look at the original article published in The Journal Financial Economics, Common risk factors in the returns on stocks and bonds.

The FF model extends CAPM by regressing portfolio returns on several variables, in addition to market returns. From a general data science point of view, FF extends CAPM’s simple linear regression, where we had one independent variable, to a multiple linear regression, where we have numerous independent variables.

We are going to look at the FF 3-factor model, which tests the explanatory power of (1) market returns (same as CAPM), (2) firm size (small versus big) and (3) firm value (book to market ratio). The `firm value` factor is labeled as `HML` in FF, which stands for high-minus-low and refers to a firm’s book-to-market ratio. When we regress portfolio returns on the `HML` factor, we are investigating how much of the returns are the result of including stocks with a high book-to-market ratio (sometimes called the `value premium`, because high book-to-market stocks are called value stocks).

A large portion of this post covers importing data from the FF website and wrangling it for use with our portfolio returns. We will see that wrangling the data is conceptually easy to understand but practically time-consuming to implement. However, mashing together data from disparate sources is a necessary skill for anyone in industry that has data streams from different vendors and wants to get creative about how to use them. Once the data is wrangled, fitting the model is not time-consuming.

Today, we will be working with our usual portfolio consisting of:

```+ SPY (S&P500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%```

Before we can calculate beta for that portfolio, we need to find portfolio monthly returns, which was covered in my previous post, Introduction to Portfolio Returns. I won’t go through the logic again, but the code is here:

```library(tidyquant)
library(tidyverse)
library(timetk)
library(broom)
library(glue)

symbols <- c("SPY","EFA", "IJS", "EEM","AGG")

prices <-
getSymbols(symbols, src = 'yahoo',
from = "2012-12-31",
to = "2017-12-31",
auto.assign = TRUE, warnings = FALSE) %>%
reduce(merge) %>%
`colnames<-`(symbols)

w <- c(0.25, 0.25, 0.20, 0.20, 0.10)

asset_returns_long <-
prices %>%
to.monthly(indexAt = "lastof", OHLC = FALSE) %>%
tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
gather(asset, returns, -date) %>%
group_by(asset) %>%
mutate(returns = (log(returns) - log(lag(returns)))) %>%
na.omit()

portfolio_returns_tq_rebalanced_monthly <-
asset_returns_long %>%
tq_portfolio(assets_col  = asset,
returns_col = returns,
weights     = w,
col_rename  = "returns",
rebalance_on = "months")```

We will be working with one object of portfolio returns:

`+ portfolio_returns_tq_rebalanced_monthly`

Let’s get to it.

### Importing and Wrangling the Fama French Factors

Our first task is to get the FF data and, fortunately, FF make their factor data available on the internet. We will document each step for importing and cleaning this data, to an extent that might be overkill. Frustrating for us now, but a time-saver later when we need to update this model or extend to the 5-factor case.

Have a look at the FF website. The data are packaged as zip files, so we will need to do a bit more than call `read_csv()`. Let’s use the `tempfile()` function from base R to create a variable called `temp`. This is where we will put the zipped file.

`temp <- tempfile()`

R has created a temporary file called `temp` that will be cleaned up when we exit this session. Download 3-factor zip with this link. We want to pass that to `download.file()` and store the result in `temp`.

First, though, we will break that string into three pieces: `base`, `factor` and `format` – this is not necessary for today’s task, but it will come in handy if we want to build a Shiny application to let a user choose a factor from the FF website, or if we just want to re-run this analysis with a different set of FF factors. We will then `glue()` those together and save the string as `full_url`.

```base <-
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"

factor <-
"Global_3_Factors"

format<-
"_CSV.zip"

full_url <-
glue(base,
factor,
format,
sep ="")```

Now we pass `full_url` to `download.file()`.

```download.file(
full_url,
temp,
quiet = TRUE)```

Finally, we can read the csv file using `read_csv()` after unzipping that data with the `unz()` function.

```Global_3_Factors <-
"Global_3_Factors.csv"))

```## # A tibble: 6 x 1
##   `This file was created using the 201802 Bloomberg database.`
##   <chr>
## 1 Missing data are indicated by -99.99.
## 2 <NA>
## 3 199007
## 4 199008
## 5 199009
## 6 199010```

We have imported the dataset, but we do not see any factors, just a column with weirdly formatted dates

When this occurs, it often can be fixed by skipping a certain number of rows that contain metadata. Have a look at what happens if we skip 6 rows.

```Global_3_Factors <-
"Global_3_Factors.csv"),
skip = 6)

```## # A tibble: 6 x 5
##   X1     `Mkt-RF` SMB   HML   RF
##   <chr>  <chr>    <chr> <chr> <chr>
## 1 199007 0.86     0.77  -0.25 0.68
## 2 199008 -10.82   -1.60 0.60  0.66
## 3 199009 -11.98   1.23  0.81  0.60
## 4 199010 9.57     -7.39 -4.25 0.68
## 5 199011 -3.86    1.22  1.14  0.57
## 6 199012 1.10     -0.79 -1.60 0.60```

This is what were were expecting, 5 columns: one called `X1` that holds the weirdly formatted dates, then `Mkt-Rf` for the market returns above the risk-free rate, `SMB` for the size factor, `HML` for the value factor, and `RF` for the risk-free rate.

However, the data have been coerced to a character format – look at the class of each column:

`map(Global_3_Factors, class)`

```## \$X1
##  "character"
##
## \$`Mkt-RF`
##  "character"
##
## \$SMB
##  "character"
##
## \$HML
##  "character"
##
## \$RF
##  "character"```

We have two options for coercing those columns to the right format. First, we can do so upon import, by supplying the argument `col_types = cols(col_name = col_double(),...` for each numeric column.

```Global_3_Factors <-
"Global_3_Factors.csv"),
skip = 6,
col_types = cols(
`Mkt-RF` = col_double(),
SMB = col_double(),
HML = col_double(),
RF = col_double()))

```## # A tibble: 6 x 5
##   X1     `Mkt-RF`    SMB    HML    RF
##   <chr>     <dbl>  <dbl>  <dbl> <dbl>
## 1 199007    0.860  0.770 -0.250 0.680
## 2 199008  -10.8   -1.60   0.600 0.660
## 3 199009  -12.0    1.23   0.810 0.600
## 4 199010    9.57  -7.39  -4.25  0.680
## 5 199011   -3.86   1.22   1.14  0.570
## 6 199012    1.10  -0.790 -1.60  0.600```

That works well, but it’s specific to the FF 3-factor set with those specific column names. If we imported a different FF factor set, we would need to specify different column names.

As an alternate approach, the code chunk below converts the columns to numeric after import, but is more general. It can be applied to other FF factor collections.

To do this, we rename the `X1` column to `date`, and then use the `dplyr` verb `mutate_at(vars(-date), as.numeric)` to change our column formats to numeric. The `vars()` function operates like the `select()` function in that we can tell it to operate on all columns except the `date` column by putting a negative sign in front of `date`. This column coercion flow is more flexible in that it would work for different FF factor sets.

```Global_3_Factors <-
"Global_3_Factors.csv"),
skip = 6) %>%
rename(date = X1) %>%
mutate_at(vars(-date), as.numeric)

```## # A tibble: 6 x 5
##   date   `Mkt-RF`    SMB    HML    RF
##   <chr>     <dbl>  <dbl>  <dbl> <dbl>
## 1 199007    0.860  0.770 -0.250 0.680
## 2 199008  -10.8   -1.60   0.600 0.660
## 3 199009  -12.0    1.23   0.810 0.600
## 4 199010    9.57  -7.39  -4.25  0.680
## 5 199011   -3.86   1.22   1.14  0.570
## 6 199012    1.10  -0.790 -1.60  0.600```

We now have numeric data for our factors and the date column has a better label, but the wrong format.

We can use the `lubridate` package to parse that date string into a nicer date format. We will use the `parse_date_time()` function, and call the `ymd()` function to make sure the end result is in a date format. Again, when working with data from a new source, the date and, indeed, any column can come in many formats.

```Global_3_Factors <-
skip = 6) %>%
rename(date = X1) %>%
mutate_at(vars(-date), as.numeric) %>%
mutate(date =
ymd(parse_date_time(date, "%Y%m")))

```## # A tibble: 6 x 5
##   date       `Mkt-RF`    SMB    HML    RF
##   <date>        <dbl>  <dbl>  <dbl> <dbl>
## 1 1990-07-01    0.860  0.770 -0.250 0.680
## 2 1990-08-01  -10.8   -1.60   0.600 0.660
## 3 1990-09-01  -12.0    1.23   0.810 0.600
## 4 1990-10-01    9.57  -7.39  -4.25  0.680
## 5 1990-11-01   -3.86   1.22   1.14  0.570
## 6 1990-12-01    1.10  -0.790 -1.60  0.600```

The date format looks good, and that matters because we want to trim the factor data that the FF dates match our portfolio dates. However, notice that FF uses the first of the month and our portfolio returns use the last of the month. Again, it’s `lubridate` to the rescue with the `rollback()` function. This will roll monthly dates back to the last day of the previous month. The first date in our FF data is “1990-07-01”. Let’s roll it back.

```Global_3_Factors %>%
select(date) %>%
mutate(date = lubridate::rollback(date)) %>%

```## # A tibble: 1 x 1
##   date
##   <date>
## 1 1990-06-30```

If we want to reset our dates to the last of the month, we need to add one first, then rollback.

```Global_3_Factors %>%
select(date) %>%
mutate(date = lubridate::rollback(date + months(1))) %>%

```## # A tibble: 1 x 1
##   date
##   <date>
## 1 1990-07-31```

There are other ways we could have gotten around this issue – most notably, way back in the beginning, we could have indexed our portfolio returns to `indexAt = firstof` – but it was a good chance to introduce the `rollback()` function, and we will not always have that option. Sometimes two data sets are thrown at us and we have to wrangle them from there.

Finally, we want only the FF factor data that aligns with our portfolio data, so we `filter()` by the `first()` and `last()` date in our portfolio returns object.

```Global_3_Factors <-
skip = 6) %>%
rename(date = X1) %>%
mutate_at(vars(-date), as.numeric) %>%
mutate(date =
rollback(ymd(parse_date_time(date, "%Y%m") + months(1)))) %>%
filter(date >=
first(portfolio_returns_tq_rebalanced_monthly\$date) & date <=
last(portfolio_returns_tq_rebalanced_monthly\$date))

```## # A tibble: 3 x 5
##   date       `Mkt-RF`   SMB    HML    RF
##   <date>        <dbl> <dbl>  <dbl> <dbl>
## 1 2013-01-31    5.46  0.140  2.01     0.
## 2 2013-02-28    0.100 0.330 -0.780    0.
## 3 2013-03-31    2.29  0.830 -2.03     0.```

`tail(Global_3_Factors, 3)`

```## # A tibble: 3 x 5
##   date       `Mkt-RF`    SMB    HML     RF
##   <date>        <dbl>  <dbl>  <dbl>  <dbl>
## 1 2017-10-31     1.80 -0.850 -0.950 0.0900
## 2 2017-11-30     1.93 -0.680 -0.260 0.0800
## 3 2017-12-31     1.38  0.940  0.140 0.0900```

All that work enables us to merge these data objects together with `left_join(...by = "date")`. We also convert the FF data to decimal and create a new column called `R_excess` to hold our returns above the risk-free rate.

```ff_portfolio_returns <-
portfolio_returns_tq_rebalanced_monthly %>%
left_join(Global_3_Factors, by = "date") %>%
mutate(MKT_RF = Global_3_Factors\$`Mkt-RF`/100,
SMB = Global_3_Factors\$SMB/100,
HML = Global_3_Factors\$HML/100,
RF = Global_3_Factors\$RF/100,
R_excess = round(returns - RF, 4))

```## # A tibble: 4 x 8
##   date         returns `Mkt-RF`      SMB      HML    RF  MKT_RF  R_excess
##   <date>         <dbl>    <dbl>    <dbl>    <dbl> <dbl>   <dbl>     <dbl>
## 1 2013-01-31  0.0308      5.46   0.00140  0.0201     0. 0.0546   0.0308
## 2 2013-02-28 -0.000870    0.100  0.00330 -0.00780    0. 0.00100 -0.000900
## 3 2013-03-31  0.0187      2.29   0.00830 -0.0203     0. 0.0229   0.0187
## 4 2013-04-30  0.0206      3.02  -0.0121   0.00960    0. 0.0302   0.0206```

We now we have one object with our portfolio returns and FF factors, and can proceed to the simplest part of our exercise from a coding perspective, and the only part that our bosses/colleagues/clients/investors will care about: the modeling and visualization.

Luckily, we can copy/paste the flow from our CAPM work, now that we have the data in a nice format. CAPM used simple linear regression, whereas FF uses multiple regression with many independent variables. Accordingly, our 3-factor FF equation is `lm(R_excess ~ MKT_RF + SMB + HML`.

We will make one addition to the CAPM code flow, which is to include the 95% confidence interval for our coefficients. We do that by setting `tidy(model, conf.int = T, conf.level = .95)`.

```ff_dplyr_byhand <-
ff_portfolio_returns %>%
do(model =
lm(R_excess ~ MKT_RF + SMB + HML,
data = .)) %>%
tidy(model, conf.int = T, conf.level = .95)

ff_dplyr_byhand %>%
mutate_if(is.numeric, funs(round(., 3))) %>%
select(-statistic)```

```##          term estimate std.error p.value conf.low conf.high
## 1 (Intercept)   -0.001     0.001   0.195   -0.004     0.001
## 2      MKT_RF    0.894     0.036   0.000    0.823     0.965
## 3         SMB    0.056     0.076   0.458   -0.095     0.208
## 4         HML    0.030     0.061   0.629   -0.092     0.151```

Our model object now contains a `conf.high` and `conf.low` column to hold our confidence interval min and max values.

We can pipe these results to `ggplot()` and create a scatter of coefficients with confidence intervals. I don’t want to plot the intercept so will filter it out of the code flow.

We add the confidence intervals with `geom_errorbar(aes(ymin = conf.low, ymax = conf.high))`.

```ff_dplyr_byhand %>%
mutate_if(is.numeric, funs(round(., 3))) %>%
filter(term != "(Intercept)") %>%
ggplot(aes(x = term, y = estimate, shape = term, color = term)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
labs(title = "FF 3-Factor Coefficients for Our Portfolio",
subtitle = "nothing in this post is investment advice",
x = "",
y = "coefficient",
caption = "data source: Fama French website and yahoo! Finance") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.caption  = element_text(hjust = 0))``` The results here are predictable because, as with CAPM, we are regressing a portfolio that contains the market on 3 factors, one of which is the market. Thus, the market factor dominates this model and the other two factors contain zero in their confidence bands.

Next time we will complicate this work by calculating FF factor coefficients for multiple funds/portfolios, examining rolling R squared’s and visualizing the results. See you then!