# Outlier Days with R and Python

**R Views**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Welcome to another installment of Reproducible Finance. Today’s post will be topical as we look at the historical behavior of the stock market after days of extreme returns and it will also explore one of my favorite coding themes of 2020 – the power of RMarkdown as an R/Python collaboration tool.

This post originated when Rishi Singh, the founder of tiingo and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it’s a great opportunity to use `RMarkdown's`

R and Python interoperability superpowers, fueled by the `reticulate`

package. If you are an R coder and someone sends you Python code as part of a project, `RMarkdown`

+ `reticulate`

makes it quite smooth to incorporate that Python code into your work. It was interesting to learn how a very experienced Python coder might tackle a problem and then think about how to tackle that problem with R. Unsurprisingly, I couldn’t resist adding a few elements of data visualization.

Before we get started, if you’re unfamiliar with using R and Python chunks throughout an `RMarkdown`

file, have a quick look at the `reticulate`

documentation here.

Let’s get to it. Since we’ll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we’ll also load the `reticulate`

package and source a Python script. Here’s what that looks like.

library(tidyverse) library(tidyquant) library(riingo) library(timetk) library(plotly) library(roll) library(slider) library(reticulate) riingo_set_token("your tiingo token here") # Python file that holds my tiingo token reticulate::source_python("credentials.py") knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)

Note that I set my tiingo token twice: first using `riingo_set_token()`

so I can use the `riingo`

package in R chunks and then by sourcing the `credentials.py`

file, where I have put `tiingoToken = 'my token'`

. Now I can use the `tiingoToken`

variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.

Next we will use a Python chunk to load the necessary Python libraries. If you haven’t installed these yet, you can open the RStudio terminal and run `pip install`

. Since we’ll be interspersing R and Python code chunks throughout, I will add a `# Python Chunk`

to each Python chunk and, um, `# R Chunk`

to each R chunk.

# Python chunk import pandas as pd import numpy as np import tiingo

Let’s get to the substance. The goal today is look back at the last 43 years of S&P 500 price history and analyze how the market has performed following a day that sees an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.

We will use the mutual fund `VFINX`

as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like `SPY`

or `VOO`

.

Let’s start by passing a URL string from tiingo to the `pandas`

function `read_csv`

, along with our `tiingoToken`

.

# Python chunk pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)

We just created a Python object called `pricesDF`

. We can look at that object in an R chunk by calling `py$pricesDF`

.

# R chunk py$pricesDF %>% head()

We just created a Python object called `pricesDF`

. Let’s reformat the date column becomes the index, in date time format.

# Python chunk pricesDF = pricesDF.set_index(['date']) pricesDF.index = pd.DatetimeIndex(pricesDF.index)

Heading back to R for viewing, we see that the date column is no longer a column – it is the index of the data frame and in `pandas`

the index is more like a label than a new column. In fact, here’s what happens when call the row names of this data frame.

# R chunk py$pricesDF %>% head() %>% rownames()

We now have our prices, indexed by date. Let’s convert adjusted closing prices to log returns and save the results in a new column called `returns`

. Note the use of the `shift(1)`

operator here. That is analogous to the `lag(..., 1)`

function in `dplyr`

.

# Python chunk pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))

Next, we want to calculate the 3-month rolling standard deviation of these daily log returns, and then divide daily returns by the *previous* rolling 3-month volatility in order to prevent look-ahead error. We can think of this as normalizing today’s return by the previous 3-months’ rolling volatility and will label it as `stdDevMove`

.

# Python chunk pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std() pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)

Finally, we eventually want to calculate how the market has performed on the day following a large negative move. To prepare for that, let’s create a column of next day returns using `shift(-1)`

.

# Python chunk pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)

Now, we can filter by the size of the `stdDevMove`

column and the `returns`

column, to isolate days where the standard deviation move was at least 3 and the returns was less than -3%. We use `mean()`

to find the mean next day return following such large events.

# Python chunk nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()

Finally, let’s loop through and see how the mean next day return changes as we filter on different extreme negative returns or we can call drop tolerances. We will label the drop tolerance as `i`

, set it at `-.03`

and then run a `while`

loop that decrements down `i`

by .0025 at each pass. In this way we can look at the mean next return following different levels of negative returns.

# Python chunk i = -.03 while i >= -.0525: nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']] print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6))) i -= .0025

It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.

Let’s reproduce these results in R.

First, we import prices using the `riingo_prices()`

function from the riingo.

# R chunk sp_500_prices <- "VFINX" %>% riingo_prices(start_date = "1976-01-01", end_date = today())

We can use `mutate()`

to add a column of daily returns, rolling volatility, standard deviation move and next day returns.

# R chunk sp_500_returns <- sp_500_prices %>% select(date, adjClose) %>% mutate(daily_returns_log = log(adjClose/lag(adjClose)), rolling_vol = roll_sd(as.matrix(daily_returns_log), 63), sd_move = daily_returns_log/lag(rolling_vol), next_day_returns = lead(daily_returns_log))

Now let’s `filter()`

on an `sd_move`

greater than 3 and `daily_returns_log`

less than a drop tolerance of -.03.

# R chunk sp_500_returns %>% na.omit() %>% filter(sd_move < -3 & daily_returns_log < -.03) %>% select(date, daily_returns_log, sd_move, next_day_returns) %>% summarise(mean_return = mean(next_day_returns)) %>% add_column(drop_tolerance = scales::percent(.03), .before = 1)

# A tibble: 1 x 2 drop_tolerance mean_return <chr> <dbl> 1 3% 0.00625

We used a `while()`

loop to iterate across different drop tolerances in Python, let’s see how to implement that using `map_dfr()`

from the `purrr`

package.

First, we will define a sequence of drop tolerances using the `seq()`

function.

# R chunk drop_tolerance <- seq(.03, .05, .0025) drop_tolerance

[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500

Next, we will create a function called `outlier_mov_fun`

that takes a data frame of returns, filters on a drop tolerance and gives us the mean return following large negative moves.

# R chunk outlier_mov_fun <- function(drop_tolerance, returns) { returns %>% na.omit() %>% filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>% select(date, daily_returns_log, sd_move, next_day_returns) %>% summarise(mean_return = mean(next_day_returns) %>% round(6)) %>% add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>% add_column(drop_tolerance_raw = drop_tolerance, .before = 1) }

Notice how that function takes two arguments: a drop tolerance and data frame of returns.

Next, we pass our sequence of drop tolerances, stored in a variable called `drop_tolerance`

to `map_dfr()`

, along with our function and our `sp_500_returns`

object. `map_dfr`

will iterate through our sequence of drops and apply our function to each one.

# R chunk map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>% select(-drop_tolerance_raw)

# A tibble: 9 x 2 drop_tolerance mean_return <chr> <dbl> 1 3% 0.00625 2 3% 0.00700 3 3% 0.00967 4 4% 0.0109 5 4% 0.0122 6 4% 0.0132 7 4% 0.0149 8 5% 0.0149 9 5% 0.0162

Have a quick glance up that the results of our Python `while()`

and we should see that the results are consistent.

Alright, let’s have some fun and get to visualizing these results with `ggplot`

and `plotly`

.

# R chunk ( sp_500_returns %>% map_dfr(drop_tolerance, outlier_mov_fun, .) %>% ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) + geom_point(color = "cornflowerblue") + labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") + scale_x_continuous(labels = scales::percent) + scale_y_continuous(labels = scales::percent) + theme_minimal() ) %>% ggplotly(tooltip = "text")

Here’s what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.

# R chunk drop_tolerance_2 <- seq(.02, .05, .00125) ( sp_500_returns %>% map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>% ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) + geom_point(color = "cornflowerblue") + labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") + scale_x_continuous(labels = scales::percent) + scale_y_continuous(labels = scales::percent) + theme_minimal() ) %>% ggplotly(tooltip = "text")

Check out what happens when we expand the lower bound, to a -6% drop tolerance.

# R chunk drop_tolerance_3 <- seq(.02, .06, .00125) ( sp_500_returns %>% map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>% ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) + geom_point(color = "cornflowerblue") + labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") + scale_x_continuous(labels = scales::percent) + scale_y_continuous(labels = scales::percent) + theme_minimal() ) %>% ggplotly(tooltip = "text")

I did not expect that gap upward when the daily drop passes 5.25%.

A quick addendum that if I had gotten my act together and finished this 4 days ago I would not have included, but I’m curious how this last week has compared with other weeks in terms of volatility. I have in mind to visualize weekly return dispersion and that seemed a mighty tall task, until the brand new `slider`

package came to the rescue! `slider`

has a function called `slide_period()`

that, among other things, allows us to break up time series according to different periodicities.

To break up our returns by week, we call `slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week")`

, where `first_monday_december`

is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that’s a Monday or we could use some good ol’ code. Let’s assume we want to find the first Monday in December of 2016.

We first filter our data with `filter(between(date, as_date("2016-12-01"), as_date("2016-12-31")))`

. Then create a column of weekday names with `wday(date, label = TRUE, abbr = FALSE)`

and filter to our first value of “Monday”.

# R Chunk first_monday_december <- sp_500_returns %>% mutate(date = ymd(date)) %>% filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>% mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>% filter(day_week == "Monday") %>% slice(1) %>% pull(date)

Now we run our `slide_period_dfr()`

code and it will start on the first Monday in December of 2016, and break our returns into weeks. Since we set `.names_to = "week"`

, the function will create a new column called `week`

and give a unique number to each of our weeks.

# R chunk sp_500_returns %>% select(date, daily_returns_log) %>% filter(date >= first_monday_december) %>% slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>% head(10)

# A tibble: 10 x 3 week date daily_returns_log <int> <dttm> <dbl> 1 1 2016-12-05 00:00:00 0.00589 2 1 2016-12-06 00:00:00 0.00342 3 1 2016-12-07 00:00:00 0.0133 4 1 2016-12-08 00:00:00 0.00226 5 1 2016-12-09 00:00:00 0.00589 6 2 2016-12-12 00:00:00 -0.00105 7 2 2016-12-13 00:00:00 0.00667 8 2 2016-12-14 00:00:00 -0.00810 9 2 2016-12-15 00:00:00 0.00392 10 2 2016-12-16 00:00:00 -0.00172

From here, we can `group_by`

that `week`

column and treat each week as a discrete time period. Let’s use `ggplotly`

to plot each week on the x-axis and the daily returns of each week on the y-axis, so that the vertical dispersion shows us the dispersion of weekly returns. Hover on the point to see the exact date of the return.

# R chunk ( sp_500_returns %>% select(date, daily_returns_log) %>% filter(date >= first_monday_december) %>% slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>% group_by(week) %>% mutate(start_week = ymd(min(date))) %>% ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) + geom_point(color = "cornflowerblue", alpha = .5) + scale_y_continuous(labels = scales::percent, breaks = scales::pretty_breaks(n = 8)) + scale_x_date(breaks = scales::pretty_breaks(n = 10)) + labs(y = "", x = "", title = "Weekly Daily Returns") + theme_minimal() ) %>% ggplotly(tooltip = "text")

We can also plot the standard deviation of returns for each week.

# R chunk ( sp_500_returns %>% select(date, daily_returns_log) %>% filter(date >= first_monday_december) %>% slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>% group_by(week) %>% summarise(first_of_week = first(date), sd = sd(daily_returns_log)) %>% ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) + geom_point(aes(color = sd)) + labs(x = "", title = "Weekly Standard Dev of Returns", y = "") + theme_minimal() ) %>% ggplotly(tooltip = "text")

That’s all for today! Thanks for reading and stay safe out there.

**leave a comment**for the author, please follow the link and comment on their blog:

**R Views**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.