[This article was first published on R - datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m going to be writing a series of posts which will look at some applications of R (and perhaps Python) to financial modelling. We’ll start here by pulling some stock data into R, calculating the daily returns and then looking at correlations and simple volatility estimates.

## Data

I was looking for a chunky set of stock data, preferably something that was available at higher than daily time resolution. These data from Kaggle for NIFTY 100 stocks on the Indian Stock Market seemed a decent fit.

📌 Although I wanted intraday data I won’t be using it for the moment and will start by aggregatig the data up to daily resolution.

## List of Data Frames

Data for each of the stocks is in a separate CSV file. Start by iterating over these using `map_dfr()` and loading them into a single data frame.

```FILES <- list.files("data", pattern=".csv\$", full.names = TRUE)

load <- function(path) {
read_csv(path, show_col_types = FALSE) %>%
mutate(
symbol = sub(".csv\$", "", basename(path)),
time = as.POSIXct(date),
date = as.Date(time)
) %>%
group_by(symbol, date) %>%
arrange(time) %>%
# Summarise the daily closing price.
summarise(
close = last(close),
.groups = "drop"
)
}

stocks <- map_dfr(FILES, load)
```

How many records do we have and what does the data look like?

```nrow(stocks)

[1] 151932

# A tibble: 6 × 3
symbol date       close
<chr>  <date>     <dbl>
1 ACC    2015-02-02 1515.
2 ACC    2015-02-03 1512.
3 ACC    2015-02-04 1490
4 ACC    2015-02-05 1500.
5 ACC    2015-02-06 1513.
6 ACC    2015-02-09 1504.
```

We could work with them en masse, but it’s going to be easier to split into a list of data frames, one for each stock.

```stocks <- split(stocks, stocks\$symbol)
length(stocks)

[1] 87

names(stocks)

[6] "ASIANPAINT" "AUROPHARMA" "AXISBANK"   "BAJAJ-AUTO" "BAJAJFINSV"
[11] "BAJFINANCE" "BANKBARODA" "BERGEPAINT" "BHARTIARTL" "BIOCON"
[16] "BOSCHLTD"   "BPCL"       "BRITANNIA"  "CADILAHC"   "CHOLAFIN"
[21] "CIPLA"      "COALINDIA"  "COLPAL"     "DABUR"      "DIVISLAB"
[26] "DLF"        "DRREDDY"    "EICHERMOT"  "GAIL"       "GODREJCP"
[31] "GRASIM"     "HAVELLS"    "HCLTECH"    "HDFC"       "HDFCBANK"
[36] "HEROMOTOCO" "HINDALCO"   "HINDPETRO"  "HINDUNILVR" "ICICIBANK"
[41] "IGL"        "INDUSINDBK" "INDUSTOWER" "INFY"       "IOC"
[46] "ITC"        "JINDALSTEL" "JSWSTEEL"   "JUBLFOOD"   "KOTAKBANK"
[51] "LT"         "LUPIN"      "M_M"        "MARICO"     "MARUTI"
[56] "MCDOWELL-N" "MUTHOOTFIN" "NAUKRI"     "NESTLEIND"  "NIFTY50"
[61] "NIFTYBANK"  "NMDC"       "NTPC"       "ONGC"       "PEL"
[66] "PIDILITIND" "PIIND"      "PNB"        "POWERGRID"  "RELIANCE"
[71] "SAIL"       "SBIN"       "SHREECEM"   "SIEMENS"    "SUNPHARMA"
[76] "TATACONSUM" "TATAMOTORS" "TATASTEEL"  "TCS"        "TECHM"
[81] "TITAN"      "TORNTPHARM" "ULTRACEMCO" "UPL"        "VEDL"
[86] "WIPRO"      "YESBANK"
```

There are 87 distinct stocks and a number of names there that you might recognise.

## Time Series

Now convert them into individual `xts` objects and concatenate to form a single `xts` object with the closing time series for each of the stocks.

```stocks <- stocks %>%
map(~{
ts <- xts(x = .[["close"]], order.by = .[["date"]])
colnames(ts) <- unique(.\$symbol)

ts
})

stocks <- do.call(merge, stocks) %>% na.omit()
stocks <- stocks["2016-01-01/2022-01-01"]
```

We now have a multivariate time series. How many individual series and how many observations?

```dim(stocks)

[1] 1485   87
```

So that’s 87 time series and 1485 observations for each of them. Each time series looks like this:

```head(stocks\$TATASTEEL)

TATASTEEL
2016-01-04    259.00
2016-01-05    272.95
2016-01-06    268.75
2016-01-07    251.00
2016-01-08    254.55
2016-01-11    252.90
```

We can print the closing prices but it’s more useful to have a plot.

## Returns

For analytical purposes the absolute daily closing price is not as interesting as the daily return, which can be easily calculated but we’ll use `CalculateReturns()` from `{PerformanceAnalytics}` for convenience.

```returns <- CalculateReturns(stocks) %>% na.omit()
```

Plot the corresponding return time series.

## Variability & Volatility

There’s a lot of diversity in the return time series across the collection of stocks. To illustrate, here are the series for a collection of them.

ACC has a roughly consistent low level of return variability over the six years. By contrast, Yes Bank has periods of increased variability starting in the latter part of 2018.

### Daily Volatility

It would be useful if we had a number (or numbers) to describe the relative level of variability. Volatility is a statistical measure of the dispersion of returns and is calculated as either the standard deviation or variance of the returns. In general, the higher the volatility, the riskier the security.

```# Pull out the returns for two specific stocks.
#
ACC <- returns\$ACC
YESBANK <- returns\$YESBANK
```

Let’s start by calculating the standard deviations of those stocks over the entire data period.

```# Daily volatility.
#
sd(ACC)

[1] 0.0184169

sd(YESBANK)

[1] 0.04682746
```

As expected the volatility of Yes Bank is higher than that of ACC.

### Annualised Volatility

These values can be thought of as the average daily variation in the stock’s return. However, what’s more commonly used is the annualised value, which takes into account the fact that there are on average 252 trading days in each year and reflects the average variation in the stock’s return over the course of a year.

```# Annualised volatility.
#
sqrt(252) * sd(ACC)

[1] 0.2923593

sqrt(252) * sd(YESBANK)

[1] 0.7433629
```

### Varying Volatility

It’s instructive to look at the volatility for shorter periods of time. Let’s compare the volatility of those two stocks in 2017 and 2020.

```sqrt(252) * sd(ACC["2017"])

[1] 0.2344968

sqrt(252) * sd(YESBANK["2017"])

[1] 0.2665718
```

In 2017 the volatilities of ACC and Yes Bank are comparable.

```sqrt(252) * sd(ACC["2020"])

[1] 0.3900236

sqrt(252) * sd(YESBANK["2020"])

[1] 1.283764
```

However, in 2020 Yes Bank is substantially more volatile than ACC.

Of course you can make these volatility calculations progressively more granular. However, as time intervals get shorter you have less data, so each of the individual measurements becomes less reliable. We’ll be modelling the time variation of volatility in upcoming posts.

## Outliers

The returns might include significant outliers that would likely bias any derived results. We can use winsorisation to reduce the effect of extreme outliers in the data. The `Return.clean()` function from `{PerformanceAnalytics}` can be used to apply `clean.boudt()` to the data.

```yesbank <- Return.clean(YESBANK, method = "boudt")
```

In the plot below the raw data are in black and the cleaned data are in blue. The extrema are reduced in magnitude in the cleaned data.

This can be appreciated by looking at the range (minimum and maximum values) of the rase and cleaned data.

```range(YESBANK)

[1] -0.5537634  0.5964912

range(yesbank)

[1] -0.1621622  0.1601474
```

## Importance of Volatility

Why is volatility important? It is an fundamental consideration in predicting future returns. Stocks with lower volatility are less likely to have substantial changes in return from day to day. However, stocks with higher volatility are literally more “volatile”: their returns can vary substantially from day to day.

As we have seen, volatility varies from stock to stock (some companies are more volatile than others) and from period to period (a company may have times of low and high volatility).

Volatility needs to be factored into future predictions. Of course, we also don’t know what future volatilities will be, so in addition to predicting future returns we will probably also want to predict future volatilities. We’ll see how this can be done in an upcoming post on GARCH models.