(This article was first published on **Fear and Loathing in Data Science**, and kindly contributed to R-bloggers)

When I first learned about Granger causality this past February, I was bemused and quite skeptical of the whole procedure. I felt it belonged on the scrapheap of impractical academic endeavors, preferring instead an ARIMA transfer function model for the same task. However, several contemporaries threw the red challenge flag, and upon further review my initial impressions have been overturned. Not only am I fascinated by the technique; in my attempt to discover its value I have become a raving R fan. As such, my first blog entry provides some simple code to let anyone use this obscure econometric technique. But first, some background.

Given two time series, x and y, Granger causality is a method that attempts to determine whether one series is likely to influence change in the other. This is accomplished by regressing one series on lagged values of itself and of the other series. We build two models that predict y: one using only past values of y (Ω), and another using past values of both y and x (π). The models are given below, where k is the number of lags:

Ω: y_t = β_0 + β_1 y_{t-1} + … + β_k y_{t-k} + e_t

π: y_t = β_0 + β_1 y_{t-1} + … + β_k y_{t-k} + α_1 x_{t-1} + … + α_k x_{t-k} + e_t

The residual sums of squares of the two models are then compared to determine whether the nested model (Ω) is adequate to explain the future values of y, or whether the full model (π) is better. An F-test, t-test, or Wald test (the one used in R) evaluates the following null and alternative hypotheses:

H_0: α_i = 0 for each i ∈ [1, k]

H_1: α_i ≠ 0 for at least one i ∈ [1, k]

Essentially, we are trying to determine whether x provides statistically more information about future values of y than past values of y alone. Under this definition it is clear that we are not trying to prove actual causation, only that the two series are related by some phenomenon. Along those lines, we must also run the model in the reverse direction to verify that y does not likewise provide information about future values of x. If it turns out that it does, there is likely some exogenous variable, z, driving both series, which either needs to be controlled for or could be a better candidate for Granger causation.

For a detailed explanation, one can read the original paper on the subject:

Granger, C.W.J. (1969), "Investigating Causal Relations by Econometric Models and Cross-spectral Methods", *Econometrica*, 37: 424-438.

The R package "lmtest" implements the Granger causality procedure and even includes a data set to answer the age-old question of which came first: the chicken or the egg. The data were presented by Walter Thurman and Mark Fisher in the May 1988 issue of the *American Journal of Agricultural Economics*, in a paper titled "Chickens, Eggs, and Causality, or Which Came First?" It consists of two annual time series from 1930 to 1983: U.S. egg production and the estimated U.S. chicken population.
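As an aside, since lmtest ships this data set as ChickEgg, you can also skip the .csv entirely and pull it straight from the package; note that it arrives as a ts matrix rather than a data frame with a Year column.

```r
library(lmtest)

data(ChickEgg)     # annual time series, 1930-1983
head(ChickEgg)     # columns: chicken, egg
```

Either route gives the same numbers; the .csv approach below just keeps the year as an explicit column.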

Let’s get some code going, first loading the data from a saved .csv file.

> chickegg <- read.csv(file.choose())

> head(chickegg)

Year chicken egg

1 1930 468491 3581

2 1931 449743 3532

3 1932 436815 3327

4 1933 444523 3255

5 1934 433937 3156

6 1935 389958 3081

> attach(chickegg)

> # plot the time series

> par(mfrow=c(2,1))

> plot.ts(chicken)

> plot.ts(egg)

The plots provide little information other than that the data are likely not stationary. I’ve just started using the forecast package, so let’s load it and test for what will achieve stationarity.

> library(forecast)

> # test for unit root and number of differences required, you can also test for seasonality with nsdiffs

> ndiffs(chicken, alpha=0.05, test=c("kpss"))

[1] 1

> ndiffs(egg, alpha=0.05, test=c("kpss"))

[1] 1

> # differenced time series

> dchick <- diff(chicken)

> degg <- diff(egg)

> plot.ts(dchick)

> plot.ts(degg)

Much better!

That’s pretty standard stuff, but this is where the magic happens! There are several ways to find the optimal lag, which I will skip in the interest of time, but let’s say four is the magic number.
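For the curious, one common way to pick that lag (a sketch of my own, not necessarily how the number four was chosen here) is to compare information criteria across candidate lag orders with VARselect() from the "vars" package:

```r
library(lmtest)   # ChickEgg data set
library(vars)     # VARselect() for lag-order selection

data(ChickEgg)
ChickEgg_d <- diff(ChickEgg)   # difference both series, as above

# Compare AIC, HQ, SC and FPE across lag orders 1 through 8
sel <- VARselect(ChickEgg_d, lag.max = 8, type = "const")
sel$selection                  # the lag each criterion prefers
```

The criteria will not always agree with one another; picking the order is part statistics, part judgment.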

> # do eggs granger cause chickens?

> grangertest(dchick ~ degg, order=4)

Granger causality test

Model 1: dchick ~ Lags(dchick, 1:4) + Lags(degg, 1:4)

Model 2: dchick ~ Lags(dchick, 1:4)

  Res.Df Df      F   Pr(>F)

1     40

2     44 -4 4.1762 0.006414 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Highly significant p-value, but what about the other direction?

> # do chickens granger cause eggs, at lag 4?

> grangertest(degg ~ dchick, order=4)

Granger causality test

Model 1: degg ~ Lags(degg, 1:4) + Lags(dchick, 1:4)

Model 2: degg ~ Lags(degg, 1:4)

  Res.Df Df      F Pr(>F)

1     40

2     44 -4 0.2817 0.8881

It is not significant, so we can say that eggs Granger-cause chickens!

This is just the tip of the iceberg, but it should be enough to pique your curiosity and make you dangerous. I’m working on commodity prices, bond prices and the U.S. stock market, but that is better left for another day.

Reference:

Thurman, W.N. & Fisher, M.E. (1988), "Chickens, Eggs, and Causality, or Which Came First?", *American Journal of Agricultural Economics*, 237-238.
