# Introduction to Cointegration and Pairs Trading

April 15, 2011
By

(This article was first published on Edwin Chen's Blog » r, and kindly contributed to R-bloggers)

# Introduction

Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths.

But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that although each path individually is still an unpredictable random walk, given the location of one of the drunk or dog, we have a pretty good idea of where the other is; that is, the distance between the two is fairly predictable. (For example, if the dog wanders too far away from his owner, she’ll tend to move in his direction to avoid losing him, so the two stay close together despite a tendency to wander around on their own.) We describe this relationship by saying that the drunk and her dog form a cointegrating pair.

In more technical terms, if we have two non-stationary time series X and Y that become stationary when differenced (these are called integrated of order one series, or I(1) series; random walks are one example) such that some linear combination of X and Y is stationary (aka, I(0)), then we say that X and Y are cointegrated. In other words, while neither X nor Y alone hovers around a constant value, some combination of them does, so we can think of cointegration as describing a particular kind of long-run equilibrium relationship. (The definition of cointegration can be extended to multiple time series, with higher orders of integration.)

Other examples of cointegrated pairs:

• Income and consumption: as income increases/decreases, so too does consumption.
• Size of police force and amount of criminal activity
• A book and its movie adaptation: while the book and the movie may differ in small details, the overall plot will remain the same.
• Number of patients entering or leaving a hospital

# An application

So why do we care about cointegration? In quantitative finance, cointegration forms the basis of the pairs trading strategy: suppose we have two cointegrated stocks X and Y, with the particular (for concreteness) cointegrating relationship X – 2Y = Z, where Z is a stationary series of zero mean. For example, X could be McDonald’s, Y could be Burger King, and the cointegration relationship would mean that X tends to be priced twice as high as Y, so that when X is more than twice the price of Y, we expect X to move down or Y to move up in the near future (and analogously, if X is less than twice the price of Y, we expect X to move up or Y to move down). This suggests the following trading strategy: if X – 2Y > d, for some positive threshold d, then we should sell X and buy Y (since we expect X to decrease in price and Y to increase), and similarly, if X – 2Y < -d, then we should buy X and sell Y.

# Spurious regression

But why do we need the notion of cointegration at all? Why can’t we simply use, say, the R-squared between X or Y to see if X and Y have some kind of relationship? The reason is that standard regression analysis fails when dealing with non-stationary variables, leading to spurious regressions that suggest relationships even when there are none.

For example, suppose we regress two independent random walks against each other, and test for a linear relationship. A large percentage of the time, we’ll find high R-squared values and low p-values when using standard OLS statistics, even though there’s absolutely no relationship between the two random walks. As an illustration, here I simulated 1000 pairs of random walks of length 100, and found p-values less than 0.05 in 77% of the cases:

# A Cointegration Test

So how do you detect cointegration? There are several different methods, but the simplest is the Engle-Granger test, which works roughly as follows:

• Check that $X _ t$ and $Y _ t$ are both I(1).
• Estimate the cointegrating relationship $Y _ t = aX _ t + e _ t$ by ordinary least squares.
• Check that the cointegrating residuals $e _ t$ are stationary (say, by using a so-called unit root test, e.g., the Dickey-Fuller test).

# Error-correction and Granger representation

Something else that should perhaps be mentioned is the relationship between cointegration and error-correction mechanisms: suppose we have two cointegrated series $X _ t, Y _ t$, with autoregressive representations

$X _ t = a X _ {t-1} + b Y _ {t-1} + u _ t$
$Y _ t = c X _ {t-1} + d Y _ {t-1} + v _ t$

By the Granger representation theorem (which is actually a bit more general than this), we then have

$\Delta X _ t = \alpha _ 1 (Y _ {t-1} - \beta X _ {t-1}) + u _ t$
$\Delta Y _ t = \alpha _ 2 (Y _ {t-1} - \beta X _ {t-1}) + v _ t$

where $Y _ {t-1} - \beta X _ {t-1} \sim I(0)$ is the cointegrating relationship. Regarding $Y _ {t-1} - \beta X _ {t-1}$ as the extent of disequilibrium from the long-run relationship, and the $\alpha _ i$ as the speed (and direction) at which the time series correct themselves from this disequilibrium, we can see that this formalizes the way cointegrated variables adjust to match their long-run equilbrium.

# Summary

So, just to summarize a bit, cointegration is an equilibrium relationship between time series that individually aren’t in equilbrium (you can kind of contrast this with (Pearson) correlation, which describes a linear relationship), and it’s useful because it allows us to incorporate both short-term dynamics (deviations from equilibrium) and long-run expectations (corrections to equilibrium).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...