(This article was first published on **Eran Raviv » R**, and kindly contributed to R-bloggers)

The spurious regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Let's see what the problem is and how we can fix it. I am using the Morgan Stanley (MS) symbol for illustration, over a pre-crisis time span. Take a look at the following figure, generated from a regression of MS on the S&P using the *actual prices* of the stock and the *actual prices* of the S&P. When we use actual prices we call it a regression in levels, as in price levels, as opposed to log-transformed prices or returns.

The results from the regression are:

| | Estimate | Std. Error | t value | P value |
|---|---|---|---|---|
| (Intercept) | -46.4234 | 2.1827 | -21.27 | 0.0000 |
| beta.hat | 0.8534 | 0.0178 | 47.90 | 0.0000 |

R^2 = 0.76

The graph looks fine and the results seem to make sense, but the inference is utterly wrong!

The thing is, both series are drifting upward, so they drift together and it seems as if they are related. As a matter of fact they are related, but **what we just did is the wrong way to check it**. Here are similar results from *x* and *y* random walks!

```r
# Random walks: cumulative sums of normal steps with a small (0.05) drift
y = cumsum(rnorm(250*10, 0.05))
x = cumsum(rnorm(250*10, 0.05))
lm2 = lm(y ~ x) ; summary(lm2)
plot(y, type = "l", main = "Fitted (in blue) over Actual -- Random WALK this time", xlab = "x")
lines(fitted(lm2), col = 4) # fitted values in blue
```

| | Estimate | Std. Error | t value | P value |
|---|---|---|---|---|
| (Intercept) | 7.0474 | 0.4651 | 15.15 | 0.0000 |
| x | 0.5862 | 0.0062 | 94.29 | 0.0000 |

R^2 = 0.78

Note the resemblance with the previous figure and table.

So the analysis of two random walks, which are clearly independent of each other *by construction*, and the analysis of two time series in levels can give the same qualitative result: what looks like a significant positive correlation. That can’t be good, right?
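To see that the table above is not a lucky draw, here is a minimal Monte Carlo sketch. It repeats the random-walk regression many times and counts how often the slope looks "significant" at the 5% level. The seed, the number of repetitions (`n_sim`) and the variable names are my own assumptions for illustration, not part of the original analysis.

```r
# Monte Carlo sketch: regress one independent random walk on another,
# many times, and record the p-value of the slope each time.
set.seed(1)          # arbitrary seed, only for reproducibility
n_sim <- 200         # number of simulated pairs (assumed)
n_obs <- 250 * 10    # same series length as in the post

spurious_p <- replicate(n_sim, {
  y <- cumsum(rnorm(n_obs, mean = 0.05))  # random walk with small drift
  x <- cumsum(rnorm(n_obs, mean = 0.05))  # independent of y by construction
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Share of simulations in which the slope is "significant" at 5%
reject_rate <- mean(spurious_p < 0.05)
reject_rate
```

If the usual t-test were valid here, `reject_rate` would sit close to 0.05. For independent random walks it should come out far above that, which is the spurious regression problem in a single number.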

In real life, how would I know whether what I see is an actual relation or the result of two **unrelated** series that just so happen to be **drifting in the same direction**?

Here we step into the domain of the highly important yet amazingly boring topic of unit roots. This post is not about unit roots, and I want to keep it short so as not to lose the remaining 5% of the 100% who started reading. Abusing terminology a bit, suffice it to say that we need to remove the drift in the series; check here and the references therein for more information.
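For readers who want one step more than "remove the drift", here is the textbook random walk with drift written out. The symbols $\mu$ (drift), $\sigma$ (step standard deviation) and $\varepsilon_t$ (shock) are my notation, not the post's; in the simulated series of this post they would correspond to $\mu = 0.05$ and $\sigma = 1$.

$$
y_t = \mu + y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0, \sigma^2)
\quad\Longrightarrow\quad
y_t = y_0 + \mu t + \sum_{s=1}^{t} \varepsilon_s
$$

$$
\Delta y_t = y_t - y_{t-1} = \mu + \varepsilon_t
$$

The level $y_t$ carries a deterministic trend $\mu t$ plus an accumulated sum of shocks whose variance $t\sigma^2$ grows with time, so it is not stationary. The first difference $\Delta y_t$ is just a constant plus noise, which is stationary.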

Once the drift is removed, we can verify that there is indeed a real relation, meaning Morgan Stanley's stock movement is *actually* affected by the market movement. Removing the drift is easy: use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
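The same trick can be tried on the simulated random walks from earlier in the post. The sketch below regresses the levels and then the first differences and compares the two R^2 values. The seed and variable names are assumptions of mine, chosen only for illustration.

```r
# Sketch: first differences remove the spurious fit between
# two independent random walks.
set.seed(42)         # arbitrary seed, only for reproducibility
n_obs <- 250 * 10

y <- cumsum(rnorm(n_obs, mean = 0.05))
x <- cumsum(rnorm(n_obs, mean = 0.05))

fit_levels <- lm(y ~ x)              # regression in levels
fit_diffs  <- lm(diff(y) ~ diff(x))  # regression in first differences

r2_levels <- summary(fit_levels)$r.squared
r2_diffs  <- summary(fit_diffs)$r.squared
c(levels = r2_levels, differences = r2_diffs)
```

Since `x` and `y` are independent, the R^2 in differences should be close to zero, while the R^2 in levels is inflated by the shared upward drift. For a pair that is genuinely related, such as MS and the S&P, the fit should survive the transformation.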

We can transform the data from levels to returns and re-execute the regression as follows:

```r
library(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
# 8 years of data ending at `end` (computed from `end`, so the window
# is the same whenever this is run)
start <- format(as.Date(end) - 365*8, "%Y-%m-%d")
dat1 = getSymbols(tckr[1], src = "yahoo", from = start, to = end, auto.assign = FALSE)
dat2 = getSymbols(tckr[2], src = "yahoo", from = start, to = end, auto.assign = FALSE)
# Convert to returns: (close - open) / open for each day
ret1 = (dat1[,4] - dat1[,1]) / dat1[,1]
ret2 = (dat2[,4] - dat2[,1]) / dat2[,1]
lmret = lm(ret1 ~ ret2)
summary(lmret)
# Scatter of MS returns against S&P returns, with the fitted line in red
plot(as.numeric(ret2), as.numeric(ret1))
abline(lmret, col = 2, lwd = 2.5)
```

Now we can see that even when we use returns rather than levels, we still get a good fit.

You can use the “adf.test” function in the “tseries” package to check whether your series drifts (is non-stationary*) or not.

```r
adf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> cannot reject a Unit Root
adf.test(as.numeric(ret1))     # --> P.value < 0.01 --> Unit Root rejected
```

As a final note, the fact that we cannot make any inference using price levels does not render the regression completely useless. Both the “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that the y-series may drift and the x-series may drift, but the residual from the regression will not!

See how the residuals from the regression fluctuate around zero.
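A rough way to check the "residuals do not drift" idea is the two-step approach usually credited to Engle and Granger: run the levels regression, then test the residuals for a unit root. The sketch below uses simulated data rather than the MS and S&P prices, and `df_tstat` is a hypothetical helper of mine that computes a plain Dickey-Fuller t-statistic with base R (no lag augmentation), so treat it as an illustration and not as a replacement for a proper test.

```r
# Sketch of a residual-based co-integration check on simulated data.
set.seed(7)          # arbitrary seed, only for reproducibility
n_obs <- 1000

x <- cumsum(rnorm(n_obs, mean = 0.05))  # random walk with drift
y <- 2 + 0.8 * x + rnorm(n_obs)         # co-integrated with x by construction
z <- cumsum(rnorm(n_obs, mean = 0.05))  # independent random walk, for contrast

# Hypothetical helper: Dickey-Fuller t-statistic from regressing
# diff(e) on the lagged level of e (with an intercept).
df_tstat <- function(e) {
  de   <- diff(e)
  elag <- e[-length(e)]
  fit  <- lm(de ~ elag)
  summary(fit)$coefficients["elag", "t value"]
}

res_coint    <- resid(lm(y ~ x))  # residuals of the co-integrated pair
res_spurious <- resid(lm(z ~ x))  # residuals of the unrelated pair

t_coint    <- df_tstat(res_coint)
t_spurious <- df_tstat(res_spurious)
c(cointegrated = t_coint, spurious = t_spurious)
```

A strongly negative statistic points to residuals that keep returning to zero, as for the co-integrated pair; for the unrelated pair the statistic should be much closer to zero. Note that residual-based tests have their own critical values, which are more negative than the standard Dickey-Fuller ones, because the residuals come from an estimated regression.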

**Comments**

1. (*) A stationary process does not only mean “no drift”; there is a weak definition and a strong definition, see here for more information.

2. According to the graph, it seems that the end of the time span I used, the start of 2007, was a good time to short MS and hedge with the market. I leave it to the reader to check what the loss on such a trade would have been.

Thanks for reading.

