# Spurious Regression illustrated

March 4, 2012
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)

The spurious regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Let's see what the problem is and how we can fix it. I use Morgan Stanley (symbol MS) for illustration, over a pre-crisis time span. Take a look at the following figure, generated from a regression of MS on the S&P using the actual prices of both. When we use actual prices we call it a regression in levels, as in price levels, as opposed to log-transformed prices or returns.

Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.

The results from the regression are:

| | Estimate | Std. Error | t.value | P.value |
|---|---|---|---|---|
| (Intercept) | -46.4234 | 2.1827 | -21.27 | 0.0000 |
| beta.hat | 0.8534 | 0.0178 | 47.90 | 0.0000 |

R^2 = 0.76

The graph looks fine and the results seem to make sense, but the inference is utterly wrong!

The thing is, both series drift upward, so they drift together and it looks as if they are related. As a matter of fact they are related, but what we just did is the wrong way to check it. Here are similar results from two random walks, x and y:

```r
# Two independent random walks: cumulative sums of normal steps with a small (0.05) drift
y <- cumsum(rnorm(250 * 10, 0.05))
x <- cumsum(rnorm(250 * 10, 0.05))

lm2 <- lm(y ~ x)
summary(lm2)

# Actual series in black, fitted values from the regression in blue
plot(y, type = "l", main = "Fitted (in blue) over Actual -- Random WALK this time", xlab = "x")
lines(fitted(lm2), col = 4)
```

| | Estimate | Std.Error | t.value | P.value |
|---|---|---|---|---|
| (Intercept) | 7.0474 | 0.4651 | 15.15 | 0.0000 |
| x | 0.5862 | 0.0062 | 94.29 | 0.0000 |

R^2 = 0.78

Note the resemblance to the previous figure and table. So the analysis of two random walks, which are independent of each other by construction, and the analysis of two time series in levels can give the same qualitative result, as if there were a significant positive correlation. That can’t be good, right? In real life, how would I know whether what I see is an actual relation or the result of two unrelated series that just so happen to drift in the same direction?

Here we step into the highly important yet amazingly boring domain of unit roots. This post is not about unit roots, and I want to keep it short so as not to lose the remaining 5% of the 100% who started reading. Abusing terminology a little, it suffices to say that we need to remove the drift in the series; check here and the references therein for more information. Once the drift is removed, we can verify that there is indeed a real relation, meaning that Morgan Stanley's stock movement is actually affected by the market movement. Removing the drift is easy: use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
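To get a feel for how much first differences help, here is a small simulation sketch in base R. It is not part of the original analysis: the series length, the number of simulated pairs, the seed, and the helper name `slope_pvalue` are all assumptions chosen for illustration. It counts how often the slope looks "significant" at the 5% level for two independent random walks, first in levels and then in first differences.

```r
# Sketch: how often does a t-test "find" a relation between two
# independent random walks, in levels versus in first differences?
set.seed(1)      # arbitrary seed, only for reproducibility

n_obs  <- 500    # length of each simulated series (assumed)
n_sims <- 200    # number of simulated pairs (assumed)

# Hypothetical helper: p-value of the slope in the regression y ~ x
slope_pvalue <- function(y, x) {
  coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]
}

p_levels <- numeric(n_sims)
p_diffs  <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  # Independent by construction, both with a small positive drift
  x <- cumsum(rnorm(n_obs, mean = 0.05))
  y <- cumsum(rnorm(n_obs, mean = 0.05))

  p_levels[i] <- slope_pvalue(y, x)              # regression in levels
  p_diffs[i]  <- slope_pvalue(diff(y), diff(x))  # regression in first differences
}

# Share of simulations in which the slope is "significant" at 5%
reject_levels <- mean(p_levels < 0.05)
reject_diffs  <- mean(p_diffs  < 0.05)

cat("Rejection rate in levels:     ", reject_levels, "\n")
cat("Rejection rate in differences:", reject_diffs, "\n")
```

Since the series are unrelated, a well-behaved test should reject in roughly 5% of the simulations. The rate in levels should come out far above that, while the rate in differences should stay close to it.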
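We can transform the data from levels to returns and re-run the regression as follows:

```r
library(quantmod)  # getSymbols()
library(tseries)   # adf.test()

tckr  <- c("MS", "SPY")
end   <- "2007-01-01"
# The original used format(Sys.Date() - 365*8, "%Y-%m-%d"), i.e. 8 years before the run date.
# Assumption: at publication (March 2012) that was roughly the date below; it is fixed here
# so that the start date stays before the end date whenever the script is run.
start <- "2004-03-06"

dat1 <- getSymbols(tckr[1], src = "yahoo", from = start, to = end, auto.assign = FALSE)
dat2 <- getSymbols(tckr[2], src = "yahoo", from = start, to = end, auto.assign = FALSE)

# Convert to returns: open-to-close return of each day, (Close - Open) / Open
ret1 <- as.numeric((dat1[, 4] - dat1[, 1]) / dat1[, 1])
ret2 <- as.numeric((dat2[, 4] - dat2[, 1]) / dat2[, 1])

lmret <- lm(ret1 ~ ret2)
summary(lmret)

# Scatter of MS returns against SPY returns, with the fitted regression line in red
plot(ret1 ~ ret2, xlab = "SPY return", ylab = "MS return")
abline(lmret, col = 2, lwd = 2.5)
```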

Regression using returns

Now we can see that even when the analysis uses returns rather than levels, we still get a good fit.

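You can use the “adf.test” function in the “tseries” package to check whether your series drifts or is stationary*. The null hypothesis of the test is that the series has a unit root, so a large p-value means you cannot reject non-stationarity.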

```r
adf.test(as.numeric(dat1[, 1]))  # --> P.value is 0.6481 --> has Unit Root
adf.test(as.numeric(ret1))       # --> P.value < 0.01 --> no Unit Root
```

As a final note, the fact that we cannot make inferences using price levels does not render the regression completely useless. Both the “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that the y-series may drift and the x-series may drift, but the residual from the regression will not!

Residuals from Regression on levels

See how the residuals from the regression fluctuate around zero.
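The idea that co-integration shows up in the residuals can be sketched on simulated data with a hand-rolled Dickey-Fuller regression in base R. This is an illustration under stated assumptions, not the procedure used for the figure above. The simulated series, the seed, and the helper name `df_stat` are made up for the example. The critical value of about -3.34 is the approximate asymptotic 5% Engle-Granger value for two variables with a constant, quoted from memory of MacKinnon's tables, so treat it as a rough guide.

```r
# Sketch: regress levels on levels, then check whether the residuals
# still behave like a random walk (no co-integration) or revert to zero.
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 1000     # series length (assumed)

x <- cumsum(rnorm(n))                               # a random walk
u <- as.numeric(arima.sim(list(ar = 0.5), n = n))   # stationary AR(1) noise

y_coint <- 2 + 0.8 * x + u     # co-integrated with x by construction
y_indep <- cumsum(rnorm(n))    # independent random walk, not co-integrated with x

# Hypothetical helper: Dickey-Fuller t-statistic for a residual series e,
# from the regression diff(e)_t = rho * e_{t-1} + error (no constant,
# because OLS residuals already have mean zero).
df_stat <- function(e) {
  de   <- diff(e)
  elag <- e[-length(e)]
  fit  <- lm(de ~ 0 + elag)
  coef(summary(fit))["elag", "t value"]
}

stat_coint <- df_stat(residuals(lm(y_coint ~ x)))
stat_indep <- df_stat(residuals(lm(y_indep ~ x)))

crit_5pct <- -3.34  # approximate Engle-Granger 5% critical value (assumption, see text)

cat("DF statistic, co-integrated pair:", stat_coint, "\n")
cat("DF statistic, independent pair:  ", stat_indep, "\n")
cat("Co-integrated pair rejects a unit root in residuals:", stat_coint < crit_5pct, "\n")
```

A statistic far below the critical value suggests the residuals are stationary, which is what co-integration requires. For the independent pair the statistic will usually sit above the critical value, although a false rejection is possible about 5% of the time.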

1. * A stationary process does not only mean “no drift”; there is a weak definition and a strong definition, see here for more information.
2. According to the graph, the end of the time span I used, which is the start of 2007, looked like a good time to short MS and hedge with the market. I leave it to the reader to check what the loss on such a trade would have been.
