Help! My model fits too well!

[This article was first published on Nor Talk Too Wise » R, and kindly contributed to R-bloggers.]

This is sort-of related to my sidelined study of graph algebra. I was thinking about data I could apply a first-order linear difference model to, and the stock market came to mind. After all, despite some black swan sized shocks, what better predicts a day’s closing than the previous day’s closing? So, I hunted down the data and graphed exactly that:

Isn’t that just lovely?  The tight clustering around the line indicates that we have found a very good linear fit.  How good?  Well, let’s take a peek at our summary(model):

Call:
lm(formula = close ~ open)

Residuals:
      Min        1Q    Median        3Q       Max
-774.9914   -3.4477   -0.3122    3.2318  924.4627

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) 0.4299070   0.4206543    1.022     0.307
open        1.0000351   0.0000954  10482.9    <2e-16 ***

Residual standard error: 49.74 on 20593 degrees of freedom
Multiple R-squared: 0.9998,     Adjusted R-squared: 0.9998
F-statistic: 1.099e+08 on 1 and 20593 DF,  p-value: < 2.2e-16

Whoa!  An R-squared of .9998.  In other words, my very simple model describes 99.98% of all the variation seen in the Dow Jones Industrial Index end-of-day values.  Show this to any statistician and they’d say that’s nearly impossible.  You’ve got to have some tautology in the model, some independent variable that is basically the same as the dependent variable.  And they’d be right.  However, the linear model is not my goal.  I don’t want to predict the progress of the Dow over a day.  I want to do it over a much longer term.  For that reason, I can look past their complaints and build the first-order linear difference model.
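To see why a fit like this is close to inevitable, here is a minimal sketch on simulated data, not the Dow series: a random walk regressed on its own previous value. The seed, series length, starting level, and the names walk, lagged, and fit are all assumptions made up for illustration.

```r
# Simulated random walk (synthetic data, not the DJIA series).
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 5000                         # arbitrary series length
walk <- 100 + cumsum(rnorm(n))    # each value = previous value + noise

# Regress each value on the value that came just before it.
lagged <- data.frame(prev = walk[-n], curr = walk[-1])
fit <- lm(curr ~ prev, data = lagged)

r2 <- summary(fit)$r.squared
slope <- coef(fit)[["prev"]]
cat("R-squared:", round(r2, 4), " slope:", round(slope, 4), "\n")
```

Even though each step of this series is pure noise, the previous value explains almost all of the variation in the current one, and the slope lands close to 1.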

If we plot the function y(x) = 1.0000351(y(x-1)) + 0.4299070, the output is a little less than satisfying.  Here is that function over a scatterplot of Dow scores:

That looks pretty underwhelming.  In fact, it almost looks…linear.  Gross.  What happened?

First off, I assure you it is not the problem the aforementioned statisticians pointed out.  The real problem was that, though our slope was really convincing, it was also really close to 1.  Which means that it basically fell out of our equation, leaving y(x) = y(x-1) + .430 (the fitted intercept, 0.4299070, rounded).  If all we’re doing is adding .430 every iteration, we have in effect generated the linear equation y = .430x + y(0), where y(0) is the starting value.  That doesn’t tell me anything about the stock market!
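The "almost linear" shape can be checked directly. A first-order linear difference equation y(t) = a*y(t-1) + b has the closed-form solution y(t) = a^t * y(0) + b * (a^t - 1) / (a - 1) whenever a is not exactly 1. The sketch below uses the coefficients from the summary above and the starting value of 0.3 from the code at the end of this post. The number of steps is an assumption: 20593 residual degrees of freedom plus two estimated coefficients suggests 20595 rows.

```r
a  <- 1.0000351   # slope from the model summary
b  <- 0.4299070   # intercept from the model summary
y0 <- 0.3         # starting value used in the post's code
n  <- 20595       # assumed row count: 20593 residual df + 2 coefficients

# Iterate y(t) = a * y(t-1) + b.
y <- numeric(n)
prev <- y0
for (i in seq_len(n)) {
  y[i] <- a * prev + b
  prev <- y[i]
}

# Closed-form solution of the same recurrence (valid for a != 1).
t <- seq_len(n)
closed <- a^t * y0 + b * (a^t - 1) / (a - 1)

# Straight line obtained by treating the slope as exactly 1.
linear <- y0 + b * t

cat("max |iterated - closed form|:", max(abs(y - closed)), "\n")
cat("final value, iterated:", y[n], " straight line:", linear[n], "\n")
```

The iterated path and the closed form agree, and both end above the straight line: the slope's tiny excess over 1 compounds across roughly 20,000 steps, so the curve bends gently upward instead of staying perfectly straight.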

Take home points:

  • Data follows a law of diminishing returns.  Your second observation is worth a hell of a lot more than your hundredth.  And there does come a point in the model where more data will not only stop helping you, it will probably hurt you.  If we isolated a particular time period (like from Reagan onward to examine the effects of his policies on stock market behavior), then we could tell a lot more.
  • A slope of 1 in a difference model is a very bad thing.  It means the model will inevitably be nearly useless.  Just use a linear regression or loess on the scatterplot and run with that.
  • Just because it’s significant doesn’t mean it’s useful.  Don’t be seduced by the p-value or the R-squared of the linear model.  While they are impressive, the model was (nearly) tautological, so those otherwise jaw-dropping numbers were (nearly) inevitable.
  • That said, the tautology doesn’t create the slope of 1.  Don’t be put off by statistical measures that aren’t in themselves terribly useful, like an R-squared of .9998 or a p-value below 2.2e-16.
  • In other words, though the linear difference model is dependent upon the linear model, the significance or usefulness of the former is entirely independent from that of the latter.
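The point about a slope of 1 can be illustrated with a small sketch. The helper name iterate_difference, the slopes 0.9 and 1.1, and the step count are hypothetical choices for illustration; the intercept and starting value are the ones used in this post.

```r
# Hypothetical helper: iterate y(t) = a * y(t-1) + b for n steps.
iterate_difference <- function(a, b, y0, n) {
  y <- numeric(n)
  prev <- y0
  for (i in seq_len(n)) {
    y[i] <- a * prev + b
    prev <- y[i]
  }
  y
}

b  <- 0.4299070   # intercept from the fitted model
y0 <- 0.3         # starting value from the post's code
n  <- 200         # arbitrary number of steps

damped    <- iterate_difference(0.9, b, y0, n)  # a < 1: settles at b / (1 - a)
unit      <- iterate_difference(1.0, b, y0, n)  # a = 1: straight line, step b
explosive <- iterate_difference(1.1, b, y0, n)  # a > 1: exponential growth

cat("a = 0.9 ends at", damped[n], "; fixed point b/(1-a) =", b / (1 - 0.9), "\n")
cat("a = 1.0 has constant steps:", isTRUE(all.equal(diff(unit), rep(b, n - 1))), "\n")
cat("a = 1.1 ends at", explosive[n], "\n")
```

Only slopes away from 1 give the difference model any shape of its own: below 1 it levels off at a fixed point, above 1 it takes off exponentially, and at exactly 1 it collapses to a straight line with the intercept as its step size.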

If you’re interested in running this yourself, the R code is here:

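# The path to the CSV of daily DJIA values was left blank in the original post.
df <- read.csv(file="", header=TRUE, sep=",")

# Regress each day's close on its open, taking both columns from df.
model <- lm(close ~ open, data=df)
plot(df$open, df$close, xlab="", ylab="", pch=19)
title(xlab="X", ylab="X(t+1)", main="Plot of the first differences", cex=1.5, col="black", font=2)
abline(model, lwd=2)

# Iterate the difference equation y(t) = a*y(t-1) + b from a starting value of 0.3.
a <- coef(model)[["open"]]          # slope
b <- coef(model)[["(Intercept)"]]   # intercept
timeserieslength <- nrow(df)
t <- seq_len(timeserieslength)      # time index 1, 2, ..., n
y2 <- numeric(timeserieslength)     # preallocated model output
y1 <- 0.3                           # previous value fed into each step
for (i in t) {
  y2[i] <- (a * y1) + b
  y1 <- y2[i]
}
plot(t, df$close, xlab="time", ylab="Dow Jones Industrial Index", main="DJIA over time, 1928-2010", pch=19)
lines(t, y2, lwd=2)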
