Be careful evaluating model predictions

December 2, 2016

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”

The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.


Please read on for a nasty little example.

Consider the following “harmless data frame.”

d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),

The recommended test for checking the quality of “prediction” related to “actual” is an F-test (this is the test stats::lm uses). We can directly run this test with sigr (assuming we have installed the package) as follows:


F Test summary: (R2=-16, F(1,6)=-5.6, p=n.s.).

sigr reports an R-squared of -16 (please see here for some discussion of R-squared). This may be confusing, but it correctly communicates we have no model and in fact “prediction” is worse than just using the average (a very traditional null-model).

However, cor.test appears to think “prediction” is a usable model:


	Pearson's product-moment correlation

data:  d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3977998  0.8697404
sample estimates:

This is all for a prediction where sum((d$actual-d$prediction)^2)==66 which is larger than sum((d$actual-mean(d$actual))^2)==4. We concentrate on effects measures (such as R-squared) as we can drive the p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean-value of the actual (2) as your constant predictor. To my mind that is not a good prediction. And lm seems similarly excited about “prediction.”


lm(formula = actual ~ prediction, data = d)

     Min       1Q   Median       3Q      Max 
-0.90909 -0.43182  0.09091  0.52273  0.72727 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2727     0.3521   6.455 0.000655 ***
prediction    0.3636     0.3149   1.155 0.292121    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04545 
F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921

One reason to not trust the lm result is it didn’t score the quality of “prediction“. It scored the quality of “0.3636*prediction + 2.2727.” It can be the case that “0.3636*prediction + 2.2727” is in fact a good predictor. But
that doesn’t help us if it is “prediction” we are showing to our boss or putting into production. We can try to mitigate this by insisting lm try to stay closer to the original by turning off the intercept or offset with the “0+” notation. That looks like the following.


lm(formula = actual ~ 0 + prediction, data = d)

   Min     1Q Median     3Q    Max 
  0.00   0.00   1.00   2.25   3.00 

           Estimate Std. Error t value Pr(>|t|)
prediction  -1.0000     0.6094  -1.641    0.145

Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.1746 
F-statistic: 2.692 on 1 and 7 DF,  p-value: 0.1448

Even the lm(0+)‘s adjusted prediction is bad as we see below:

d$lmPred <- predict(lm(actual~0+prediction,data=d))
[1] 26

Yes, the lm(0+) found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well chosen constant). And it is hard to argue that “-prediction” is the same model as “prediction.”

Now sigr is fairly new code, so it is a bit bold saying it is right when it disagrees with the standard methods. However sigr is right in this case. The standard methods are not so much wrong as different, for two reasons:

  1. They are answering different questions. The F-test is designed to check if the predictions in-hand are good or not; “cor.test” and “lm %>% summary” are designed to check if any re-scaling of the prediction is in fact good. These are different questions. Using “cor.test” or “lm %>% summary” to test the utility of a potential variable is a good idea. The reprocessing hidden in these tests is consistent with the later use of a variable in a model. Using them to score model results that are supposed be directly used is wrong.
  2. From the standard R code’s point of view it isn’t obvious what the right “null model” is. Remember our original point: the quality measures on lm(0+) are designed to see how well lm(0+) is working. This means the lm(0+) scores the quality of its output (not its inputs) so it gets credit for flipping the sign on the prediction. Also it considers the natural null-model to be one it can form where there are no variable driven effects. Since there is no intercept or “dc-term” in these models (caused by the “0+” notation) the grand average is not considered a plausible null-model as it isn’t in the concept space of the modeling situation the lm was presented with. Or from help(summary.lm):

    R^2, the ‘fraction of variance explained by the model’,
    R^2 = 1 – Sum(R[i]^2) / Sum((y[i]- y*)^2),
    where y* is the mean of y[i] if there is an intercept and zero otherwise.

I admit, this is very confusing. But it corresponds to documentation, and makes sense from a modeling perspective. It is correct. The silent switching of null model from average to zero make sense in the context it is defined in. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly instead of trying to hack “cor.test” or “lm(0+) %>% summary” to calculate it for you.

And that is what sigr is about: standard tests (using R supplied implementations) with a slightly different calling convention to better document intent (which in our case is almost always measuring the quality of a model separate from model construction). It is a new library, so it doesn’t yet have the documentation needed to achieve its goal, but we will eventually get there.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)