Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In econometrics course we always say to our students that “if you fit a linear model with no constant, then you might have trouble. For instance, you might have a negative R-squared“. So I tried to find databases on the internet such that, when we compute a linear regression, we actually obtain a negative R squared. I have generated hundreds to random databases that should exhibit such a property, in R. With no success. Perhaps to be more specific, I should explain what might happen if we do not include a constant in a linear model. Consider the following dataset, where points are on a straight line, with a negative slope, far from the origin, symmetric with respect to the first diagonal.

> x=1:3
> y=3:1
> plot(x,y)


Points are on a straight line, so it is actually possible to get a perfect linear model. But only if we integrate a constant in our model. This is related to the fact that the correlation between our two variates is -1,

> cor(x,y)
[1] -1

The least-square program is here

i.e. the estimate of the slope is

Numerically, we obtain

> sum(x*y)/sum(x^2)
[1] 0.7142857

which is the actual slope on the illustration above. If we compute the sum of squares of errors (as a function of the slope), we have here

> ssr=function(b){sum((y-b*x)^2)}
> SSR=Vectorize(ssr)
> B=seq(-1,3,by=.1)
> plot(B,SSR(B),ylim=c(0,ssr(3)),cex=.6,type="b")

so the value we have computed is actually the minimum of the sum of squares of errors. But note that the sum of squares always exceeds the total sum of squares in red on the graph above

> ssr(b)
[1] 6.857143
> sum((y-mean(y))^2)
[1] 2

Recall that the total “coefficient of variation“, denoted , is defined as

i.e.

> 1-ssr(b)/sum((y-mean(y))^2)
[1] -2.428571

which is negative. It is also sometimes defined as “the square of the sample correlation coefficient between the outcomes and their predicted values“. Here it would be related to

> cor(b*x,y)
[1] -1

so we would have a unit . So obviously, using the in a model without a constant would give odd results. But the weird part is that if we run that regression with R, we get

> summary(lm(y~0+x))

Call:
lm(formula = y ~ 0 + x)

Residuals:
1       2       3
2.2857  0.5714 -1.1429

Coefficients:
Estimate Std. Error t value Pr(>|t|)
x   0.7143     0.4949   1.443    0.286

Residual standard error: 1.852 on 2 degrees of freedom
Multiple R-squared: 0.5102,	Adjusted R-squared: 0.2653
F-statistic: 2.083 on 1 and 2 DF,  p-value: 0.2857 

Here, the estimation is correct. But the we obtain tells us that the model is not that bad… So if anyone knows what R computes, I’d be glad to know. The value given by R (thanks Vincent for asking me to look carefully at the R source code) is obtained using Pythagoras’s theorem to compute the total sum of square,

> sum((b*x)^2)/(sum((b*x)^2)+sum((y-b*x)^2))
[1] 0.5102041

So be careful, the might look good, but meaningless !