On linear models with no constant and R2

February 2, 2012
(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

In econometrics courses we always tell our students that "if you fit a linear model with no constant, then you might have trouble. For instance, you might get a negative R-squared". So I tried to find datasets on the internet such that, when we run a linear regression, we actually obtain a negative R-squared. I generated hundreds of random datasets that should exhibit such a property, in R. With no success. Perhaps, to be more specific, I should explain what might happen if we do not include a constant in a linear model. Consider the following dataset, where the points lie on a straight line with a negative slope, far from the origin, and symmetric with respect to the first diagonal.

> x=1:3
> y=3:1
> plot(x,y)

Points are on a straight line, so it is actually possible to get a perfect linear model. But only if we include a constant in our model. This is related to the fact that the correlation between our two variates is -1,

> cor(x,y)
[1] -1
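Indeed, with a constant the fit is exact: the points satisfy y = 4 - x, so the intercept is 4, the slope is -1, and the R-squared is 1. A quick sketch (not in the original post) to check this:

```r
x <- 1:3
y <- 3:1

# fit the model WITH an intercept
fit <- lm(y ~ x)
coef(fit)                 # intercept 4, slope -1: the exact line y = 4 - x
summary(fit)$r.squared    # equals 1: a perfect fit
```

(R will warn that the fit is "essentially perfect", which is precisely the point here.)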

The least-squares problem is here

min_b  sum_i (y_i - b x_i)^2

i.e. the estimate of the slope is

b = sum_i (x_i y_i) / sum_i (x_i^2)

Numerically, we obtain

> b=sum(x*y)/sum(x^2)
> b
[1] 0.7142857

which is the actual slope on the illustration above. If we compute the sum of squares of errors (as a function of the slope), we have here

> ssr=function(b){sum((y-b*x)^2)}
> SSR=Vectorize(ssr)
> B=seq(-1,3,by=.1)
> plot(B,SSR(B),ylim=c(0,ssr(3)),cex=.6,type="b")
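As a check (not in the original post), one can also minimize the sum of squares numerically and compare with the closed-form slope:

```r
x <- 1:3
y <- 3:1
ssr <- function(b) sum((y - b * x)^2)

# numerical minimum of the SSR over a bracket containing the solution
opt <- optimize(ssr, interval = c(-1, 3))
opt$minimum    # close to the closed-form slope sum(x*y)/sum(x^2) = 0.7142857
```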

so the value we have computed is indeed the minimum of the sum of squared errors. But note that this sum of squares always exceeds the total sum of squares (shown in red on the graph above),

> ssr(b)
[1] 6.857143
> sum((y-mean(y))^2)
[1] 2

Recall that the coefficient of determination, denoted R^2, is defined as

R^2 = 1 - sum_i (y_i - yhat_i)^2 / sum_i (y_i - ybar)^2

i.e.

> 1-ssr(b)/sum((y-mean(y))^2)
[1] -2.428571

which is negative. It is also sometimes defined as "the square of the sample correlation coefficient between the outcomes and their predicted values". Here it would be related to 

> cor(b*x,y)
[1] -1

so we would have a unit R^2. So obviously, using the R^2 in a model without a constant would give odd results. But the weird part is that if we run that regression with R, we get

> summary(lm(y~0+x))
 
Call:
lm(formula = y ~ 0 + x)
 
Residuals:
1       2       3
2.2857  0.5714 -1.1429
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x   0.7143     0.4949   1.443    0.286
 
Residual standard error: 1.852 on 2 degrees of freedom
Multiple R-squared: 0.5102,	Adjusted R-squared: 0.2653
F-statistic: 2.083 on 1 and 2 DF,  p-value: 0.2857 

Here, the estimate is correct. But the R^2 we obtain tells us that the model is not that bad... So if anyone knows what R computes, I'd be glad to know. The value given by R (thanks Vincent for asking me to look carefully at the R source code) is obtained using Pythagoras's theorem to compute the total sum of squares,

> sum((b*x)^2)/(sum((b*x)^2)+sum((y-b*x)^2))
[1] 0.5102041
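Since the residuals are orthogonal to the fitted values here (the normal equation gives sum(x*(y - b*x)) = 0, so sum(y^2) splits into fitted and residual parts), this is the same as 1 - RSS/sum(y^2), which is what R computes when the intercept is dropped. A quick check of the equivalence:

```r
x <- 1:3
y <- 3:1

b   <- sum(x * y) / sum(x^2)    # least-squares slope without intercept
rss <- sum((y - b * x)^2)       # residual sum of squares

# R's no-intercept R-squared: total sum of squares taken around 0, not ybar
r2 <- 1 - rss / sum(y^2)
r2                              # 0.5102041, matching summary(lm(y ~ 0 + x))
```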

So be careful: the R^2 might look good, but be meaningless!
