That damn R-squared !

Posted on September 7, 2012 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers].

Another post about the R-squared coefficient, and about why, after some years teaching econometrics, I still hate it when students ask questions about it. Usually, it starts with “I have a _____ R-squared… isn’t it too low?” Please feel free to fill in the blank with your favorite (low) number. Say 0.2. To keep it simple, there are different answers to that question:

if you don’t want to waste time understanding econometrics, I would say something like “Forget about the R-squared, it is useless” (perhaps also “please, think twice about taking that econometrics course”)
if you’re ready to spend some time to get a better understanding of subtle concepts, I would say “I don’t like the R-squared. It might be interesting in some rare cases (you can probably count them on the fingers of one hand), like comparing two models on the same dataset (even so, I would recommend the adjusted one). But usually, its value has no meaning. You can compare 0.2 and 0.3 (and prefer the 0.3 R-squared model to the 0.2 R-squared one), but 0.2 means nothing”. Well, not exactly, since it means something, but it is not a measure that tells you whether you are dealing with a good or a bad model. Well, again, not exactly, but it is rather difficult to say where bad ends and where good starts. Actually, it is exactly like the correlation coefficient (well, there is nothing mysterious here since the R-squared can be related to some correlation coefficient, as mentioned in class)
if you want some more advanced advice, I would say “It’s complicated…” (and perhaps also “Look in a textbook written by someone cleverer than me, you can find hundreds of them in the library!”)
if you want me to act like people we’ve seen recently on TV (during electoral debates), “It’s extremely interesting, but before answering your question, let me tell you a story…”

Perhaps that last strategy is the best one, and I should focus on the story. I mean, this is exactly why I have my blog: to tell (nice) stories. With graphs, and math formulas inside. First of all, consider a regression model

so that the R-squared is defined as
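
In the usual textbook notation (an assumption here, as is the restriction to a single explanatory variable, which is what the simulations below use), the model and the R-squared can be written as

$$
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n,
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \overline{Y})^2},
$$

where $\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i$ denotes the least squares fitted value and $\overline{Y}$ the sample mean of the $Y_i$.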

Let us generate datasets, and then run regressions, to see what’s going on…

For instance, consider 20 observations, with one variable of interest, one explanatory variable, and some low-variance noise (to start with):

> set.seed(1)
> n=20
> X=runif(n)
> E=rnorm(n)
> Y=2+5*X+E*.5
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)
 
Call:
lm(formula = Y ~ X, data = base)
 
Residuals:
     Min       1Q   Median       3Q      Max
-1.15961 -0.17470  0.08719  0.29409  0.52719
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4706     0.2297   10.76 2.87e-09 ***
X             4.2042     0.3697   11.37 1.19e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.461 on 18 degrees of freedom
Multiple R-squared: 0.8778,	Adjusted R-squared: 0.871
F-statistic: 129.3 on 1 and 18 DF,  p-value: 1.192e-09
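
As a sanity check on the output above, the R-squared can be recomputed by hand from the residuals. The sketch below assumes the definition 1 - RSS/TSS and reruns the simulation from the transcript; the names `rss`, `tss` and `r2_by_hand` are introduced here for illustration only. With one regressor and an intercept, the same number is also the squared correlation between X and Y.

```r
# Same simulated data as in the transcript above
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)
Y <- 2 + 5 * X + E * 0.5
reg <- lm(Y ~ X)

# R-squared "by hand": 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(reg)^2)
tss <- sum((Y - mean(Y))^2)
r2_by_hand <- 1 - rss / tss

# With one regressor and an intercept, this equals the squared correlation
c(by_hand      = r2_by_hand,
  from_summary = summary(reg)$r.squared,
  squared_cor  = cor(X, Y)^2)
```

The three entries of the printed vector should agree up to rounding error.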

The R-squared is high (almost 0.9). What if the underlying model is exactly the same, but the noise now has a much higher variance?

> Y=2+5*X+E*4
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)
 
Call:
lm(formula = Y ~ X, data = base)
 
Residuals:
    Min      1Q  Median      3Q     Max
-9.2769 -1.3976  0.6976  2.3527  4.2175
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.765      1.837   3.138  0.00569 **
X             -1.367      2.957  -0.462  0.64953
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 3.688 on 18 degrees of freedom
Multiple R-squared: 0.01173,	Adjusted R-squared: -0.04318
F-statistic: 0.2136 on 1 and 18 DF,  p-value: 0.6495

Now, the R-squared is rather low (around 0.01). Thus, the quality of the regression clearly depends on the variance of the noise. The higher the variance, the lower the R-squared. And usually, there is not much you can do about it! On the graph below, the noise varies from none at all to extremely noisy, with the least squares regression in blue (and a confidence interval on the prediction)
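
The dependence on the noise variance can be made explicit. As a sketch, assuming the noise is independent of X with standard deviation $\sigma$ (the symbols $\beta_1$ and $\sigma$ are introduced here, not taken from the post), in large samples the R-squared of the simple regression approaches the share of the variance of Y carried by X:

$$
R^2 \approx \frac{\beta_1^2 \operatorname{Var}(X)}{\beta_1^2 \operatorname{Var}(X) + \sigma^2}.
$$

In the simulations above, $\beta_1 = 5$ and X is uniform on $[0,1]$, so $\operatorname{Var}(X) = 1/12$ and $\beta_1^2 \operatorname{Var}(X) = 25/12 \approx 2.08$. With $\sigma = 0.5$ the ratio is about $2.08 / (2.08 + 0.25) \approx 0.89$; with $\sigma = 4$ it is about $2.08 / (2.08 + 16) \approx 0.12$. With only 20 observations, the sample value can sit far from this limit, as the 0.01 obtained above shows.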

Comparing with the graph below, which now uses 100 observations (instead of 20), one can observe that the quality of the fit also depends on the sample size,

So far, nothing new (if you remember what was said in class). In those two cases, here is the evolution of the R-squared, as a function of the variance of the noise (more precisely, here, the standard deviation of the noise)

> S=seq(0,4,by=.2)
> R2=rep(NA,length(S))
> for(s in 1:length(S)){
+ Y=2+5*X+E*S[s]
+ base=data.frame(X,Y)
+ reg=lm(Y~X,data=base)
+ R2[s]=summary(reg)$r.squared}

with 20 observations in blue, 100 in black. One important point is that in econometrics, we rarely choose the number of observations. If we have only 100 observations, we have to deal with it. Similarly, if observations are quite noisy there is (usually) not much we can do about it. All the more so if you don’t have any explanatory variable left. Perhaps you might play (or try to play) with nonlinear effects…
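
The two curves described above (20 observations in blue, 100 in black) can be reproduced along the following lines. This is only a sketch: the helper `r2_by_noise()` is a hypothetical name, the seed used for the 100-observation sample is an assumption (that sample is not shown in the post), and the grid starts at 0.2 rather than 0 so that `summary()` is never asked to summarise an exact fit.

```r
# Hypothetical helper: R-squared of Y ~ X for each noise standard deviation
# in S, keeping the same draws of X and E and only rescaling the noise
r2_by_noise <- function(n, S, seed = 1) {
  set.seed(seed)
  X <- runif(n)
  E <- rnorm(n)
  sapply(S, function(s) {
    Y <- 2 + 5 * X + E * s
    summary(lm(Y ~ X))$r.squared
  })
}

S <- seq(0.2, 4, by = 0.2)      # noise standard deviations
R2_20  <- r2_by_noise(20, S)    # small sample
R2_100 <- r2_by_noise(100, S)   # larger sample

plot(S, R2_100, type = "b", col = "black", ylim = c(0, 1),
     xlab = "standard deviation of the noise", ylab = "R-squared")
lines(S, R2_20, type = "b", col = "blue")
legend("topright", legend = c("n = 20", "n = 100"),
       col = c("blue", "black"), lty = 1, pch = 1)
```

Because the same X and E are reused for every value in `S`, each curve shows the effect of the noise level alone, not of a fresh random draw.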

Nevertheless, it looks like some econometricians really care about the R-squared, and cannot imagine looking at a model if the R-squared is lower than, say, 0.4. It is always possible to reach that level! You just have to add more covariates! If you have some… And if you don’t, it is always possible to use polynomials of a continuous covariate. For instance, on the previous example,

> S=seq(1,25,by=1)
> R2=rep(NA,length(S))
> for(s in 1:length(S)){
+ reg=lm(Y~poly(X,degree=s),data=base)
+ R2[s]=summary(reg)$r.squared}

If we plot the R-squared as a function of the degree of the polynomial regression, we have the following graph. Once again, the higher the degree, the more covariates, and the more covariates, the higher the R-squared.
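
Note that the loop above needs more observations than polynomial terms: `poly()` requires the degree to be smaller than the number of distinct X values, so degrees up to 25 cannot be fitted on the 20 observations of the first example. The sketch below therefore assumes 100 observations and the noisy case (standard deviation 4), with a seed chosen here, and caps the degree at 15, which is an arbitrary choice. It also tracks the adjusted R-squared, since only the plain R-squared is guaranteed not to decrease when terms are added.

```r
# Assumed setting: 100 observations, noisy case (not the post's exact sample)
set.seed(1)
n <- 100
X <- runif(n)
E <- rnorm(n)
Y <- 2 + 5 * X + E * 4
base <- data.frame(X, Y)

# Fit nested polynomial regressions of increasing degree
degrees <- 1:15
fits <- lapply(degrees, function(d) lm(Y ~ poly(X, degree = d), data = base))
R2    <- sapply(fits, function(f) summary(f)$r.squared)
adjR2 <- sapply(fits, function(f) summary(f)$adj.r.squared)

# Plain R-squared in black, adjusted R-squared in red
plot(degrees, R2, type = "b", ylim = range(c(R2, adjR2)),
     xlab = "degree of the polynomial", ylab = "R-squared")
lines(degrees, adjR2, type = "b", col = "red")
legend("topleft", legend = c("R-squared", "adjusted R-squared"),
       col = c("black", "red"), lty = 1, pch = 1)
```

The adjusted version penalises each extra term, so it is the more sensible of the two when comparing fits with different degrees on the same data.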

I.e. with a polynomial of degree 22, it is possible to reach a 0.4 R-squared. But it might be interesting to look at the prediction we have with that model,

So, was it worth adding so many polynomial terms? I mean, 22 is quite a large power… Here, the linear regression was significant, but not great. So what? The R-squared was small? So what? Sometimes, there’s not much you can do about it… When dealing with individual observations (so-called micro-econometrics), the variable of interest might be extremely noisy, and there is not much you can do. So your R-squared can be small, but your regression is perhaps still better than doing nothing… The good thing with a low R-squared is perhaps that it reminds us that we have to remain modest when we build a model. And always be cautious about conclusions. At least, provide a confidence interval…
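
To make that last piece of advice concrete, here is a minimal sketch of how one might report confidence intervals alongside a low R-squared fit, using `confint()` for the coefficients and `predict()` with `interval = "confidence"` for the regression line. The data are the noisy 20-observation example from above; the grid `new_x` and the object `band` are names introduced here.

```r
# Noisy 20-observation example from the post
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)
Y <- 2 + 5 * X + E * 4
base <- data.frame(X, Y)
reg <- lm(Y ~ X, data = base)

# 95% confidence intervals for the intercept and the slope
confint(reg, level = 0.95)

# Pointwise 95% confidence band for the regression line on a grid of X values
new_x <- data.frame(X = seq(0, 1, by = 0.05))
band <- predict(reg, newdata = new_x, interval = "confidence", level = 0.95)

# Observations, fitted line (solid) and confidence band (dashed)
plot(base$X, base$Y, xlab = "X", ylab = "Y")
lines(new_x$X, band[, "fit"], col = "blue")
lines(new_x$X, band[, "lwr"], col = "blue", lty = 2)
lines(new_x$X, band[, "upr"], col = "blue", lty = 2)
```

A wide band is an honest summary of a noisy sample, and it says more about what the model can and cannot tell us than the R-squared alone.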
