# Why using R-squared is a bad idea

Originally from **Heidi's stats blog - Rbloggers**, kindly contributed to R-bloggers.


The coefficient of determination, also known as $R^2$, is often used to judge whether a model is good.

The Wikipedia article says that $R^2$ “[…] is a number that indicates how well data fit a statistical model – sometimes simply a line or a curve. An $R^2$ of 1 indicates that the regression line perfectly fits the data, while an $R^2$ of 0 indicates that the line does not fit the data at all”.
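To make the definition concrete, here is a minimal sketch of how $R^2$ can be computed by hand for a linear model with an intercept, as one minus the ratio of the residual sum of squares to the total sum of squares. The toy data and the names `x_toy`, `y_toy` and `fit` are made up for illustration and are not the data used later in this post.

```r
# Hypothetical toy data, only to illustrate the formula
set.seed(1)
x_toy <- rnorm(50)
y_toy <- 1 + 2 * x_toy + rnorm(50)
fit <- lm(y_toy ~ x_toy)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y_toy - mean(y_toy))^2)   # total sum of squares around the mean
r2_manual <- 1 - rss / tss

# Should agree with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```

An $R^2$ of 1 corresponds to `rss` being zero, and an $R^2$ of 0 to the model doing no better than predicting the mean of the outcome.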

From this we could conclude that we can use this measure to indicate how good our model is. There are, however, two major problems with that conclusion:

- Sometimes just having a model that is a little better than random guessing is already great.
- Adding useless covariates to the model improves $R^2$.

The first point is best illustrated by the stock market: if I have a model that is just a little bit better than random guessing, I will be rich.

The second point I want to demonstrate with an example.

```
set.seed(17)
n <- 200
# One informative covariate and a very noisy outcome
x <- rnorm(n = n, mean = 1, sd = 1)
y <- 2 + 5 * x + rnorm(n = n, mean = 0, sd = 20)
df <- data.frame(y = y, x = x)
lmod <- lm(y ~ x, data = df)
summary(lmod)
```

```
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -63.303 -12.432  -0.923  12.558  58.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.336      1.838   0.727    0.468
## x              6.749      1.271   5.310 2.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.63 on 198 degrees of freedom
## Multiple R-squared: 0.1246, Adjusted R-squared: 0.1202
## F-statistic: 28.19 on 1 and 198 DF, p-value: 2.935e-07
```

`summary(lmod)$r.squared`

`## [1] 0.1246397`

So we have an outcome $y$ that depends on a covariate $x$, but the noise is very high, so our $R^2$ is pretty low. Let’s add further useless covariates to the model.
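To see why the $R^2$ is low here, it can help to work out the value implied by the simulation settings. For `y = 2 + 5 * x + noise` with `sd(x) = 1` and a noise standard deviation of 20, the share of variance explained by `x` is 25 / (25 + 400), roughly 0.059. The sketch below computes this and checks it against one large simulated sample; `n_big` and the other names are illustrative choices, not part of the original analysis.

```r
# Population R^2 implied by y = 2 + 5 * x + noise
var_signal <- 5^2 * 1^2   # variance of 5 * x when sd(x) = 1
var_noise  <- 20^2        # variance of the noise term
r2_population <- var_signal / (var_signal + var_noise)
r2_population

# Large-sample check (n_big is an arbitrary, illustrative sample size)
set.seed(17)
n_big <- 1e5
x_big <- rnorm(n_big, mean = 1, sd = 1)
y_big <- 2 + 5 * x_big + rnorm(n_big, mean = 0, sd = 20)
r2_big <- summary(lm(y_big ~ x_big))$r.squared
r2_big
```

The sample value of 0.1246 reported above is somewhat higher than this, which is plausible for a single sample of only 200 observations.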

```
# Collect R^2 for each number of added noise covariates
R2 <- data.frame(p_useless = NULL, R2 = NULL, adjusted = NULL)
for (i in 1:(n - 1)) {
  # Append one pure-noise column that is unrelated to y
  j <- ncol(df)
  df[, j + 1] <- rnorm(n = n, mean = 0, sd = 1)
  # Refit with all columns as predictors and store R^2
  tmp <- data.frame(p_useless = i,
                    R2 = c(summary(lm(y ~ ., data = df))$r.squared)
  )
  R2 <- rbind(R2, tmp)
}
library(ggplot2)
ggplot(data = R2, aes(x = p_useless, y = R2)) +
  geom_point() +
  ylab(expression(R^2)) +
  xlab("number of useless covariates")
```

And voilà: the more random covariates we add, the better the model looks according to $R^2$. That does not make sense, right?
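The reason is that least squares minimises the residual sum of squares, and the smaller model is just the larger one with the new coefficient fixed at zero, so adding a column can never make the fit worse on the training data. The sketch below checks this on a shorter run with the same data-generating process. It also records the adjusted $R^2$ that `summary()` reports, which has no such guarantee. The names `dat`, `n_noise`, `r2` and `adj_r2` and the choice of 50 noise columns are assumptions made for illustration.

```r
# Illustrative sketch: same data-generating process, fewer noise columns
set.seed(17)
n_obs <- 200
x_sig <- rnorm(n_obs, mean = 1, sd = 1)
dat <- data.frame(y = 2 + 5 * x_sig + rnorm(n_obs, mean = 0, sd = 20),
                  x = x_sig)

n_noise <- 50
r2 <- numeric(n_noise)
adj_r2 <- numeric(n_noise)
for (k in seq_len(n_noise)) {
  dat[[paste0("noise", k)]] <- rnorm(n_obs)   # covariate unrelated to y
  s <- summary(lm(y ~ ., data = dat))
  r2[k] <- s$r.squared
  adj_r2[k] <- s$adj.r.squared
}

# R^2 never goes down when a column is added (up to floating-point error)
all(diff(r2) >= -1e-12)

# Adjusted R^2 is not guaranteed to increase
range(adj_r2)
```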
