# Why pictures are so important when modeling data?

October 31, 2012
By

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

(bis repetita) Consider the following regression summary,

```Call:
lm(formula = y1 ~ x1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,	Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217 ```
obtained from
```> data(anscombe)
> reg1=lm(y1~x1,data=anscombe)```
Can we say something if we look (only) at that output ? The intercept is significatively non-null, as well as the slope, the $R^2$ is large (66%). It looks like we do have a nice model here. And in a perfect world, we might hope that data are coming from this kind of dataset,
But it might be possible to have completely different kinds of patterns. Actually, four differents sets of data are coming from Anscombe (1973). And that all those datasets are somehow equivalent: the $X$‘s have the same mean, and the same variance
```> apply(anscombe[,1:4],2,mean)
x1 x2 x3 x4
9  9  9  9
> apply(anscombe[,1:4],2,var)
x1 x2 x3 x4
11 11 11 11 ```
and so are the $Y$‘s
```> apply(anscombe[,5:8],2,mean)
y1       y2       y3       y4
7.500909 7.500909 7.500000 7.500909
> apply(anscombe[,5:8],2,var)
y1       y2       y3       y4
4.127269 4.127629 4.122620 4.123249 ```
Further, observe also that the correlation between the $X$‘s and the $Y$‘s is the same
```> cor(anscombe)[1:4,5:8]
y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
> diag(cor(anscombe)[1:4,5:8])
[1] 0.8164205 0.8162365 0.8162867 0.8165214```
which yields the same regression line (intercept and slope)
```> cbind(coef(reg1),coef(reg2),coef(reg3),coef(reg4))
[,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091```
But there is more. Much more. For instance, we always have the standard deviation for residuals
```> c(summary(reg1)\$sigma,summary(reg2)\$sigma,
+ summary(reg3)\$sigma,summary(reg4)\$sigma)
[1] 1.236603 1.237214 1.236311 1.235695```
Thus, all regressions here have the same R2
```> c(summary(reg1)\$r.squared,summary(reg2)\$r.squared,
+ summary(reg3)\$r.squared,summary(reg4)\$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073```
Finally, Fisher’s F statistics is also (almost) the same.
```+ c(summary(reg1)\$fstatistic[1],summary(reg2)\$fstatistic[1],
+ summary(reg3)\$fstatistic[1],summary(reg4)\$fstatistic[1])
value    value    value    value
17.98994 17.96565 17.97228 18.00329 ```
Thus, with the following datasets, we have the same prediction (and the same confidence intervals). Consider for instance the second dataset (the first one being mentioned above),
`> reg2=lm(y2~x2,data=anscombe)`
The output is here exactly the same as the one we had above
```> summary(reg2)

Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179 ```
Here, the perfect model is the one obtained with a quadratic regression.
```> reg2b=lm(y2~x2+I(x2^2),data=anscombe)
> summary(reg2b)

Call:
lm(formula = y2 ~ x2 + I(x2^2), data = anscombe)

Residuals:
Min         1Q     Median         3Q        Max
-0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.9957343  0.0043299   -1385   <2e-16 ***
x2           2.7808392  0.0010401    2674   <2e-16 ***
I(x2^2)     -0.1267133  0.0000571   -2219   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16 ```
Consider now the third one
`> reg3=lm(y3~x3,data=anscombe)`
i.e.
```> summary(reg3)

Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176 ```
This time, the linear model could have been perfect. The problem is one outlier. If we remove it, we have
```> reg3b=lm(y3~x3,data=anscombe[-3,])
> summary(reg3b)

Call:
lm(formula = y3 ~ x3, data = anscombe[-3, ])

Residuals:
Min         1Q     Median         3Q        Max
-0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
x3          0.3453896  0.0003206    1077   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003082 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16 ```
Finally consider
`> reg4=lm(y4~x4,data=anscombe)`
This time, there is an other kind of outlier, in $X$‘s, but again, the regression is exactly the same,
```> summary(reg4)

Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667,	Adjusted R-squared: 0.6297
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165 ```
The graph is here
So clearly, looking at the summary of a regression does not tell us anything… This is why we do spend some time on diagnostic, looking at graphs with the errors (the graphs above could be obtained only with one explanatory variable, while errors can be studied in any dimension): everything can be seen on thise graphs. E.g. for the first dataset,
or the second one
the third one
or the fourth one,

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,