Why pictures are so important when modeling data?

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

(bis repetita) Consider the following regression summary,
Call:
lm(formula = y1 ~ x1)
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,	Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217 
obtained from
> data(anscombe)
> reg1=lm(y1~x1,data=anscombe)
Can we say something if we look (only) at that output ? The intercept is significatively non-null, as well as the slope, the  is large (66%). It looks like we do have a nice model here. And in a perfect world, we might hope that data are coming from this kind of dataset, 
But it might be possible to have completely different kinds of patterns. Actually, four differents sets of data are coming from Anscombe (1973). And that all those datasets are somehow equivalent: the ‘s have the same mean, and the same variance
> apply(anscombe[,1:4],2,mean)
x1 x2 x3 x4
9  9  9  9
> apply(anscombe[,1:4],2,var)
x1 x2 x3 x4
11 11 11 11 
and so are the ‘s
> apply(anscombe[,5:8],2,mean)
y1       y2       y3       y4
7.500909 7.500909 7.500000 7.500909
> apply(anscombe[,5:8],2,var)
y1       y2       y3       y4
4.127269 4.127629 4.122620 4.123249 
Further, observe also that the correlation between the ‘s and the ‘s is the same
> cor(anscombe)[1:4,5:8]
y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
> diag(cor(anscombe)[1:4,5:8])
[1] 0.8164205 0.8162365 0.8162867 0.8165214
which yields the same regression line (intercept and slope)
> cbind(coef(reg1),coef(reg2),coef(reg3),coef(reg4))
[,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
But there is more. Much more. For instance, we always have the standard deviation for residuals
> c(summary(reg1)$sigma,summary(reg2)$sigma,
+ summary(reg3)$sigma,summary(reg4)$sigma)
[1] 1.236603 1.237214 1.236311 1.235695
Thus, all regressions here have the same R2
> c(summary(reg1)$r.squared,summary(reg2)$r.squared,
+ summary(reg3)$r.squared,summary(reg4)$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
Finally, Fisher’s F statistics is also (almost) the same.
+ c(summary(reg1)$fstatistic[1],summary(reg2)$fstatistic[1],
+ summary(reg3)$fstatistic[1],summary(reg4)$fstatistic[1])
value    value    value    value
17.98994 17.96565 17.97228 18.00329 
Thus, with the following datasets, we have the same prediction (and the same confidence intervals). Consider for instance the second dataset (the first one being mentioned above),
> reg2=lm(y2~x2,data=anscombe)
The output is here exactly the same as the one we had above
> summary(reg2)
 
Call:
lm(formula = y2 ~ x2, data = anscombe)
 
Residuals:
Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179 
Here, the perfect model is the one obtained with a quadratic regression.
> reg2b=lm(y2~x2+I(x2^2),data=anscombe)
> summary(reg2b)
 
Call:
lm(formula = y2 ~ x2 + I(x2^2), data = anscombe)
 
Residuals:
Min         1Q     Median         3Q        Max
-0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.9957343  0.0043299   -1385   <2e-16 ***
x2           2.7808392  0.0010401    2674   <2e-16 ***
I(x2^2)     -0.1267133  0.0000571   -2219   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16 
Consider now the third one
> reg3=lm(y3~x3,data=anscombe)
i.e.
> summary(reg3)
 
Call:
lm(formula = y3 ~ x3, data = anscombe)
 
Residuals:
Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176 
This time, the linear model could have been perfect. The problem is one outlier. If we remove it, we have
> reg3b=lm(y3~x3,data=anscombe[-3,])
> summary(reg3b)
 
Call:
lm(formula = y3 ~ x3, data = anscombe[-3, ])
 
Residuals:
Min         1Q     Median         3Q        Max
-0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
x3          0.3453896  0.0003206    1077   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.003082 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16 
Finally consider
> reg4=lm(y4~x4,data=anscombe)
This time, there is an other kind of outlier, in 's, but again, the regression is exactly the same,
> summary(reg4)
 
Call:
lm(formula = y4 ~ x4, data = anscombe)
 
Residuals:
Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667,	Adjusted R-squared: 0.6297
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165 
The graph is here
So clearly, looking at the summary of a regression does not tell us anything... This is why we do spend some time on diagnostic, looking at graphs with the errors (the graphs above could be obtained only with one explanatory variable, while errors can be studied in any dimension): everything can be seen on thise graphs. E.g. for the first dataset,
or the second one
the third one
or the fourth one,

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)