# Visualization in regression analysis

February 23, 2012
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),

> library(gdata)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression –   – we get

> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)

Call:
lm(formula = MORTALITY ~ GDP, data = data)

Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14

We can look at the scatter plot, including the linear regression line, and some confidence bounds,

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3) > x=seq(-10000,100000,length=101) > y=predict(regBB,newdata=data.frame(GDP=x), + interval="prediction",level = 0.9) > lines(x,y[,1],col="red") > lines(x,y[,2],col="red",lty=2) > lines(x,y[,3],col="red",lty=2) We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood, > boxcox(regBB) it looks like taking the logarithm of the mortality rate should be better, i.e. or : > regLB=lm(log(MORTALITY)~GDP,data=data) > summary(regLB) Call: lm(formula = log(MORTALITY) ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -1.3035 -0.5837 -0.1138 0.5597 3.0583 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.989e+00 7.970e-02 50.05 <2e-16 *** GDP -6.487e-05 4.142e-06 -15.66 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7666 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.5949, Adjusted R-squared: 0.5925 F-statistic: 245.3 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5,log="y") > text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2) on the log scale or > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5) on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals. But why shouldn’t we take also the logarithm of the GDP ? We can fit a model or equivalently . > regLL=lm(log(MORTALITY)~log(GDP),data=data) > summary(regLL) Call: lm(formula = log(MORTALITY) ~ log(GDP), data = data) Residuals: Min 1Q Median 3Q Max -1.13200 -0.38326 -0.07127 0.26610 3.02212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.50192 0.31556 33.28 <2e-16 *** log(GDP) -0.83496 0.03548 -23.54 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5797 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.7684, Adjusted R-squared: 0.767 F-statistic: 554 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5,log="xy") > text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2) on the log scales or > plot(data$GDP,data\$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).

### Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.