Visualization in regression analysis

February 23, 2012
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),

```> library(gdata)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1```

If we estimate a simple linear regression –   – we get

```> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)

Call:
lm(formula = MORTALITY ~ GDP, data = data)

Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14 ```

We can look at the scatter plot, including the linear regression line, and some confidence bounds,

```> plot(data\$GDP,data\$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data\$GDP,data\$MORTALITY,data\$Country.Name,pos=3)
> x=seq(-10000,100000,length=101)
> y=predict(regBB,newdata=data.frame(GDP=x),
+ interval="prediction",level = 0.9)
> lines(x,y[,1],col="red")
> lines(x,y[,2],col="red",lty=2)
> lines(x,y[,3],col="red",lty=2)```

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,

`> boxcox(regBB)`

it looks like taking the logarithm of the mortality rate should be better, i.e. or :

```> regLB=lm(log(MORTALITY)~GDP,data=data)
> summary(regLB)

Call:
lm(formula = log(MORTALITY) ~ GDP, data = data)

Residuals:
Min      1Q  Median      3Q     Max
-1.3035 -0.5837 -0.1138  0.5597  3.0583

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.989e+00  7.970e-02   50.05   <2e-16 ***
GDP         -6.487e-05  4.142e-06  -15.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7666 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.5949,	Adjusted R-squared: 0.5925
F-statistic: 245.3 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data\$GDP,data\$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5,log="y")
> text(data\$GDP,data\$MORTALITY,data\$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)\$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)\$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)\$sigma^2)
> lines(x,y,col="red",lty=2)```

on the log scale or

```> plot(data\$GDP,data\$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5)```

on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.

But why shouldn’t we take also the logarithm of the GDP ? We can fit a model or equivalently .

```> regLL=lm(log(MORTALITY)~log(GDP),data=data)
> summary(regLL)

Call:
lm(formula = log(MORTALITY) ~ log(GDP), data = data)

Residuals:
Min       1Q   Median       3Q      Max
-1.13200 -0.38326 -0.07127  0.26610  3.02212

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.50192    0.31556   33.28   <2e-16 ***
log(GDP)    -0.83496    0.03548  -23.54   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5797 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.7684,	Adjusted R-squared: 0.767
F-statistic:   554 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data\$GDP,data\$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5,log="xy")
> text(data\$GDP,data\$MORTALITY,data\$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)\$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)\$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)\$sigma^2)
> lines(x,y,col="red",lty=2)```

on the log scales or

```> plot(data\$GDP,data\$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)```

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.