Visualization in regression analysis

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),
> library(gdata)
> XLS1=read.xls("http://api.worldbank.org/datafiles
/NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> XLS2=read.xls("http://api.worldbank.org/datafiles
/SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
> head(data)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression – https://i2.wp.com/freakonometrics.blog.free.fr/public/perso5/logormal01.gif?w=578  – we get
> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)
 
Call:
lm(formula = MORTALITY ~ GDP, data = data)
 
Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14 
We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3)
> x=seq(-10000,100000,length=101)
> y=predict(regBB,newdata=data.frame(GDP=x),
+ interval="prediction",level = 0.9)
> lines(x,y[,1],col="red")
> lines(x,y[,2],col="red",lty=2)
> lines(x,y[,3],col="red",lty=2)

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,

> boxcox(regBB)

it looks like taking the logarithm of the mortality rate should be better, i.e. https://i0.wp.com/freakonometrics.blog.free.fr/public/perso5/lognormal02.gif?w=578 or https://i2.wp.com/freakonometrics.blog.free.fr/public/perso5/lognormal05.gif?w=578:

> regLB=lm(log(MORTALITY)~GDP,data=data)
> summary(regLB)
 
Call:
lm(formula = log(MORTALITY) ~ GDP, data = data)
 
Residuals:
Min      1Q  Median      3Q     Max
-1.3035 -0.5837 -0.1138  0.5597  3.0583
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.989e+00  7.970e-02   50.05   <2e-16 ***
GDP         -6.487e-05  4.142e-06  -15.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.7666 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.5949,	Adjusted R-squared: 0.5925
F-statistic: 245.3 on 1 and 167 DF,  p-value: < 2.2e-16
 
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5,log="y")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scale or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5)

on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.

But why shouldn't we take also the logarithm of the GDP ? We can fit a model https://i1.wp.com/freakonometrics.blog.free.fr/public/perso5/lognormal03.gif?w=578 or equivalently https://i1.wp.com/freakonometrics.blog.free.fr/public/perso5/lognormal04.gif?w=578.


> regLL=lm(log(MORTALITY)~log(GDP),data=data)
> summary(regLL)
 
Call:
lm(formula = log(MORTALITY) ~ log(GDP), data = data)
 
Residuals:
Min       1Q   Median       3Q      Max
-1.13200 -0.38326 -0.07127  0.26610  3.02212
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.50192    0.31556   33.28   <2e-16 ***
log(GDP)    -0.83496    0.03548  -23.54   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.5797 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.7684,	Adjusted R-squared: 0.767
F-statistic:   554 on 1 and 167 DF,  p-value: < 2.2e-16
 
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5,log="xy")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scales or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)