Visualization in regression analysis

February 23, 2012
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),

> library(gdata)
> XLS1=read.xls("http://api.worldbank.org/datafiles
/NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> XLS2=read.xls("http://api.worldbank.org/datafiles
/SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
> head(data)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression – http://freakonometrics.blog.free.fr/public/perso5/logormal01.gif  – we get

> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)

Call:
lm(formula = MORTALITY ~ GDP, data = data)

Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14

We can look at the scatter plot, including the linear regression line, and some confidence bounds,

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3)
> x=seq(-10000,100000,length=101)
> y=predict(regBB,newdata=data.frame(GDP=x),
+ interval="prediction",level = 0.9)
> lines(x,y[,1],col="red")
> lines(x,y[,2],col="red",lty=2)
> lines(x,y[,3],col="red",lty=2)

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,

> boxcox(regBB)

it looks like taking the logarithm of the mortality rate should be better, i.e. http://freakonometrics.blog.free.fr/public/perso5/lognormal02.gif or http://freakonometrics.blog.free.fr/public/perso5/lognormal05.gif:

> regLB=lm(log(MORTALITY)~GDP,data=data)
> summary(regLB)

Call:
lm(formula = log(MORTALITY) ~ GDP, data = data)

Residuals:
Min      1Q  Median      3Q     Max
-1.3035 -0.5837 -0.1138  0.5597  3.0583

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.989e+00  7.970e-02   50.05   <2e-16 ***
GDP         -6.487e-05  4.142e-06  -15.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7666 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.5949,	Adjusted R-squared: 0.5925
F-statistic: 245.3 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5,log="y")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scale or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5)

on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.

But why shouldn’t we take also the logarithm of the GDP ? We can fit a model http://freakonometrics.blog.free.fr/public/perso5/lognormal03.gif or equivalently http://freakonometrics.blog.free.fr/public/perso5/lognormal04.gif.

> regLL=lm(log(MORTALITY)~log(GDP),data=data)
> summary(regLL)

Call:
lm(formula = log(MORTALITY) ~ log(GDP), data = data)

Residuals:
Min       1Q   Median       3Q      Max
-1.13200 -0.38326 -0.07127  0.26610  3.02212

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.50192    0.31556   33.28   <2e-16 ***
log(GDP)    -0.83496    0.03548  -23.54   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5797 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.7684,	Adjusted R-squared: 0.767
F-statistic:   554 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5,log="xy")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scales or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).

Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.

More Posts - Website

Follow Me:
TwitterLinkedInGoogle Plus

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.