Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),
> library(gdata)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)="GDP"
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)="MORTALITY"
> data=merge(data1,data2)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression – – we get
> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)

Call:
lm(formula = MORTALITY ~ GDP, data = data)

Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14 
We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3) > x=seq(-10000,100000,length=101) > y=predict(regBB,newdata=data.frame(GDP=x), + interval="prediction",level = 0.9) > lines(x,y[,1],col="red") > lines(x,y[,2],col="red",lty=2) > lines(x,y[,3],col="red",lty=2) We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood, > boxcox(regBB) it looks like taking the logarithm of the mortality rate should be better, i.e. or : > regLB=lm(log(MORTALITY)~GDP,data=data) > summary(regLB) Call: lm(formula = log(MORTALITY) ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -1.3035 -0.5837 -0.1138 0.5597 3.0583 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.989e+00 7.970e-02 50.05 <2e-16 *** GDP -6.487e-05 4.142e-06 -15.66 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7666 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.5949, Adjusted R-squared: 0.5925 F-statistic: 245.3 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5,log="y") > text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2) on the log scale or > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5) on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals. But why shouldn't we take also the logarithm of the GDP ? We can fit a model or equivalently . > regLL=lm(log(MORTALITY)~log(GDP),data=data) > summary(regLL) Call: lm(formula = log(MORTALITY) ~ log(GDP), data = data) Residuals: Min 1Q Median 3Q Max -1.13200 -0.38326 -0.07127 0.26610 3.02212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.50192 0.31556 33.28 <2e-16 *** log(GDP) -0.83496 0.03548 -23.54 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5797 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.7684, Adjusted R-squared: 0.767 F-statistic: 554 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5,log="xy") > text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2) on the log scales or > plot(data$GDP,data\$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5) on the standard scale. If we compare the last two predictions, we have with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).