(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)
Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),> library(gdata) > XLS1=read.xls("http://api.worldbank.org/datafiles /NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1) > data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")] > names(data1)[3]="GDP" > XLS2=read.xls("http://api.worldbank.org/datafiles /SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1) > data2=XLS2[-(1:28),c("Country.Code","X2010")] > names(data2)[2]="MORTALITY" > data=merge(data1,data2) > head(data) Country.Code Country.Name GDP MORTALITY 1 ABW Aruba NA NA 2 AFG Afghanistan 1207.278 149.2 3 AGO Angola 6119.930 160.5 4 ALB Albania 8817.009 18.4 5 AND Andorra NA 3.8 6 ARE United Arab Emirates 47215.315 7.1
If we estimate a simple linear regression -
- we get> regBB=lm(MORTALITY~GDP,data=data) > summary(regBB) Call: lm(formula = MORTALITY ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -45.24 -29.58 -12.12 16.19 115.83 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 67.1008781 4.1577411 16.139 < 2e-16 *** GDP -0.0017887 0.0002161 -8.278 3.83e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 39.99 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.2909, Adjusted R-squared: 0.2867 F-statistic: 68.53 on 1 and 167 DF, p-value: 3.834e-14We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5)",cex=.5) > text(data$GDP,data$MORTALITY,data$Country.Name,pos=3) > x=seq(-10000,100000,length=101) > y=predict(regBB,newdata=data.frame(GDP=x), + interval="prediction",level = 0.9) > lines(x,y[,1],col="red") > lines(x,y[,2],col="red",lty=2) > lines(x,y[,3],col="red",lty=2)

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,
> boxcox(regBB)

it looks like taking the logarithm of the mortality rate should be better, i.e.
or
:
> regLB=lm(log(MORTALITY)~GDP,data=data) > summary(regLB) Call: lm(formula = log(MORTALITY) ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -1.3035 -0.5837 -0.1138 0.5597 3.0583 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.989e+00 7.970e-02 50.05 <2e-16 *** GDP -6.487e-05 4.142e-06 -15.66 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7666 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.5949, Adjusted R-squared: 0.5925 F-statistic: 245.3 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5,log="y") > text(data$GDP,data$MORTALITY,data$Country.Name) > x=seq(300,100000,length=101) > y=exp(predict(regLB,newdata=data.frame(GDP=x)))* + exp(summary(regLB)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2) > y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2)

on the log scale or
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5)

on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.
But why shouldn't we take also the logarithm of the GDP ? We can fit a model
or equivalently
.
> regLL=lm(log(MORTALITY)~log(GDP),data=data) > summary(regLL) Call: lm(formula = log(MORTALITY) ~ log(GDP), data = data) Residuals: Min 1Q Median 3Q Max -1.13200 -0.38326 -0.07127 0.26610 3.02212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.50192 0.31556 33.28 <2e-16 *** log(GDP) -0.83496 0.03548 -23.54 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5797 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.7684, Adjusted R-squared: 0.767 F-statistic: 554 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5,log="xy") > text(data$GDP,data$MORTALITY,data$Country.Name) > x=exp(seq(1,12,by=.1)) > y=exp(predict(regLL,newdata=data.frame(GDP=x)))* + exp(summary(regLL)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2) > y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2)

on the log scales or
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).
To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics - Tag - R-english.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).