On Box-Cox transform in regression models

November 13, 2012
By

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

A few days ago, a former student of mine, David, contacted me about Box-Cox tests in linear models. It made me look more carefully at the test, and I do not understand what is computed, to be honest. Let us start with something simple, like a linear simple regression, i.e.

http://latex.codecogs.com/gif.latex?Y_i=\beta_0+\beta_1%20X_i+\varepsilon_i

Let us introduced - as suggested in Box & Cox (1964) - the following family of (power) transformations

http://latex.codecogs.com/gif.latex?Y_i^{(\lambda)}%20=%20\begin{cases}%20\dfrac{Y_i^\lambda-1}{\lambda}%20&\text{%20%20}%20(\lambda%20\neq%200)\\[8pt]%20\log{(Y_i)}%20%20&\text{%20}%20(\lambda%20=%200)%20\end{cases}

on the variable of interest. Then assume that

http://latex.codecogs.com/gif.latex?Y_i^{(\lambda)}=\beta_0+\beta_1%20X_i+\varepsilon_i

As mentioned in Chapter 14 of Davidson & MacKinnon (1993) - in French - the log-likelihood of this model (assuming that observations are independent, with distribution http://latex.codecogs.com/gif.latex?\varepsilon_i\sim\mathcal{N}(0,\sigma^2)) can be written

http://latex.codecogs.com/gif.latex?\log%20\mathcal{L}=-\frac{n}{2}\log(2\pi)-n\log\sigma%20\\-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\beta_1%20X_i)\right]^2+(\lambda-1)\sum_{i=1}^n\log%20Y_i

We can then use profile-likelihood techniques (see here) to derive the optimal transformation.

This can be done in R extremely simply,

> library(MASS)
> boxcox(lm(dist~speed,data=cars),lambda=seq(0,1,by=.1))

we then get the following graph,

If we look at the code of the function, it is based on the QR decomposition of the http://latex.codecogs.com/gif.latex?\boldsymbol{X}matrix (since we assume that http://latex.codecogs.com/gif.latex?\boldsymbol{X} is a full-rank matrix). More precisely, http://latex.codecogs.com/gif.latex?\boldsymbol{X}=QR where http://latex.codecogs.com/gif.latex?\boldsymbol{X} is a http://latex.codecogs.com/gif.latex?n\times%202 matrix, http://latex.codecogs.com/gif.latex?Q is a http://latex.codecogs.com/gif.latex?n\times%202 orthonornal matrix, and http://latex.codecogs.com/gif.latex?R is a http://latex.codecogs.com/gif.latex?2\times2 upper triangle matrix. It might be convenient to use this matrix since, for instance, http://latex.codecogs.com/gif.latex?R\widehat{\boldsymbol{\beta}}=Q%27Y.  Thus, we do have an upper triangle system of equations.

> X=lm(dist~speed,data=cars)$qr

The code used to get the previous graph is (more or less) the following,

> g=function(x,lambda){
+ y=NA
+ if(lambda!=0){y=(x^lambda-1)/lambda}
+ if(lambda==0){y=log(x)}
+ return(y)} 
> n=nrow(cars)
> X=lm(dist~speed,data=cars)$qr
> Y=cars$dist
> logv=function(lambda){
+ -n/2*log(sum(qr.resid(X, g(Y,lambda)/
+ exp(mean(log(Y)))^(lambda-1))^2))}
> L=seq(0,1,by=.05)
> LV=Vectorize(logv)(L)
> points(L,LV,pch=19,cex=.85,col="red")

As we can see (with those red dots) we can reproduce the R graph. But it might not be consistent with other techniques (and functions described above). For instance, we can plot the profile likelihood function, http://latex.codecogs.com/gif.latex?\lambda\mapsto\log\mathcal{L}

> logv=function(lambda){
+ s=summary(lm(g(dist,lambda)~speed,
+ data=cars))$sigma
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ -n/2*log(2 * pi)-n*log(s)-.5/s^2*(sum(e^2))+
+ (lambda-1)*sum(log(Y))
+ }
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> (maxf=optimize(logv,0:1,maximum=TRUE))
$maximum
[1] 0.430591
 
$objective
[1] -197.6966
 
> abline(v=maxf$maximum,lty=2)

The good point is that the optimal value of http://latex.codecogs.com/gif.latex?\lambda is the same as the one we got before. The only problem is that the http://latex.codecogs.com/gif.latex?y-axis has a different scale. And using profile likelihood techniques to derive a confidence interval will give us different results (with a larger confidence interval than the one given by the standard function),

> ic=maxf$objective-qchisq(.95,1)
> #install.packages("rootSolve")
> library(rootSolve)
> f=function(x)(logv(x)-ic)
> (lower=uniroot(f, c(0,maxf$maximum))$root)
[1] 0.1383507
> (upper=uniroot(f, c(maxf$maximum,1))$root)
[1] 0.780573
> segments(lower,ic,upper,ic,lwd=2,col="red")

Actually, it possible to rewrite the log-likelihood as

http://latex.codecogs.com/gif.latex?\mathcal{L}=\star-\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\beta_1%20X_i)}{\dot{Y}^\lambda}\right)^2\right]

(let us just get rid of the constant), where

http://latex.codecogs.com/gif.latex?\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n%20\log%20Y_i\right]

Here, it becomes

> logv=function(lambda){
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ elY=(exp(mean(log(Y))))
+ -n/2*log(sum((e/elY^lambda)^2))
+ }
>
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> optimize(logv,0:1,maximum=TRUE)
$maximum
[1] 0.430591
 
$objective
[1] -47.73436

with again the same optimal value for http://latex.codecogs.com/gif.latex?\lambda, and the same confidence interval, since the function is the same, up to some additive constant.

So we have been able to derive the optimal transformation according to Box-Cox transformation, but so far, the confidence interval is not the same (it might come from the fact that here we substituted an estimator to the unknown parameter http://latex.codecogs.com/gif.latex?\sigma.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , ,

Comments are closed.