# On Box-Cox transform in regression models

November 13, 2012
By

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

A few days ago, a former student of mine, David, contacted me about Box-Cox tests in linear models. It made me look more carefully at the test, and I do not understand what is computed, to be honest. Let us start with something simple, like a linear simple regression, i.e.

$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$

Let us introduced - as suggested in Box & Cox (1964) - the following family of (power) transformations

$Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^\lambda-1}{\lambda} &\text{ } (\lambda \neq 0)\\[8pt] \log{(Y_i)} &\text{ } (\lambda = 0) \end{cases}$

on the variable of interest. Then assume that

$Y_i^{(\lambda)}=\beta_0+\beta_1 X_i+\varepsilon_i$

As mentioned in Chapter 14 of Davidson & MacKinnon (1993) - in French - the log-likelihood of this model (assuming that observations are independent, with distribution $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$) can be written

$\log \mathcal{L}=-\frac{n}{2}\log(2\pi)-n\log\sigma \\-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)\right]^2+(\lambda-1)\sum_{i=1}^n\log Y_i$

We can then use profile-likelihood techniques (see here) to derive the optimal transformation.

This can be done in R extremely simply,

> library(MASS)
> boxcox(lm(dist~speed,data=cars),lambda=seq(0,1,by=.1))

we then get the following graph,

If we look at the code of the function, it is based on the QR decomposition of the $\boldsymbol{X}$matrix (since we assume that $\boldsymbol{X}$ is a full-rank matrix). More precisely, $\boldsymbol{X}=QR$ where $\boldsymbol{X}$ is a $n\times 2$ matrix, $Q$ is a $n\times 2$ orthonornal matrix, and $R$ is a $2\times2$ upper triangle matrix. It might be convenient to use this matrix since, for instance, $R\widehat{\boldsymbol{\beta}}=Q'Y$.  Thus, we do have an upper triangle system of equations.

> X=lm(dist~speed,data=cars)$qr The code used to get the previous graph is (more or less) the following, > g=function(x,lambda){ + y=NA + if(lambda!=0){y=(x^lambda-1)/lambda} + if(lambda==0){y=log(x)} + return(y)} > n=nrow(cars) > X=lm(dist~speed,data=cars)$qr
> Y=cars$dist > logv=function(lambda){ + -n/2*log(sum(qr.resid(X, g(Y,lambda)/ + exp(mean(log(Y)))^(lambda-1))^2))} > L=seq(0,1,by=.05) > LV=Vectorize(logv)(L) > points(L,LV,pch=19,cex=.85,col="red") As we can see (with those red dots) we can reproduce the R graph. But it might not be consistent with other techniques (and functions described above). For instance, we can plot the profile likelihood function, $\lambda\mapsto\log\mathcal{L}$ > logv=function(lambda){ + s=summary(lm(g(dist,lambda)~speed, + data=cars))$sigma
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals + -n/2*log(2 * pi)-n*log(s)-.5/s^2*(sum(e^2))+ + (lambda-1)*sum(log(Y)) + } > L=seq(0,1,by=.01) > LV=Vectorize(logv)(L) > plot(L,LV,type="l",ylab="") > (maxf=optimize(logv,0:1,maximum=TRUE))$maximum
[1] 0.430591

$objective [1] -197.6966 > abline(v=maxf$maximum,lty=2)

The good point is that the optimal value of $\lambda$ is the same as the one we got before. The only problem is that the $y$-axis has a different scale. And using profile likelihood techniques to derive a confidence interval will give us different results (with a larger confidence interval than the one given by the standard function),

> ic=maxf$objective-qchisq(.95,1) > #install.packages("rootSolve") > library(rootSolve) > f=function(x)(logv(x)-ic) > (lower=uniroot(f, c(0,maxf$maximum))$root) [1] 0.1383507 > (upper=uniroot(f, c(maxf$maximum,1))$root) [1] 0.780573 > segments(lower,ic,upper,ic,lwd=2,col="red") Actually, it possible to rewrite the log-likelihood as $\mathcal{L}=\star-\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)}{\dot{Y}^\lambda}\right)^2\right]$ (let us just get rid of the constant), where $\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]$ Here, it becomes > logv=function(lambda){ + e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ elY=(exp(mean(log(Y))))
+ -n/2*log(sum((e/elY^lambda)^2))
+ }
>
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> optimize(logv,0:1,maximum=TRUE)
$maximum [1] 0.430591$objective
[1] -47.73436

with again the same optimal value for $\lambda$, and the same confidence interval, since the function is the same, up to some additive constant.

So we have been able to derive the optimal transformation according to Box-Cox transformation, but so far, the confidence interval is not the same (it might come from the fact that here we substituted an estimator to the unknown parameter $\sigma$.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...