
When discussing transformations in regression models, I usually briefly introduce the Box-Cox transform (see e.g. an old post on that topic) and I also mention local regressions and nonparametric estimators (see e.g. another post). But while I was working on my ACT6420 course (on predictive modeling, which is a VEE for the SOA), I read something about a “Ladder of Powers Rule”, also called “Tukey and Mosteller’s Bulging Rule”. To be honest, I had never heard of this rule before. But it’s not the first time I’ve learned something while working on my notes for a course!

The point here is that, in a standard linear regression model, we have

$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$

But sometimes, a linear relationship is not appropriate. One idea can be to transform the variable we would like to model, $Y$, and to consider

$\varphi(Y_i)=\beta_0+\beta_1 X_i+\varepsilon_i$

This is what we usually do with the Box-Cox transform. Another idea can be to transform the explanatory variable, $X$, and now, consider,

$Y_i=\beta_0+\beta_1 \psi(X_i)+\varepsilon_i$

For instance, this year in the course, we considered – at some point – a continuous piecewise linear function,

$Y_i=\beta_0+\beta_1 X_i+\beta_{2,1} (X_i-s_1)_++\beta_{2,2} (X_i-s_2)_++\beta_{2,3} (X_i-s_3)_++\varepsilon_i$
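Such a model is easy to fit with `lm()`, using positive-part terms. Here is a minimal sketch on the `cars` dataset, with hypothetical knots at 10, 15 and 20 (the knots in the course were on a different dataset),

```> reg=lm(dist~speed+pmax(speed-10,0)+pmax(speed-15,0)+pmax(speed-20,0),data=cars)
> plot(cars$speed,cars$dist,xlab="speed",ylab="dist")
> u=seq(4,25,length=251)
> lines(u,predict(reg,newdata=data.frame(speed=u)),col="blue")```

Each `pmax(speed-s,0)` term is the $(X_i-s)_+$ part of the formula above, so the fitted line is continuous with a slope change at each knot.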

It is also possible to consider some polynomial regression. The “Tukey and Mosteller’s Bulging Rule” is based on the following figure.

and the idea is that it might be interesting to transform $X$ and $Y$ at the same time, using some power functions. To be more specific, we will consider some linear model

$Y_i^{q}=\beta_0+\beta_1 X_i^{p}+\varepsilon_i$

for some (positive) parameters $p$ and $q$. Depending on the shape of the regression function (the four curves mentioned on the graph above, one per quadrant), different powers will be considered.

To be more specific, let us generate samples from different models, and look at the associated scatterplots,

```> fakedataMT=function(p=1,q=1,n=99,s=.1){
+ set.seed(1)
+ X=seq(1/(n+1),1-1/(n+1),length=n)
+ Y=(5+2*X^p+rnorm(n,sd=s))^(1/q)
+ return(data.frame(x=X,y=Y))}
> par(mfrow=c(2,2))
> plot(fakedataMT(p=.5,q=2),main="(p=1/2,q=2)")
> plot(fakedataMT(p=3,q=-5),main="(p=3,q=-5)")
> plot(fakedataMT(p=.5,q=-1),main="(p=1/2,q=-1)")
> plot(fakedataMT(p=3,q=5),main="(p=3,q=5)")```

If we consider the South-West part of the graph, to get such a pattern, we can consider

$Y_i^{1/2}=\beta_0+\beta_1 X_i^2+\varepsilon_i$

or more generally

$Y_i^{1/a}=\beta_0+\beta_1 X_i^b+\varepsilon_i$

where $a$ and $b$ are both larger than 1. The larger $a$ and/or $b$, the more convex the regression curve.
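As a quick sketch, using the `fakedataMT()` function defined above, we can check that regressing $Y^{1/2}$ on $X^2$ linearizes such a cloud (with $p=2$ and $q=1/2$ in the generator, $Y^{1/2}$ is exactly linear in $X^2$, up to the noise),

```> base=fakedataMT(p=2,q=1/2)
> reg=lm(I(y^(1/2))~I(x^2),data=base)
> summary(reg)$r.squared```

The $R^2$ should be very close to one here, since the double transformation recovers the linear structure.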

Let us visualize that double transformation on a dataset, say the cars dataset.

```> base=cars
> names(base)=c("x","y")
> MostellerTukey=function(p=1,q=1){
+ regpq=lm(I(y^q)~I(x^p),data=base)
+ u=seq(min(min(base$x)-2,.1),max(base$x)+2,length=501)
+ par(mfrow=c(1,2))
+ plot(base$x,base$y,xlab="X",ylab="Y",col="white")
+ vic=predict(regpq,newdata=data.frame(x=u),interval="prediction")
+ vic[vic<=0]=.1
+ polygon(c(u,rev(u)),c(vic[,2],rev(vic[,3]))^(1/q),col="light blue",density=40,border=NA)
+ lines(u,vic[,2]^(1/q),col="blue")
+ lines(u,vic[,3]^(1/q),col="blue")
+ v=predict(regpq,newdata=data.frame(x=u))^(1/q)
+ lines(u,v,col="blue")
+ points(base$x,base$y)
+
+ plot(base$x^p,base$y^q,xlab=paste("X^",p,sep=""),ylab=paste("Y^",q,sep=""),col="white")
+ polygon(c(u,rev(u))^p,c(vic[,2],rev(vic[,3])),col="light blue",density=40,border=NA)
+ lines(u^p,vic[,2],col="blue")
+ lines(u^p,vic[,3],col="blue")
+ abline(regpq,col="blue")
+ points(base$x^p,base$y^q)
+ }```

For instance, if we call

`> MostellerTukey(2,1)`

we get the following graph,

On the left, we have the original dataset, $\{(X_i,Y_i)\}$, and on the right, the transformed one, $\{(X_i^{p},Y_i^{q})\}$, for one possible pair of transformations (here, we only considered the square of the speed of the car, so only one component was transformed). On that transformed dataset, we run a standard linear regression, and we add a confidence tube. Then we apply the inverse transformation to the prediction, and this line is plotted on the left. The problem is that it should not be considered as our optimal prediction, since it is clearly biased: $[\mathbb{E}(Y^{q})]^{1/q}\neq\mathbb{E}(Y)$. But quantiles are preserved under monotone transformations: the quantiles of the transformed variable are the transformed quantiles. So confidence tubes can still be considered as confidence tubes.
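The bias is easy to see numerically. A small sketch with a lognormal sample and $q=2$ (this is just Jensen’s inequality at work),

```> set.seed(1)
> y=rlnorm(1e5)
> mean(y)
> mean(y^2)^(1/2)```

The second quantity is strictly larger than the first (for a standard lognormal, $\mathbb{E}(Y)=e^{1/2}$ while $[\mathbb{E}(Y^2)]^{1/2}=e$), so back-transforming the fitted mean overestimates the conditional expectation.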

Note that here, it would have been possible to consider another transformation, with the same shape, but quite different,

`> MostellerTukey(1,.5)`

Of course, there is no reason to consider a simple power function, and the Box-Cox transform can also be used. The interesting point is that the logarithm can be obtained as a particular case. Furthermore, it is also possible to seek optimal transformations, seen here as a pair of parameters. Consider

```> library(MASS)
> p=.1
> bc=boxcox(y~I(x^p),data=base,lambda=seq(.1,3,by=.1))$y
> for(p in seq(.2,3,by=.1)) bc=cbind(bc,boxcox(y~I(x^p),data=base,lambda=seq(.1,3,by=.1))$y)
> vp=boxcox(y~I(x^p),data=base,lambda=seq(.1,3,by=.1))$x
> vq=seq(.1,3,by=.1)
> library(RColorBrewer)
> blues=colorRampPalette(brewer.pal(9,"Blues"))(100)
> image(vp,vq,bc,col=blues)```

The darker, the better (here the log-likelihood is considered). The optimal pair is here

```> bc=function(a){p=a[1];q=a[2]; as.numeric(-boxcox(y~I(x^p),data=base,lambda=q)$y[50])}
> optim(c(1,1), bc,method="L-BFGS-B",lower=c(0,0),upper=c(3,3))
$par
[1] 0.5758362 0.3541601

$value
[1] 47.27395```

and indeed, the model we get is not bad,
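(The call producing that last graph is not shown here, but presumably it is just the function above evaluated at the optimal pair, something like)

```> MostellerTukey(.58,.35)```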

Fun, isn’t it?