# I Fought the (distribution) Law (and the Law did not win)

April 27, 2015
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

A few days ago, I was asked if we should spend a lot of time to choose the distribution we use, in GLMs, for (actuarial) ratemaking. On that topic, I usually claim that the family is not the most important parameter in the regression model. Consider the following dataset

```> db <- data.frame(x=c(1,2,3,4,5),y=c(1,2,4,2,6))
> plot(db,xlim=c(0,6),ylim=c(-1,8),pch=19)```

To visualize a regression model, use the following code

```> nd=data.frame(x=seq(0,6,by=.1))
+ prd1=predict(reg,newdata=nd,se.fit = TRUE,type="response")
+ y1=prd1\$fit
+ y1_upp=prd1\$fit+prd1\$residual.scale*1.96*
prd1\$se.fit
+ y1_low=prd1\$fit-prd1\$residual.scale*1.96*
prd1\$se.fit
+ polygon(c(nd\$x,rev(nd\$x)),c(y1_upp,
rev(y1_low)),col="light green",angle=90,
density=40,border=NA)
+ lines(nd\$x,y1,col="red",lwd=2)
+ }```

For instance, with a Poisson regression (with a log link function) we get

```> plot(db)
+ data=db)
```

while, with a Gaussian regresion (but still with a log link function), we get

```> plot(db)
+ data=db)

If we just care about the expected value of our prediction, the output is more or less the same

```> plot(db)
> lines(nd\$x,predict(reg1,newdata=nd,
+ type="response"),col="red",lwd=1.5)
> lines(nd\$x,predict(reg2,newdata=nd,
+ type="response"),col="blue",lwd=1.5)```

So, indeed, forget about the (distribution) law when running a GLM. Not convinced? Consider – on the same dataset – a Poisson regression (with an identity link function this time)

```> plot(db)
+ data=db)
```

while, with a Gaussian regresion (but still with an identity link function), we get

```> plot(db)
+ data=db)

Again, if we just plot the expected value of our prediction, the output is more or less the same

```> plot(db)
> lines(nd\$x,predict(reg1,newdata=nd,
+ type="response"),col="red",lwd=1.5)
> lines(nd\$x,predict(reg2,newdata=nd,
+ type="response"),col="blue",lwd=1.5)```

So clearly, the simplistic message you should not care too much about the (distribution) law seems to be valid…

But if we compare those graphs (and the confidence intervals) it looks like the regression can be sensitive to outliers. Or not.

To visualize the impact of a change in the value of some observations, use

```> change_pred_poisson=function(x){
+   prd=function(y){
+   db_rev=db
+   db_rev\$y[3]=y
+ data=db_rev)
+   prd1=predict(reg1,newdata=data.frame(x=x),
+ se.fit = TRUE,type="response")
+   y1=prd1\$fit
+   y1_upp=prd1\$fit+prd1\$residual.scale*
+ 1.96*prd1\$se.fit
+   y1_low=prd1\$fit-prd1\$residual.scale*1.96*
+ prd1\$se.fit
+   c(y1,y1_low,y1_upp)
+   }
+   vx=seq(1,8,by=.1)
+   s=sapply(vx,prd)
+   plot(vx,vx,col="white",xlab="",ylab="")
+   polygon(c(vx,rev(vx)),c(s[2,],rev(s[3,])),
+ col="light green",border=NA)
+   lines(vx,s[1,],col="red",lwd=2)
+   invisible(s)
+ }
```

Here we change the third observation. Its -value was 4. Now we change it, from 1 up to 8.  And we look at the predicted value, with a Poisson regression, for the same value, the third one

`> s_poisson=change_pred_poisson(3)`

Now, we can write a similar code,  with a Gaussian regression

`> s_gaussian=change_pred_gaussian(3)`

Even with a Gamma regression

`> s_gamma=change_pred_gamma(3)`

If we compare now the three models, we get

```> plot(seq(1,8,by=.1),s_gauss[1,],
+ type="l",lwd=2,xlab="",ylab="")
> lines(seq(1,8,by=.1),s_poisson[1,],
+ col="red",lwd=2)
> lines(seq(1,8,by=.1),s_gamma[1,],col=
+ "blue",lwd=2)
> legend("topleft",c("Gaussian","Poisson",
+ "Gamma"),col=c("black","red","blue"),lty=1)```

So, the (distribution) law has an impact on our prediction. When -value is close to 4 (the original value), the three models are equivalent, but not if we have a very small value (e.g. close to 1), where the Gaussian model is far away, compared with the other two.

Nevertheless, I still have the feeling that the choice of the distribution is not the most important problem. I used to say that the choice of the link function is clearly a more important problem. Especally in actuarial ratemaking

But I wonder if actually it is really such a big deal. If we use splines, and nonparametric transformations. Consider a Poisson regression, with some identity link function, and some spline smoothing function

```> library(splines)
> plot(db)
+ "identity"),data=db)
Warning message:
In bs(x, degree = 3L, knots = numeric(0), Boundary.knots = c(1,  :
some 'x' values beyond boundary knots may cause ill-conditioned bases```

(since we have only 6 points, it is a bit stupid to consider such a complex model, but here, it is just a toy dataset, to illustrate)

With a log link function (and still some spline smoothing)

```> plot(db)
+ "log"),data=db)
Warning message:
In bs(x, degree = 3L, knots = numeric(0), Boundary.knots = c(1,  :
some 'x' values beyond boundary knots may cause ill-conditioned bases
```

The impact on the prediction is clearly smaller than what we had, without any smoothing parameter

Again, if you’re not convinced, try to generate a larger dataset. For instance, generate a Poisson regression, with a log link.

```> set.seed(1)
> x=runif(1000,0,6)
> L=exp(-0.07357+0.35076*x)
> y=rpois(1000,L)
> db=data.frame(x,y)
> plot(db)```

And fit a Gaussian regression, with an identity link, but with some spline smoothing function,

```> reg1=glm(y~x,family=poisson(link="log"),
+ data=db)
> y1=predict(reg1,newdata=nd,type="response")
> lines(nd\$x,y1,col="red",lwd=2)
+ "identity"),data=db)
> y2=predict(reg2,newdata=nd,type="response")
> lines(nd\$x,y2,col="blue",lwd=2)```

So it looks like neither the (distribution) law, nor the link function, have an impact on a prediction. Sad, isn’t it?

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...