Model selection, AIC and Tweedie regression

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Just some simple codes to illustrate some points we will discuss this week, for the last course on GLMs, before the final exam.  We have mentioned that the Gamma distribution belongs to the exponential, so we can run a regression, and compute the associated AIC,

> set.seed(123)
> test.data = rgamma(n=2000, scale=1, shape=1)
> m1 = glm( test.data~1, family=Gamma(link=log))
> AIC(m1)
[1] 3997.332

The Gamma distribution is also a special case of the Tweedie distribution, with power 2

> library(statmod)
> library(tweedie)
> m2 = glm( test.data~1, family=tweedie(link.power=0, var.power=2) )
> AIC(m2)
[1] NA

Unfortunately, we cannot compute the AIC, and we need a trick (with the appropriate R function).

> AICtweedie(m2)
[1] 3997.332

Of course, we can do the same with the Poisson distribution, which also belongs to the exponential family

> test.data = rpois(n=2000, lambda=1)
> m3 = glm( test.data~1, family=poisson(link=log))
> m4 = glm( test.data~1, family=tweedie(link.power=0, var.power=1) )
> AIC(m3)
[1] 5124.61

Here, we have a problem with the AICtweedie function

> AICtweedie(m4)
[1] Inf

because we need to specify the dispersion parameter

> AICtweedie(m4, dispersion=1)
[1] 5124.61

We can now check: we generate some Gamma sample, and fit various Tweedie distribution, changing simply the variance function (which is a power function)

> set.seed(123)
> test.data = rgamma(n=2000, scale=1, shape=1)
> glmtw = function(t){
+ m1 = glm( test.data~1, family=tweedie(link.power=0, var.power=t) )
+ d = NULL
+ if(t == 1) d = 1
+ AICtweedie(m1, dispersion = d)
+ 
+ }
> vt = seq(1,2.7,length=100)
> vg = Vectorize(glmtw)(vt)
> plot(vt,vg,log="y",type="l")

The minimum of the AIC is close to 2, corresponding to the Gamma distribution

We can also try with a Poisson

> set.seed(123)
> test.data = rpois(n=2000, lambda=1)
> glmtw = function(t){
+ m1 = glm( test.data~1, family=tweedie(link.power=0, var.power=t) )
+ d = NULL
+ if(t == 1) d = 1
+ AICtweedie(m1, dispersion = d)
+ 
+ }
> vt = seq(1,2,length=100)
> vg = Vectorize(glmtw)(vt)
> plot(vt,vg,log="y",type="l")

The minimum is now close to 1, corresponding to the Poisson distriubtion (the variance is equal to the average)

Let us now try some compound Poisson distribution,

> rcpd=function(n,lambda,shape,scale){
+ N=rpois(n,lambda)
+ X=rgamma(sum(N),shape=shape, scale=scale)
+ I=as.factor(rep(1:n,N))
+ S=tapply(X,I,sum)
+ V=as.numeric(S[as.character(1:n)])
+ V[is.na(V)]=0
+ return(V)}

Let us generate some compound Poisson random variables, with Poisson distribution with average 1, and with Gamma jumps, with average and variance 1,

> set.seed(123)
> test.data = rcpd(n=2000, 1,1,1)
> glmtw = function(t){
+ m1 = glm( test.data~1, family=tweedie(link.power=0, var.power=t) )
+ d = NULL
+ if(t == 1) d = 1
+ AICtweedie(m1, dispersion = d)
+ }
> vt = seq(1.1,1.9,length=100)
> vg = Vectorize(glmtw)(vt)
> plot(vt,vg,log="y",type="l")

The optimal value for the power function is here 1.5, based on the AIC (relationships between Tweedie parameters and the compound Poisson ones are given in the slides)

We can now play a little bit with the variance of the jumps: they still have aveage 1, but they now have a smaller variance

> set.seed(123)
> test.data = rcpd(n=2000, 1,3,1/3)
> vt = seq(1.05,1.95,length=100)
> vg = Vectorize(glmtw)(vt)
> plot(vt,vg,log="y",type="l")

The optimal power for the Tweedie is closer to one, closer to the Poison case

while if we increase the variance of the jumps

> set.seed(123)
> test.data = rcpd(n=2000, 1,1/3,3)
> vt = seq(1.05,1.95,length=100)
> vg = Vectorize(glmtw)(vt)
> plot(vt,vg,log="y",type="l")

the optimal power is higher, closer to the Gamma distribution.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)