
Natura non facit saltus


(see John Wilkins’ article on the – interesting – history of that phrase http://scienceblogs.com/evolvingthoughts/…). This week in class, we will see several smoothing techniques for insurance ratemaking. As a starting point, assume that we do not want to use segmentation techniques: everyone will pay exactly the same price.

And that price should be related to the pure premium, which is proportional to the frequency (or the annualized frequency, as discussed previously), since, for a total claim amount $S=\sum_{i=1}^N Y_i$ (with claim counts independent of claim sizes),

$$\mathbb{E}_{\mathbb{P}}(S)=\mathbb{E}_{\mathbb{P}}(N)\cdot\mathbb{E}_{\mathbb{P}}(Y_i)$$

The probability measure $\mathbb{P}$ is mentioned here just to recall that we can use any measure, even $\mathbb{P}_{\boldsymbol{X}}$ (based on some covariates $\boldsymbol{X}$). Without any covariate, the expected frequency should be

$$\widehat{\lambda}=\frac{\sum_i N_i}{\sum_i E_i}$$

where $N_i$ is the number of claims and $E_i$ the exposure of policy $i$.
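Note, as a side remark, that this maximum-likelihood estimate is simply the total claim count divided by the total exposure; a one-line sanity check on the dataset used throughout this post (sinistres, with claim counts nbre and exposures exposition),

> sum(sinistres$nbre)/sum(sinistres$exposition)

should return the same 7.28% as the exp(coefficients(regglm0)) computed below.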

> regglm0=glm(nbre~1+offset(log(exposition)),data=sinistres,family=poisson)
> summary(regglm0)

Call:
glm(formula = nbre ~ 1 + offset(log(exposition)), family = poisson, 
    data = sinistres)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5033  -0.3719  -0.2588  -0.1376  13.2700  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6
> exp(coefficients(regglm0))
(Intercept) 
 0.07279295

Thus, if we do not want to take into account potential heterogeneity, we should assume that $N\sim\mathcal{P}(\lambda)$, where $\lambda$ is close to 7.28%. Yes, as mentioned in class, it is rather common to see $\lambda$ as a percentage, i.e. a probability, since

$$\mathbb{P}(N\neq 0)=1-e^{-\lambda}\approx\lambda$$

i.e. $\lambda$ can be interpreted as the probability of having (at least) one claim (see also the law of small numbers).
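To get a sense of the quality of that approximation, a one-line check,

> lambda=0.0728
> 1-exp(-lambda)
[1] 0.07021323

so the probability of having at least one claim is about 7.02%, indeed close to $\lambda=7.28\%$. Let us now visualize this expected frequency as a function of the age of the driver,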

> a=18:100
> yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> k=23
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

We do predict the same frequency for all drivers, e.g. for a driver aged 40,

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07279295  confidence interval 0.07611196 0.06947393
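Note that this interval is computed on the response scale, as the prediction plus or minus two standard errors. An alternative (a sketch, which guarantees positive bounds by construction) is to build the interval on the linear-predictor scale, and then exponentiate,

> yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="link",se.fit=TRUE)
> exp(yp$fit[k]+c(-2,2)*yp$se.fit[k])

Both versions should be extremely close here, since the frequency is small.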

Let us now consider the case where we try to take into account heterogeneity, e.g. by age.

The idea of the (log-)Poisson regression is to assume that instead of having $N\sim\mathcal{P}(\lambda)$, we should have $N|\boldsymbol{X}\sim\mathcal{P}(\lambda_{\boldsymbol{X}})$, where

$$\log\lambda_{\boldsymbol{X}}=\beta_0+\beta_1X_1+\cdots+\beta_kX_k$$

in a very general setting. Here, let us consider only one explanatory variable, the age of the driver, i.e.

$$\log\lambda_X=\beta_0+\beta_1X$$

Here, we have

> regglm1=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson)
> yp=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

i.e. the prediction of the annualized claim frequency for our 40 year old driver is now 7.74% (slightly higher than the 7.28% we had before),

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07740574  confidence interval 0.08117512 0.07363636

It is possible to compute not the expected frequency $\widehat{\lambda}_X$, but the ratio $\widehat{\lambda}_X/\widehat{\lambda}$.
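The graph of that ratio, discussed below, can be obtained with something like (a sketch, reusing the two models fitted above),

> ypflat=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response")
> ypage=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response")
> plot(a,ypage/ypflat,type="l")
> abline(h=1,col="blue")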

Above the horizontal blue line, the premium will be higher than the one obtained without segmentation, and (of course) lower below. Here, drivers younger than 44 will pay more, while drivers older than 44 will pay less. We discussed, in the introduction, the necessity of segmentation. If we consider two companies, one segmenting while the other one has a flat rate, then older drivers will go to the first company (since insurance is cheaper there) while younger ones will go to the second one (again, because it is cheaper). The problem is that the second company implicitly hopes that older drivers will compensate for the risk. But since they're gone, insurance will be too cheap, and the company will lose money (if not go bankrupt). So companies have to use segmentation techniques to survive. Now, the problem is that we cannot be sure that this exponential decay of the premium is the proper way the premium should evolve as a function of age. An alternative can be to use nonparametric techniques to visualize the true influence of age on claims frequency.

A first model can be to consider a premium per age. This can be done by considering the age of the driver as a factor in the regression. Note that predictions are then only possible for ages actually observed in the portfolio, so let a0 denote those observed ages,

> a0=sort(unique(sinistres$ageconducteur))
> regglm2=glm(nbre~as.factor(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> yp=predict(regglm2,newdata=data.frame(ageconducteur=a0,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")

Here, the forecast for our 40 year old driver is slightly lower than the previous one, but the confidence interval is much larger (since we focus on a very small subclass of the portfolio: drivers aged exactly 40),

Frequency = 0.06686658  confidence interval 0.08750205 0.0462311

Here, we consider classes that are too small, and the premium is too erratic: the premium decreases by (roughly) 20% from age 40 to 41, and then increases by 50% from age 41 to 42,

> diff(log(yp0[23:25]))
        24         25 
-0.2330241  0.5223478
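Since those are differences of logarithms, the actual relative changes are obtained by exponentiating,

> exp(diff(log(yp0[23:25])))-1

i.e. a drop of about 21%, followed by a jump of almost 69% (raw log-differences understate large increases).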

There is no chance that the company will keep its insureds with such a strategy. This discontinuity of the premium is clearly an important issue here.

An alternative can be to consider age classes, from very young drivers to senior drivers.

> level1=seq(15,105,by=5)
> regglmc1=glm(nbre~cut(ageconducteur,level1)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *  
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *  
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379    
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> yp=predict(regglmc1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,ylim=c(.03,.12),type="s")
> abline(v=40,col="grey")
> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")

Here we obtain the following predictions,

and for our 40 year old driver, the frequency is now 6.84%.

Frequency = 0.0684573  confidence interval 0.07766717 0.05924742

But our classes were defined arbitrarily here. Perhaps we should consider other classes, to see whether the prediction is sensitive to the cut points,

> level2=level1-2
> regglmc2=glm(nbre~cut(ageconducteur,level2)+offset(log(exposition)),
+ data=sinistres,family=poisson)

which yields the following values for our 40 year old driver,

Frequency = 0.07050614  confidence interval 0.07980422 0.06120807

So here, we did not remove the discontinuity problem. An idea can be to consider a moving window: if the goal is to predict the frequency for a 40 year old driver, the class should be (somehow) centered on 40. And centered on 35 for drivers aged 35. Etc.

Thus, it is natural to consider some local regressions, where only drivers aged almost 40 are considered. This notion of almost is related to the bandwidth: for instance, drivers between 35 and 45 can be considered as being almost 40. In practice, we can either use the subset argument of glm, or use weights in the regression,

> value=40
> h=5
> sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
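The subset version mentioned above would be something like (a sketch, fitting the same local model on the restricted portfolio only),

> regglmsub=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,subset=abs(ageconducteur-value)<=h)

With 0/1 weights, observations with zero weight are simply dropped from the likelihood, so both approaches give the same fitted coefficients here.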

To see what’s going on, let us consider an animated plot, where the age of interest is changing,
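Since the animation cannot be embedded here, here is a sketch of the underlying loop: each target age gets its own locally fitted model, and we only keep the prediction at that age (provided each window contains enough drivers; otherwise the age grid should be restricted),

> ypl=rep(NA,length(a))
> for(i in 1:length(a)){
+ value=a[i]
+ sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1
+ regglml=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
+ ypl[i]=predict(regglml,newdata=data.frame(ageconducteur=value,exposition=1),
+ type="response")
+ }
> lines(a,ypl,col="red")

Replacing the indicator weights with dnorm(abs(sinistres$ageconducteur-value)/h) inside the loop gives the kernel-weighted version discussed below.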

Here, for our 40 year old driver, we get

Frequency = 0.06913391  confidence interval 0.07535564 0.06291218

We do obtain a curve that can be interpreted as a local regression. But here, we do not take into account that 35 is not as close to 40 as 39 is. And here, 34 is assumed to be very far away from 40. Clearly, we could improve that technique: kernel functions can be considered, i.e. the closer to 40, the larger the weight.

> value=40
> h=5
> sinistres$omega=dnorm(abs(sinistres$ageconducteur-value)/h)
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)

which can be plotted below

Here, our prediction for our 40 year old driver is

Frequency = 0.07040464  confidence interval 0.07981521 0.06099408

This is the idea of kernel regression techniques. But, as explained in the slides, other nonparametric techniques can be considered, like spline functions.

In R, it is simple to use spline functions (somehow much simpler than kernel smoothers),

> library(splines)
> regglmbs=glm(nbre~bs(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
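The fitted curve can be visualized with exactly the same predict pattern as before (a sketch),

> yp=predict(regglmbs,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> plot(a,yp$fit,type="l",ylim=c(.03,.12))
> lines(a,yp$fit+2*yp$se.fit,lty=2)
> lines(a,yp$fit-2*yp$se.fit,lty=2)

Note that, by default, bs() generates a cubic basis with no interior knot; more flexibility can be obtained by specifying df or knots.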

The prediction for our 40 year old driver is now

Frequency = 0.06928169  confidence interval 0.07397124 0.06459215

Note that this technique is related to another class of models, the so-called Generalized Additive Models, i.e. GAMs.

> library(mgcv)
> reggam=gam(nbre~s(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
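Here s() lets mgcv choose the smoothness of the age effect automatically, through penalized likelihood. The fitted curve can be superimposed on the previous one with the same pattern (a sketch),

> ypgam=predict(reggam,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> lines(a,ypgam$fit,col="blue")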

The prediction is extremely close to the one we obtained above (the main differences being observed for very old drivers)

Frequency = 0.06912683  confidence interval 0.07501663 0.06323702

One way or another, all those models are valid. So perhaps we should compare them,

On the graph above, we can visualize the upper and lower bounds of the predictions for the 9 models. The horizontal line is the predicted value without taking heterogeneity into account. It is possible to consider relative values, with respect to this value,

Arthur Charpentier, professor of Actuarial Science in Montréal. Former assistant professor at ENSAE Paristech, associate professor at École Polytechnique, and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.
