Large claims, and ratemaking

Posted on February 13, 2013 by Arthur Charpentier

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers].

During the course, we have seen that it is natural to assume that not only the individual claim frequency, but also the individual cost, can be explained by some covariates. Of course, appropriate families should be considered to model the distribution of the cost $Y$, given some covariates $\boldsymbol{X}$. Here is the dataset we’ll use,

>  sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+  header=TRUE,sep=";")
>  sinistres=sinistre[sinistre$garantie=="1RC",]
>  sinistres=sinistres[sinistres$cout>0,]
>  contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+  header=TRUE,sep=";")
>  couts=merge(sinistres,contrat)
> tail(couts)
     nocontrat    no garantie    cout exposition zone puissance agevehicule
1919   6104006 11933      1RC 5376.04       0.37    E         6           1
1920   6107355 12349      1RC   51.63       0.74    E         4           1
1921   6108364 13229      1RC 1320.00       0.74    B         9           1
1922   6109171 11567      1RC 1320.00       0.74    B        13           1
1923   6111208 14161      1RC  970.20       0.49    E        10           5
1924   6111650 14476      1RC 1940.40       0.48    E         4           0
     ageconducteur bonus marque carburant densite region
1919            32    57     12         E      93     10
1920            45    57     12         E      72     10
1921            32   100     12         E      83      0
1922            56    50     12         E      93     13
1923            30    90     12         E      53      2
1924            69    50     12         E      93     13

Here, each line is a claim. The usual families to model the cost are the Gamma distribution and the inverse Gaussian. Another option is the lognormal distribution: it is not in the exponential family, but one can assume that the logarithm of the cost can be modeled with a Gaussian distribution. Consider here only one covariate, e.g. the age of the car, and two different models: a Gamma one, and a lognormal one.

> age=0:20
> reggamma.sp <- glm(cout~agevehicule,family=Gamma(link="log"),
+ data=couts)
> Pgamma <- predict(reggamma.sp,newdata=data.frame(agevehicule=age),type="response")

For the Gamma regression, it is a simple GLM, so it is not difficult. For a lognormal distribution, one should remember that the expected value of a lognormal variable is not the exponential of the mean of the underlying Gaussian distribution: if $\log Y$ is Gaussian with mean $\mu$ and variance $\sigma^2$, then $\mathbb{E}(Y)=\exp(\mu+\sigma^2/2)$. A correction should therefore be made, so that the average cost is not underestimated,

> reglm.sp <- lm(log(cout)~agevehicule,data=couts)
> sigma <- summary(reglm.sp)$sigma
> mu <- predict(reglm.sp,newdata=data.frame(agevehicule=age))
> Pln <- exp(mu+sigma^2/2)
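The correction can be checked on simulated data. The sketch below is not part of the original analysis: the values of mu and sigma and the sample size are arbitrary assumptions, chosen only to show that exp(mu) falls short of the mean of a lognormal variable, while exp(mu + sigma^2/2) matches it.

```r
# Hypothetical parameters, chosen for illustration only
set.seed(1)
mu    <- 7
sigma <- 1.2
y <- rlnorm(1e6, meanlog = mu, sdlog = sigma)

naive     <- exp(mu)                # exponential of the Gaussian mean (the median of y)
corrected <- exp(mu + sigma^2 / 2)  # theoretical mean of the lognormal distribution

# The empirical mean should be close to 'corrected' and well above 'naive'
c(empirical = mean(y), naive = naive, corrected = corrected)
```

In the regression above, mu is replaced by the fitted value of the linear model and sigma by its residual standard error, so the correction is itself an estimate.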

We can plot those two predictions on a single graph,

> plot(age,Pgamma,xlab="",ylab="",col="red",type="b",pch=4)
> lines(age,Pln,col="blue",type="b")

Here it is,

Observe that it is also possible to use splines, since there might be no reason for the age to appear here in a multiplicative way,
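A minimal sketch of what the spline versions of the two regressions might look like. The data frame sim below is simulated (it is not the dataset of this post), and the shape of the age effect, the sample size and the Gamma shape parameter are assumptions made only so that the code runs on its own.

```r
library(splines)  # provides bs(), a B-spline basis

# Simulated claims, for illustration only
set.seed(1)
n <- 2000
sim <- data.frame(agevehicule = sample(0:20, n, replace = TRUE))
# hypothetical non-monotone effect of the age of the car on the average cost
mean_cost <- exp(7 + 0.3 * sin(sim$agevehicule / 4))
sim$cout <- rgamma(n, shape = 2, rate = 2 / mean_cost)

age <- 0:20
new <- data.frame(agevehicule = age)

# Gamma regression, with a spline in the age of the car
reggamma.bs <- glm(cout ~ bs(agevehicule), family = Gamma(link = "log"), data = sim)
Pgamma.bs <- predict(reggamma.bs, newdata = new, type = "response")

# Lognormal regression with the same spline, and the same correction as before
reglm.bs <- lm(log(cout) ~ bs(agevehicule), data = sim)
Pln.bs <- exp(predict(reglm.bs, newdata = new) + summary(reglm.bs)$sigma^2 / 2)

plot(age, Pgamma.bs, xlab = "", ylab = "", col = "red", type = "b", pch = 4)
lines(age, Pln.bs, col = "blue", type = "b")
```

On the actual data, the same two calls would use data=couts instead of the simulated data frame.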

Here, the two models are rather close. Nevertheless, one should remember that the Gamma model can be extremely sensitive to large claims (I mean here really large claims). With the log-transformation, on the other hand, the lognormal model seems to be less sensitive to large events. Actually, if I use the complete dataset, the regressions are the following,

i.e. with a lognormal distribution, the average cost is decreasing with the age of the car, while it is increasing with a Gamma model. The main reason here is that there is one large (not to say huge) claim in the dataset,

> couts[which.max(couts$cout),]
         cout exposition zone puissance agevehicule ageconducteur
7842  4024601       0.22    B         9          13            19
     marque carburant densite region
7842      2         E      93     24

One young driver had a 4 million dollar claim, with a 13-year-old car. This is an outlier for the Gamma regression, and it clearly influences the estimation (the second largest claim is only one third of this one). Since large claims have a clear influence on the estimation of the average cost, a natural idea might be to remove those large claims. Or perhaps to see them as different from normal claims: normal claims can be explained by some covariates, but perhaps the cost of those large claims should be shared not only within their own class, but among all the insured in the portfolio. To formalize this idea, observe that we can write

$\mathbb{E}(Y|\boldsymbol{X}) = {\color{Blue} {\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq s)}_{A} \cdot \underbrace{\mathbb{P}(Y\leq s|\boldsymbol{X})}_{B}}} + {\color{Red} {\underbrace{\mathbb{E}(Y|Y> s, \boldsymbol{X})}_{C} \cdot \underbrace{\mathbb{P}(Y> s|\boldsymbol{X})}_{B}}}$

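where the blue part is associated with normal-sized claims, while large ones correspond to the red part. Both probabilities carry the label B because they come from the same regression, since $\mathbb{P}(Y> s|\boldsymbol{X}) = 1-\mathbb{P}(Y\leq s|\boldsymbol{X})$. It is then possible to run three regressions: one on normal-sized claims, one on large claims, and one on the indicator of having a large claim, given that a claim occurred. The code is something like the following, where a large claim is, here, one above 10,000 dollars (the threshold has to be fixed by the actuary)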

> s= 10000
> couts$normal=(couts$cout<=s)
> mean(couts$normal)
[1] 0.9818087

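so large claims represent about 2% of the claims in our dataset. We can run 3 sets of regressions, with smoothed regression on the age of the car. (The names of the R objects do not follow the letters of the formula: regA is the normal-sized cost A, regB is the large-claim cost C, and regC is the probability B.) The first one to model large claims individual costs,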

> indice = which(couts$cout>s)
> mean(couts$cout[indice])
[1] 34471.59
> library(splines)
> regB=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypB=predict(regB,newdata=data.frame(agevehicule=age),type="response")
> ypB2=mean(couts$cout[indice])

the second one to model normal claims individual costs,

> indice = which(couts$cout<=s)
> mean(couts$cout[indice])
[1] 1335.878
> regA=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypA=predict(regA,newdata=data.frame(agevehicule=age),type="response")
> ypA2=mean(couts$cout[indice])

And finally, a third one, on the probability of having a normal sized claim, given that a claim occurred

> regC=glm(normal~bs(agevehicule),data=couts,family=binomial)
> ypC=predict(regC,newdata=data.frame(agevehicule=age),type="response")
> regC2=glm(normal~1,data=couts,family=binomial)
> ypC2=predict(regC2,newdata=data.frame(agevehicule=age),type="response")
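As a quick sanity check on the versions without covariates, the three numbers printed above can be combined with the law of total expectation. The variable names below are introduced for this illustration only; the values are copied from the outputs shown earlier.

```r
# Values printed earlier in the post
p_normal    <- 0.9818087   # mean(couts$normal), also the fitted value of regC2
mean_normal <- 1335.878    # average cost of claims below s (ypA2)
mean_large  <- 34471.59    # average cost of claims above s (ypB2)

# Average cost of a claim, without any covariate
overall <- mean_normal * p_normal + mean_large * (1 - p_normal)
overall          # about 1939

# Share of the average cost that comes from large claims
mean_large * (1 - p_normal) / overall   # about 0.32
```

So the 2% of claims above the threshold account for roughly a third of the average cost, which is why the way they are handled matters for the premium.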

Note that each time we have something that can be interpreted either as $\mathbb{E}(Y|\boldsymbol{X},Y\gtrless s)$, or as $\mathbb{E}(Y|Y\gtrless s)$, i.e. no covariate is considered in the latter. On the graph below, we plot these predictions,

where Gamma regressions – with splines – are considered for the average costs, while logistic regressions – again with splines – are considered to model probabilities.

(But be careful with splines: on the borders, since we do not have a lot of observations, the behavior can be… odd, and adjustments should be made to obtain an adequate level of premium.) If it is legitimate to assume that normal-sized claims can be explained by some covariates, perhaps large claims (or extremely large ones) are just purely random, i.e. not a function of any covariate at all. That is,

$\mathbb{E}(Y|\boldsymbol{X}) = {\color{Blue} {\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq s)}_{A} \cdot \underbrace{\mathbb{P}(Y\leq s|\boldsymbol{X})}_{B}}} + {\color{Red} {\underbrace{\mathbb{E}(Y|Y> s)}_{C'} \cdot \underbrace{\mathbb{P}(Y> s|\boldsymbol{X})}_{B}}}$

To go one step further, it might also be possible to assume that not only the size of the claim (given that it is a large one) is not a function of any covariate, but neither is the probability of having an extremely large claim,

$\mathbb{E}(Y|\boldsymbol{X}) = {\color{Blue} {\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq s)}_{A} \cdot \underbrace{\mathbb{P}(Y\leq s)}_{B'}}} + {\color{Red} {\underbrace{\mathbb{E}(Y|Y> s)}_{C'} \cdot \underbrace{\mathbb{P}(Y> s)}_{B'}}}$
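The three decompositions can be turned into three premiums per age of the car. The sketch below is only an illustration on simulated claims (not the dataset of this post): the 2% share of large claims, the Gamma parameters and the mild age effect are assumptions, and the objects are named after the terms of the formulas (termA, termB, termC) rather than after the regressions fitted earlier.

```r
library(splines)  # provides bs()

# Simulated claims, for illustration only
set.seed(1)
n <- 5000
s <- 10000
sim <- data.frame(agevehicule = sample(0:20, n, replace = TRUE))
is_large   <- rbinom(n, 1, 0.02) == 1                 # hypothetical 2% of large claims
mean_small <- 1300 * exp(-0.01 * sim$agevehicule)     # hypothetical age effect
small <- pmin(rgamma(n, shape = 1.5, rate = 1.5 / mean_small), s)  # capped at s
big   <- s + rgamma(n, shape = 1, rate = 1 / 25000)                # always above s
sim$cout   <- ifelse(is_large, big, small)
sim$normal <- sim$cout <= s

new <- data.frame(agevehicule = 0:20)

# Term A: average cost of normal-sized claims, given the age of the car
fit_A <- glm(cout ~ bs(agevehicule), data = sim, subset = normal,
             family = Gamma(link = "log"))
termA <- predict(fit_A, newdata = new, type = "response")

# Term B: probability of a normal-sized claim, with covariate (B) and without (B')
fit_B <- glm(normal ~ bs(agevehicule), data = sim, family = binomial)
termB <- predict(fit_B, newdata = new, type = "response")
B0 <- mean(sim$normal)

# Term C: average cost of large claims, with covariate (C) and without (C')
fit_C <- glm(cout ~ bs(agevehicule), data = sim, subset = !normal,
             family = Gamma(link = "log"))
termC <- predict(fit_C, newdata = new, type = "response")
C0 <- mean(sim$cout[!sim$normal])

premium1 <- termA * termB + termC * (1 - termB)  # covariates in all three terms
premium2 <- termA * termB + C0 * (1 - termB)     # cost of large claims shared by all
premium3 <- termA * B0 + C0 * (1 - B0)           # probability of a large claim shared too

round(cbind(age = new$agevehicule, premium1, premium2, premium3))
```

In each version the probability of a large claim is one minus the probability of a normal-sized one, so a single logistic regression (or a single proportion) feeds both parts of the sum.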

From the first part, we’ve seen that the distribution considered had an impact on the prediction, and in the second, we’ve seen that the definition of large claims (and how to deal with them) also has an impact. So clearly, actuaries have some leverage when working on ratemaking…

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.