(This article was first published on **Freakonometrics » R-english**, and kindly contributed to R-bloggers)

During the course, we have seen that it is natural to assume that not only the individual claims frequency can be explained by some covariates, but individual costs too. Of course, appropriate families should be considered to model the distribution of the cost, given some covariates. Here is the dataset we’ll use,

```r
> sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+ header=TRUE,sep=";")
> sinistres=sinistre[sinistre$garantie=="1RC",]
> sinistres=sinistres[sinistres$cout>0,]
> contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+ header=TRUE,sep=";")
> couts=merge(sinistres,contrat)
> tail(couts)
     nocontrat    no garantie    cout exposition zone puissance agevehicule
1919   6104006 11933      1RC 5376.04       0.37    E         6           1
1920   6107355 12349      1RC   51.63       0.74    E         4           1
1921   6108364 13229      1RC 1320.00       0.74    B         9           1
1922   6109171 11567      1RC 1320.00       0.74    B        13           1
1923   6111208 14161      1RC  970.20       0.49    E        10           5
1924   6111650 14476      1RC 1940.40       0.48    E         4           0
     ageconducteur bonus marque carburant densite region
1919            32    57     12         E      93     10
1920            45    57     12         E      72     10
1921            32   100     12         E      83      0
1922            56    50     12         E      93     13
1923            30    90     12         E      53      2
1924            69    50     12         E      93     13
```

Here, each line is a claim. Usual families to model the cost are the Gamma distribution, the inverse Gaussian, or the lognormal distribution (which is not in the exponential family, but one can assume that the logarithm of the cost can be modeled with a Gaussian distribution). Consider here only one covariate, e.g. the age of the car, and two different models: a Gamma one, and a lognormal one.

```r
> age=0:20
> reggamma.sp <- glm(cout~agevehicule,family=Gamma(link="log"),
+ data=couts)
> Pgamma <- predict(reggamma.sp,newdata=data.frame(agevehicule=age),type="response")
```

For the Gamma regression, it is a simple GLM, so it is not difficult. For a lognormal distribution, one should remember that the expected value of a lognormal distribution is not the exponential of the mean of the underlying Gaussian distribution. A correction should be made here, to get an (approximately) unbiased estimator for the average cost,

```r
> reglm.sp <- lm(log(cout)~agevehicule,data=couts)
> sigma <- summary(reglm.sp)$sigma
> mu <- predict(reglm.sp,newdata=data.frame(agevehicule=age))
> Pln <- exp(mu+sigma^2/2)
```
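To see what the correction does, here is a minimal simulation sketch. It does not use the `couts` data: the values of `mu` and `sigma` below are arbitrary assumptions, chosen only to illustrate that the mean of a lognormal variable is $\exp(\mu+\sigma^2/2)$ rather than $\exp(\mu)$.

```r
# Simulated check of the lognormal mean correction (illustrative values only).
set.seed(1)
mu    <- 7     # hypothetical mean of log(cost)
sigma <- 1     # hypothetical standard deviation of log(cost)
z     <- rnorm(1e6, mean = mu, sd = sigma)   # simulated log-costs
cost  <- exp(z)                              # lognormal costs

naive     <- exp(mu)                # ignores the variance term
corrected <- exp(mu + sigma^2 / 2)  # expected value of a lognormal variable
empirical <- mean(cost)             # Monte Carlo estimate of the mean cost

# Compare the three quantities side by side
round(c(naive = naive, corrected = corrected, empirical = empirical))
```

With estimated parameters, as in `Pln` above, the corrected prediction is only approximately unbiased, since `sigma` is itself an estimate.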

We can plot those two predictions on a single graph,

```r
> plot(age,Pgamma,xlab="",ylab="",col="red",type="b",pch=4)
> lines(age,Pln,col="blue",type="b")
```

Here it is,

Observe that it is also possible to use splines, since there might be no reason for the age to appear here in a multiplicative way.
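A spline version of the Gamma regression could look like the sketch below. It runs on simulated data rather than on `couts`: the data frame `sim`, the sine-shaped age effect and the Gamma shape parameter are all assumptions made for illustration, and only `bs()` from the `splines` package and the `glm()` call mirror the approach described here.

```r
library(splines)  # ships with R; provides bs()

# Simulated stand-in for the claims data (NOT the couts data set).
set.seed(42)
n   <- 2000
sim <- data.frame(agevehicule = sample(0:20, n, replace = TRUE))
# Hypothetical non-monotone effect of the age of the car on the mean cost
m        <- exp(7 + 0.3 * sin(sim$agevehicule / 4))
sim$cout <- rgamma(n, shape = 2, rate = 2 / m)  # Gamma costs with mean m

age <- 0:20
# Log-linear (i.e. multiplicative) effect of the age
reg.lin <- glm(cout ~ agevehicule, family = Gamma(link = "log"), data = sim)
# Cubic B-spline effect of the age
reg.bs  <- glm(cout ~ bs(agevehicule), family = Gamma(link = "log"), data = sim)

P.lin <- predict(reg.lin, newdata = data.frame(agevehicule = age), type = "response")
P.bs  <- predict(reg.bs,  newdata = data.frame(agevehicule = age), type = "response")

# Spline prediction in red, log-linear prediction in blue
plot(age, P.bs, type = "b", col = "red", pch = 4, xlab = "", ylab = "")
lines(age, P.lin, type = "b", col = "blue")
```

The same `bs(agevehicule)` term can be used in the `lm()` call on `log(cout)` for the lognormal model.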

Here, the two models are rather close. Nevertheless, one should remember that the Gamma model can be extremely sensitive to *large* claims (I mean here *really large* claims). On the other hand, with the log-transformation, the lognormal model seems to be less sensitive to large events. Actually, if I use the complete dataset, the regressions are the following,

i.e. with a lognormal distribution, the average cost is *decreasing* with the age of the car, while it is increasing with a Gamma model. The main reason here is that there is one large (not to say *huge*) claim in the dataset,

```r
> couts[which.max(couts$cout),]
        cout exposition zone puissance agevehicule ageconducteur
7842 4024601       0.22    B         9          13            19
     marque carburant densite region
7842      2         E      93     24
```

One young driver got a $ 4 million claim, with a 13-year-old car. This is an outlier for the Gamma regression, which clearly influences the estimation (the second largest claim is only one third of this one). Since there is a clear influence of large claims on the estimation of the average cost, a natural idea might be to remove those large claims. Or perhaps to see them as different from *normal* claims: *normal* claims can be explained by some covariates, but perhaps the cost of those large claims should be shared not only within their own class, but among all the insured in the portfolio. To formalize this idea, observe that we can write
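The decomposition in question is presumably the law of total expectation, split at a threshold $s$. A reconstruction under that assumption, with notation chosen here ($Y$ for the cost of a claim, $\boldsymbol{X}$ for the covariates):

$$
\mathbb{E}(Y \mid \boldsymbol{X})
= \underbrace{\mathbb{E}(Y \mid \boldsymbol{X}, Y \le s)\cdot \mathbb{P}(Y \le s \mid \boldsymbol{X})}_{\text{normal-sized claims}}
+ \underbrace{\mathbb{E}(Y \mid \boldsymbol{X}, Y > s)\cdot \mathbb{P}(Y > s \mid \boldsymbol{X})}_{\text{large claims}}
$$

Three ingredients appear: the average cost of a normal-sized claim, the average cost of a large claim, and the probability that a claim is normal-sized (its complement giving the probability of a large one).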

where the blue part is associated to *normal-sized* claims, while *large* ones correspond to the red part. It is then possible to run three regressions: one on *normal-sized* claims, one on *large* claims, and one on the indicator of having a *large* claim, given that a claim occurred. The code here is something like this: a *large* claim, here, is above $ 10,000 (one has to fix it)

```r
> s= 10000
> couts$normal=(couts$cout<=s)
> mean(couts$normal)
[1] 0.9818087
```

so *large* claims represent about 2% of the claims in our dataset. We can run three sets of regressions, with a smoothed regression on the age of the car. The first one to model the individual costs of large claims,

```r
> indice = which(couts$cout>s)
> mean(couts$cout[indice])
[1] 34471.59
> library(splines)
> regB=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypB=predict(regB,newdata=data.frame(agevehicule=age),type="response")
> ypB2=mean(couts$cout[indice])
```

the second one to model normal claims individual costs,

```r
> indice = which(couts$cout<=s)
> mean(couts$cout[indice])
[1] 1335.878
> regA=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypA=predict(regA,newdata=data.frame(agevehicule=age),type="response")
> ypA2=mean(couts$cout[indice])
```

And finally, a third one, on the probability of having a normal-sized claim, given that a claim occurred,

```r
> regC=glm(normal~bs(agevehicule),data=couts,family=binomial)
> ypC=predict(regC,newdata=data.frame(agevehicule=age),type="response")
> regC2=glm(normal~1,data=couts,family=binomial)
> ypC2=predict(regC2,newdata=data.frame(agevehicule=age),type="response")
```

Note that we have, each time, two versions of the same quantity: one that depends on the age of the car, and one that does not, i.e. no covariate is considered in the latter. On the graph below, we plotted the version in which all three components depend on the age of the car, where Gamma regressions, with splines, are considered for the average costs, while logistic regressions, again with splines, are considered to model probabilities.

(But careful with splines: at the borders, since we do not have a lot of observations, the behavior can be… odd, and adjustments should be made to obtain an adequate level of premium.) If it is legitimate to assume that *normal-sized* claims can be explained by some covariates, perhaps *large* claims (or extremely large ones) are just purely random, i.e. not a function of any covariate at all.
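Writing $Y$ for the cost of a claim, $\boldsymbol{X}$ for the covariates and $s$ for the threshold (notation assumed here), this presumably amounts to dropping $\boldsymbol{X}$ from the average cost of large claims only:

$$
\mathbb{E}(Y \mid \boldsymbol{X})
= \mathbb{E}(Y \mid \boldsymbol{X}, Y \le s)\cdot \mathbb{P}(Y \le s \mid \boldsymbol{X})
+ \mathbb{E}(Y \mid Y > s)\cdot \mathbb{P}(Y > s \mid \boldsymbol{X})
$$

In terms of the code above, the spline prediction `ypB` would be replaced by the constant `ypB2`.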

To go one step further, it might also be possible to assume that not only is the size of the claim (given that it is a large one) not a function of any covariate, but neither is the *probability* of having an extremely large claim.
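The three variants discussed here can be compared with a sketch along the following lines. It runs on simulated data rather than on `couts`: the data frame `sim`, the 2% mixture and the Gamma parameters are assumptions made for illustration, and the names `premium1`, `premium2` and `premium3` are hypothetical. The object names `regA`, `regB`, `regC`, `ypA`, `ypB`, `ypC`, `ypB2` and `ypC2` follow the code above.

```r
library(splines)  # ships with R; provides bs()

# Simulated stand-in for the claims data (NOT the couts data set).
set.seed(123)
n   <- 5000
sim <- data.frame(agevehicule = sample(0:20, n, replace = TRUE))
s   <- 10000                                   # threshold, as in the text
# Hypothetical mixture: about 2% large claims, the rest normal-sized
large    <- rbinom(n, 1, 0.02) == 1
sim$cout <- ifelse(large,
                   s + rgamma(n, shape = 1, rate = 1 / 25000),
                   pmin(rgamma(n, shape = 1.5, rate = 1.5 / 1300), s))
sim$normal <- sim$cout <= s

age <- data.frame(agevehicule = 0:20)
gam <- Gamma(link = "log")

# A: average cost of normal-sized claims, B: average cost of large claims
regA <- glm(cout ~ bs(agevehicule), family = gam, data = sim, subset = normal)
regB <- glm(cout ~ bs(agevehicule), family = gam, data = sim, subset = !normal)
# C: probability that a claim is normal-sized, given that a claim occurred
regC <- glm(normal ~ bs(agevehicule), family = binomial, data = sim)

ypA  <- predict(regA, newdata = age, type = "response")
ypB  <- predict(regB, newdata = age, type = "response")
ypC  <- predict(regC, newdata = age, type = "response")
ypB2 <- mean(sim$cout[!sim$normal])  # large claims, no covariate
ypC2 <- mean(sim$normal)             # same fit as an intercept-only logistic model

premium1 <- ypA * ypC  + ypB  * (1 - ypC)   # every component depends on the age
premium2 <- ypA * ypC  + ypB2 * (1 - ypC)   # large-claim cost is a constant
premium3 <- ypA * ypC2 + ypB2 * (1 - ypC2)  # so is the probability of a large claim

matplot(0:20, cbind(premium1, premium2, premium3), type = "b", pch = 1:3,
        xlab = "", ylab = "")
```

In the third variant, the age of the car enters only through the average cost of normal-sized claims, so the cost of large claims is spread evenly over the whole portfolio.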

From the first part, we’ve seen that the distribution considered had an impact on the prediction, and in the second, we’ve seen that the definition of *large* claims (and how to deal with them) also has an impact. So clearly, actuaries have some leverage when working on ratemaking…
