Exposure as a possible explanatory variable

August 13, 2013
By Arthur Charpentier

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In insurance pricing, the exposure is usually used as an offset variable to model claims frequency. As explained many times on this blog (e.g. here), and in my notes, if we have two identical drivers, one with an exposure of 6 months and the other with an exposure of one year, it is natural to assume that, on average, the second driver will have twice as many accidents. This is the motivation for using a standard (homogeneous) Poisson process to model claim frequency. One can also see a legal issue here, since a (partial) reimbursement of a premium would be done pro rata temporis. The risk is proportional to the exposure. Thus, if $N_i$ denotes the number of claims of insured $i$, with characteristics $\boldsymbol{X}_i$ and exposure $E_i$, with a Poisson regression, we would write

$$N_i \sim \mathcal{P}\left(E_i \cdot \exp[\boldsymbol{X}_i^\top\boldsymbol{\beta}]\right)$$

or equivalently

$$N_i \sim \mathcal{P}\left(\exp[\boldsymbol{X}_i^\top\boldsymbol{\beta} + \log E_i]\right)$$
In this expression, the logarithm of the exposure enters as an explanatory variable, but with no coefficient to estimate (the coefficient is fixed at one). Can’t we use the exposure as an ordinary explanatory variable? Will we get a unit parameter?
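To make the question precise, one can let the logarithm of the exposure carry its own coefficient. This is only a sketch: $N_i$ is the number of claims of insured $i$, $E_i$ the exposure, $\boldsymbol{X}_i$ the characteristics, and $\gamma$ is a symbol introduced here for illustration (it is not part of the original notation).

```latex
N_i \sim \mathcal{P}\left(\exp[\boldsymbol{X}_i^\top\boldsymbol{\beta} + \gamma \log E_i]\right)
    = \mathcal{P}\left(E_i^{\gamma} \cdot \exp[\boldsymbol{X}_i^\top\boldsymbol{\beta}]\right)
```

The offset model is the special case $\gamma = 1$, where the expected number of claims is proportional to the exposure. With $\gamma < 1$ the expected number of claims grows less than proportionally with the exposure, and with $\gamma > 1$ more than proportionally.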

Of course, in the context of ratemaking, it is probably not a relevant question, since actuaries are required to predict annual claim frequency (insurance contracts are supposed to provide one year of coverage). But it might be interesting for getting a better understanding of why people might be leaving our portfolio (i.e. cancelling their insurance policy before term, or not renewing it).

To be more specific, consider the following model: claims arrive according to a Poisson process, and people are loyal to their insurance company (they never leave). Let us generate scenarios over twenty years

> n=983
> D1=as.Date("01/01/1993",'%d/%m/%Y')
> D2=as.Date("31/12/2013",'%d/%m/%Y')
> L=D1+0:(D2-D1)
> set.seed(1)
> arrival=sample(L,size=n,replace=TRUE)
> exposure=N=rep(NA,n)
> departure=rep(D2,n)
> set.seed(2)
> for(i in 1:n){
+   expo=D2-arrival[i]
+   w=0
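+   # draw claim dates (in days since arrival) until the end of the window;
+   # waiting times between claims are exponential, with mean 1000 days
+   while(max(w)<expo) w=c(w,max(w)+rexp(1,1/1000))
+   exposure[i]=departure[i]-arrival[i]
+   N[i]=max(0,length(w)-2)}
> df=data.frame(N=N,E=exposure/365)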
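The loop above builds each claims history from exponential waiting times. As a minimal sketch (the helper `count_claims` and its arguments are hypothetical, not part of the original code), one can check that counting such arrivals on a fixed window gives counts whose mean and variance are both close to the window length divided by the mean waiting time, as expected for a Poisson distribution:

```r
# Number of claims of a homogeneous Poisson process on [0, expo] (in days),
# built from exponential waiting times with the given mean (hypothetical helper).
count_claims <- function(expo, mean_gap = 1000) {
  t <- 0
  k <- 0
  repeat {
    t <- t + rexp(1, rate = 1 / mean_gap)  # date of the next claim
    if (t > expo) break                    # claim falls outside the window
    k <- k + 1
  }
  k
}

set.seed(1)
expo <- 3650                               # a ten-year window, in days
counts <- replicate(5000, count_claims(expo))

# For a Poisson count, mean and variance should both be near expo / 1000
c(mean = mean(counts), variance = var(counts), theory = expo / 1000)
```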

Here the expected time between claims is considered to be 1000 days. The (annual) intensity of the Poisson process is here

> 365/1000
[1] 0.365

so if we run a Poisson regression on the logarithm of the exposure (please feel free to add other covariates if you want, the example here is just to see what could happen when exposure is considered as a standard covariate), we should get an intercept close to

> log(365/1000)
[1] -1.007858
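With an intercept only and the logarithm of the exposure as an offset, the Poisson maximum likelihood estimate has a closed form: the exponential of the intercept is the total number of claims divided by the total exposure. A minimal sketch on hypothetical simulated data (the exposures below are drawn uniformly for illustration, they are not the ones generated above):

```r
set.seed(123)
n <- 1000
E <- runif(n, 0.1, 20)          # hypothetical exposures, in years
lambda <- 365 / 1000            # annual intensity, as in the text
N <- rpois(n, lambda * E)       # pure Poisson counts, proportional to exposure

reg <- glm(N ~ 1 + offset(log(E)), family = poisson)

beta0_glm    <- unname(coef(reg)[1])
beta0_closed <- log(sum(N) / sum(E))   # closed-form maximum likelihood estimate

c(glm = beta0_glm, closed_form = beta0_closed, target = log(lambda))
```

The first two values agree up to numerical tolerance, and both should be close to the target, up to sampling noise.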

Here, the regression on a constant, with the offset variable is

> reg=glm(N~1+offset(log(E)),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ 1 + offset(log(E)), family = poisson, data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.4145  -0.4673   0.2367   0.8770   3.6828

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.04233    0.02532  -41.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1116.9  on 982  degrees of freedom
Residual deviance: 1116.9  on 982  degrees of freedom
AIC: 3282.9

Number of Fisher Scoring iterations: 5

which is consistent with what we just said. If we run the regression with the logarithm of the exposure as a possible explanatory variable, we would expect to have a coefficient close to 1. And indeed…

> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E), family = poisson, data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.0810  -0.8373  -0.1493   0.5676   3.9001

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.03350    0.08546  -12.09   <2e-16 ***
log(E)       1.00920    0.03292   30.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2553.6  on 982  degrees of freedom
Residual deviance: 1064.2  on 981  degrees of freedom
AIC: 3762.7

Number of Fisher Scoring iterations: 5

If we keep the offset, and add the variable, we can see that it becomes useless (which is a test of a unit parameter, somehow)

> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson,
data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.0810  -0.8373  -0.1493   0.5676   3.9001

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.033503   0.085460 -12.093   <2e-16 ***
log(E)       0.009201   0.032920   0.279     0.78
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1064.3  on 982  degrees of freedom
Residual deviance: 1064.2  on 981  degrees of freedom
AIC: 3762.7

Number of Fisher Scoring iterations: 5

Here, we do have pure Poisson processes, so exposure is crucial, since the parameter of the Poisson distribution is proportional to the exposure. But we cannot learn anything else from the exposure.
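The "test of a unit parameter" can be made a little more explicit. The sketch below uses hypothetical simulated data (not the data frame built above) and compares the offset-only model with the model that adds `log(E)` on top of the offset, using a likelihood ratio test. It also shows that the coefficient obtained with the offset is simply the free coefficient minus one.

```r
set.seed(2013)
n <- 2000
E <- runif(n, 0.05, 20)                  # hypothetical exposures, in years
N <- rpois(n, 0.365 * E)                 # pure Poisson: the true coefficient is 1
df_sim <- data.frame(N = N, E = E)

reg_offset <- glm(N ~ 1 + offset(log(E)), data = df_sim, family = poisson)
reg_both   <- glm(N ~ log(E) + offset(log(E)), data = df_sim, family = poisson)
reg_free   <- glm(N ~ log(E), data = df_sim, family = poisson)

# Likelihood ratio test of H0: the coefficient of log(E) is equal to 1
lrt <- anova(reg_offset, reg_both, test = "Chisq")
print(lrt)

# Wald 95% confidence interval for the free coefficient
est <- coef(summary(reg_free))["log(E)", ]
ci <- est["Estimate"] + c(-1, 1) * qnorm(0.975) * est["Std. Error"]
ci

# The two parametrisations describe the same fit
gap <- unname(coef(reg_both)["log(E)"] + 1 - coef(reg_free)["log(E)"])
gap
```

The last value is zero up to numerical tolerance, which is why the two summaries above share the same deviance residuals and the same AIC.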

Consider some real data.

> head(baseFREQ)
nocontrat exposition zone puissance agevehicule
1        27       0.87    C         7           0
2       115       0.72    D         5           0
3       121       0.05    C         6           0
4       142       0.90    C        10          10
5       155       0.12    C         7           0
6       186       0.83    C         5           0
ageconducteur bonus marque carburant densite region nbre
1            56    50     12         D      93     13    0
2            45    50     12         E      54     13    0
3            37    55     12         D      11     13    0
4            42    50     12         D      93     13    0
5            59    50     12         E      73     13    0
6            75    50     12         E      42     13    0

What do we get if we consider a Poisson regression on the logarithm of the exposure?

> reg=glm(nbre~log(exposition),data=baseFREQ,family=poisson)
> summary(reg)

Call:
glm(formula = nbre ~ log(exposition), family = poisson, data = baseFREQ)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-0.3988  -0.3388  -0.2786  -0.1981  12.9036

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.83045    0.02822 -100.31   <2e-16 ***
log(exposition)  0.53950    0.02905   18.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 12931  on 49999  degrees of freedom
Residual deviance: 12475  on 49998  degrees of freedom
AIC: 16150

Number of Fisher Scoring iterations: 6

What happens if we keep the offset and also add the exposure as a covariate? (Let us use a nonparametric transformation, to visualize what’s going on.)

> library(gam)
> reg=gam(nbre~offset(log(exposition))+s(exposition),data=baseFREQ,family=poisson)
> plot(reg,se=TRUE)

There is a clear and significant effect. The longer the insured stay, the less likely they are to have a claim. Actually, this can be observed without running a regression.

> i1=which(baseFREQ$nbre>0)
> i0=which(baseFREQ$nbre==0)
> h1=hist(baseFREQ$exposition[i1],probability=TRUE)
> h0=hist(baseFREQ$exposition[i0],probability=TRUE)
> plot(h1$mids,h1$density,type='s',lwd=2,col="red")
> lines(h0$mids,h0$density,type='s',col='blue',lwd=2)

In blue, we have the density of the exposure for those who did not have claims, and in red, the density for those who had one claim (or more).
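Another way to look at this without a regression is to compute the annualized claim frequency (total claims over total exposure) by exposure band. The sketch below runs on hypothetical simulated data that only mimics the column names `exposition` and `nbre`. Under a pure Poisson process the frequency should be roughly flat across bands, so a clear trend on real data would point to the kind of effect seen above.

```r
set.seed(7)
n <- 50000
lambda <- 0.07                                      # hypothetical annual frequency
sim <- data.frame(exposition = runif(n, 0.01, 1))   # hypothetical exposures, in years
sim$nbre <- rpois(n, lambda * sim$exposition)       # pure Poisson claim counts

# Annualized frequency by exposure band: total claims / total exposure
band <- cut(sim$exposition, breaks = c(0, 0.25, 0.5, 0.75, 1))
freq <- tapply(sim$nbre, band, sum) / tapply(sim$exposition, band, sum)
round(freq, 3)
```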

So here, we cannot assume a unit value for the parameter. What does that mean? Can we reproduce such a behavior?

In order to get a better understanding, consider two possible behaviors for the insured. The first one will be the following: if the company does not offer substantial discounts after several years with no claims, the insured might leave the company. For instance, if the insured has no claim during 5 years, then after 5 years, he will leave the company (to get a better price somewhere else, say). The code will be

> for(i in 1:n){
+   expo=D2-arrival[i]
+   w=c(0,0)
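+   # draw claims until the window ends or a claim-free spell exceeds 1500 days
+   while((max(w)<expo)&(max(diff(w))<1500)) w=c(w,max(w)+rexp(1,1/1000))
+   if(max(diff(w))>1500) departure[i]=arrival[i]+max(w[-length(w)])+1500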
+   exposure[i]=departure[i]-arrival[i]
+   N[i]=max(0,length(w)-3)}
> df=data.frame(N=N,E=exposure/365)

Here, I consider 1500 days instead of 5 years, but it is the same idea. So, what do we have here?

> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E), family = poisson, data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5684  -0.9668  -0.2321   0.4244   3.6265

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.50844    0.10286  -24.39   <2e-16 ***
log(E)       1.65738    0.04494   36.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2567.31  on 982  degrees of freedom
Residual deviance:  885.71  on 981  degrees of freedom
AIC: 2897.9

Here, the coefficient is (significantly) larger than 1. More precisely,

> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson,
data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5684  -0.9668  -0.2321   0.4244   3.6265

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.50844    0.10286  -24.39   <2e-16 ***
log(E)       0.65738    0.04494   14.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1114.24  on 982  degrees of freedom
Residual deviance:  885.71  on 981  degrees of freedom
AIC: 2897.9

There is clearly a bias here: people staying long are more likely to have an accident. This is consistent with our story, since low-risk clients (those without claims) left.

The second behavior will be the following: sometimes, the insured are not satisfied with the way claims are handled, and they might leave after the first claim. Consider the case where, after one claim, it is likely (e.g. with probability 50%) that the insured leaves the company. Alternatively, instead of assuming that the insured did not like the claims management, consider the case where the car is so damaged that he cannot drive it anymore, so it would be useless to keep paying an insurance premium. The code here will be

> for(i in 1:n){
+   expo=D2-arrival[i]
+   w=0
+   stay=TRUE
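+   # draw claims inside the window; after each one, leave with probability 1/2
+   while((max(w)<expo)&stay){
+     w=c(w,max(w)+rexp(1,1/1000))
+     if(max(w)<expo) stay=sample(c(TRUE,FALSE),size=1)}
+   departure[i]=if(stay) D2 else arrival[i]+max(w)
+   exposure[i]=departure[i]-arrival[i]
+   N[i]=sum(w>0&w<expo)}
> df=data.frame(N=N,E=exposure/365)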

Here, after each claim, the insured tosses a coin to decide whether or not he cancels the contract.

> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E), family = poisson, data = df)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.28402  -0.47763  -0.08215   0.33819   2.37628

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.09920    0.04251   2.334   0.0196 *
log(E)       0.30640    0.02511  12.203   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 666.92  on 982  degrees of freedom
Residual deviance: 498.29  on 981  degrees of freedom
AIC: 2666.3

This time, the parameter is (again significantly) smaller than one.

> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)

Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson,
data = df)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.28402  -0.47763  -0.08215   0.33819   2.37628

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.09920    0.04251   2.334   0.0196 *
log(E)      -0.69360    0.02511 -27.625   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1116.87  on 982  degrees of freedom
Residual deviance:  498.29  on 981  degrees of freedom
AIC: 2666.3

The story is now rather different, since those who stay long cannot have encountered many opportunities to leave. So clearly, they did not have many claims. If someone has a long exposure, the negative sign in the output above means that he should have fewer claims, on average, than his exposure alone would suggest.
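One way to see why the number of claims no longer grows in proportion to the exposure: if the end of the observation window is ignored, the number of claims an insured has before leaving follows a geometric distribution, whose mean depends only on the probability of leaving and not on the time spent in the portfolio. A minimal sketch, where `claims_before_leaving` and `p_leave` are hypothetical names introduced for illustration:

```r
set.seed(42)
p_leave <- 0.5          # probability of leaving after a claim, as in the text

# Number of claims an insured experiences before cancelling the contract,
# ignoring the end of the observation window (hypothetical helper)
claims_before_leaving <- function(p) {
  k <- 0
  repeat {
    k <- k + 1                  # a claim occurs
    if (runif(1) < p) break     # coin toss: leave after this claim
  }
  k
}

sims <- replicate(20000, claims_before_leaving(p_leave))
mean(sims)                      # should be close to 1 / p_leave
```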

As we can see, those models produce rather different outputs. Note that many more interpretations are possible. For instance, depending on the way the data were extracted, we might have

• all policies observed, over those twenty years,
• all policies in force at some specific date, until now
• all policies in force at some specific date, until one year after
• all policies in force now

So far, we have been using the first method, but the other ones will yield different interpretations, e.g. because of survivor bias. But that’s another story… And one can read Boucher and Denuit (2008) to go further.
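To illustrate how the extraction method changes the sample, here is a minimal sketch of selecting the policies in force at a given date from arrival and departure dates. The data frame `policies` and the helper `in_force` are hypothetical, made up for illustration only.

```r
# Hypothetical portfolio: one row per policy, with arrival and departure dates
policies <- data.frame(
  id        = 1:4,
  arrival   = as.Date(c("1995-03-01", "2001-07-15", "2010-01-10", "2012-06-30")),
  departure = as.Date(c("2000-02-28", "2013-12-31", "2011-05-01", "2013-12-31"))
)

# Policies in force at a given date: already arrived, and not yet departed
in_force <- function(df, date) {
  df[df$arrival <= date & df$departure >= date, ]
}

ids_2011 <- in_force(policies, as.Date("2011-01-01"))$id
ids_now  <- in_force(policies, as.Date("2013-12-31"))$id

ids_2011   # policies 2 and 3
ids_now    # policies 2 and 4
```

Policy 1, which left early, appears in neither extraction, which is one way a survivor bias can enter the data.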

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1.  Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.
